Analyzing Random Permutations for Cyclic Coordinate Descent


Stephen J. Wright Computer Sciences Department, University of Wisconsin-Madison, Madison, WI. swright@cs.wisc.edu  and  Ching-pei Lee Computer Sciences Department, University of Wisconsin-Madison, Madison, WI. ching-pei@cs.wisc.edu
July 3, 2019
Abstract.

We consider coordinate descent methods on convex quadratic problems, in which exact line searches are performed at each iteration. (This algorithm is identical to Gauss-Seidel on the equivalent symmetric positive definite linear system.) We describe a class of convex quadratic problems for which the random-permutations version of cyclic coordinate descent (RPCD) outperforms the standard cyclic coordinate descent (CCD) approach. A convergence analysis is developed to support the computational observations.

Key words and phrases:
Coordinate descent, Gauss-Seidel, randomization, permutations
2010 Mathematics Subject Classification:
Primary 65F10; Secondary 90C25, 68W20
This work was supported by NSF Award IIS-1447449, ONR Award N00014-13-1-0129, AFOSR Award FA9550-13-1-0138, and Subcontract 3F-30222 from Argonne National Laboratory.

1. Introduction

The coordinate descent (CD) approach for solving the problem

(1.1)   \min_{x \in \mathbb{R}^n} f(x)

follows the framework of Algorithm 1. Following [2], we denote

(1.2)   \nabla_i f(x) := [\nabla f(x)]_i e_i, \quad i = 1, 2, \dotsc, n,

where the single nonzero in \nabla_i f(x) appears in position i. Epochs (indicated by the counter \ell) encompass cycles of n inner iterations (indicated by j). At each iteration k, one component i of x is selected for updating; a steplength parameter \alpha_k is applied to the negative gradient of f with respect to this component.

  Set k \leftarrow 0; Choose x^0 \in \mathbb{R}^n;
  for \ell = 0, 1, 2, \dotsc do
     for j = 0, 1, \dotsc, n-1 do
        Define k := \ell n + j;
        Choose index i = i(\ell, j) \in \{1, 2, \dotsc, n\};
        Choose steplength \alpha_k > 0;
        x^{k+1} \leftarrow x^k - \alpha_k \nabla_i f(x^k);
     end for
  end for
Algorithm 1 Coordinate Descent

The choice of coordinate i(\ell, j) to be updated at inner iteration j of epoch \ell differs between variants of CD, as follows:

  • For “cyclic CD” (CCD), we choose i(\ell, j) = j + 1.

  • For “fully randomized CD,” also known as “stochastic CD,” and abbreviated as RCD, we choose i(\ell, j) uniformly at random from \{1, 2, \dotsc, n\} and independently at each iteration.

  • For “random-permutations CD” (abbreviated as RPCD), we choose \sigma_\ell at the start of epoch \ell to be a random permutation of the index set \{1, 2, \dotsc, n\}, then set i(\ell, j) to be the (j+1)-th entry in \sigma_\ell, for j = 0, 1, \dotsc, n-1.

Note that x^{\ell n} denotes the value of x after \ell epochs.
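To make the three selection rules concrete, the following NumPy sketch implements one epoch of each variant, anticipating the exact-line-search quadratic setting adopted below. The function names and the test matrix are our own illustrative choices, not from the paper.

  import numpy as np

  def cd_epoch(A, x, order):
      # One epoch of CD on f(x) = 0.5 x^T A x with exact line search:
      # for coordinate i, the exact step is (Ax)_i / A_ii, which zeroes
      # the i-th gradient component (the Gauss-Seidel update).
      for i in order:
          x[i] -= A[i, :] @ x / A[i, i]
      return x

  def epoch_order(variant, n, rng):
      # Index-selection rules for the three variants described above.
      if variant == "CCD":
          return np.arange(n)                 # fixed cycle 1, ..., n
      if variant == "RCD":
          return rng.integers(0, n, size=n)   # i.i.d. uniform indices
      return rng.permutation(n)               # RPCD: fresh permutation per epoch

  # Example run on a matrix of the form (1.10).
  n, delta = 100, 0.01
  A = delta * np.eye(n) + (1 - delta) * np.ones((n, n))
  for variant in ("CCD", "RCD", "RPCD"):
      rng = np.random.default_rng(0)
      x = rng.standard_normal(n)
      for _ in range(30):
          x = cd_epoch(A, x, epoch_order(variant, n, rng))
      print(variant, 0.5 * x @ A @ x)   # f after 30 epochs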

We consider in this paper problems in which f is a strictly convex quadratic, that is,

(1.3)   f(x) = \tfrac{1}{2} x^T A x,

with A symmetric positive definite. As we saw in [2], even this restricted class of functions reveals significant diversity in convergence behavior between the three variants of CD described above. The minimizer of (1.3) is obviously x^* = 0. We assume that the choice of \alpha_k in Algorithm 1 is the exact minimizer of f along the chosen coordinate direction. The resulting approach is thus equivalent to the Gauss-Seidel method applied to the linear system Ax = 0. The variants CCD, RCD, and RPCD can be interpreted as different cyclic / randomized variants of Gauss-Seidel for Ax = 0.
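To make the Gauss-Seidel equivalence explicit, here is the routine calculation of the exact steplength along coordinate i for the quadratic (1.3); it is a standard computation, consistent with the update used in the sketch above.

  % Minimizing phi(alpha) := f(x - alpha e_i) = f(x) - alpha (Ax)_i + (alpha^2 / 2) A_{ii}
  % gives alpha_k = (Ax)_i / A_{ii}, so the update of the i-th component is
  x_i \;\leftarrow\; x_i - \frac{(Ax)_i}{A_{ii}},
  % which is precisely the Gauss-Seidel update for the i-th equation of Ax = 0.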

In the RPCD variant, we can express a single epoch as follows. Letting P_\sigma be the permutation matrix corresponding to the permutation \sigma on this epoch, we split the symmetrically permuted Hessian into strictly triangular and diagonal parts as follows:

(1.4)   P_\sigma^T A P_\sigma = L_\sigma + D_\sigma + L_\sigma^T,

where L_\sigma is strictly lower triangular and D_\sigma is diagonal. We then define

(1.5)   T_\sigma := -(L_\sigma + D_\sigma)^{-1} L_\sigma^T,

so that the epoch indexed by \ell can be written as follows:

(1.6)   x^{(\ell+1)n} = P_{\sigma_\ell} T_{\sigma_\ell} P_{\sigma_\ell}^T x^{\ell n},

where P_{\sigma_\ell} denotes the matrix corresponding to permutation \sigma_\ell.

where denotes the matrix corresponding to permutation . By recursing to the initial point , we obtain after epochs that

(1.7)

yielding a function value of

(1.8)

We analyze convergence in terms of the expected value of after epochs, with the expectation taken over the initial point and the permutations , that is,

(1.9)
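As a numerical sanity check on (1.4)-(1.6), the following sketch (our own construction, assuming the convention that the permutation matrix satisfies P e_j = e_{\sigma(j)}) verifies that one epoch of exact-line-search CD in a permuted order coincides with the linear map given by the splitting:

  import numpy as np

  rng = np.random.default_rng(0)
  n = 6
  M = rng.standard_normal((n, n))
  A = M @ M.T + n * np.eye(n)        # a generic symmetric positive definite A
  x = rng.standard_normal(n)
  sigma = rng.permutation(n)         # this epoch's permutation
  P = np.eye(n)[:, sigma]            # convention: P e_j = e_{sigma(j)}

  # Splitting (1.4) of the permuted Hessian, and T_sigma from (1.5).
  Ap = P.T @ A @ P
  L, D = np.tril(Ap, -1), np.diag(np.diag(Ap))
  T = -np.linalg.solve(L + D, L.T)

  # One epoch of exact-line-search CD, visiting coordinates in the order sigma.
  y = x.copy()
  for i in sigma:
      y[i] -= A[i, :] @ y / A[i, i]

  print(np.allclose(y, P @ T @ P.T @ x))   # True: matches the map in (1.6)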

1.1. Previous Work

This work is an adjunct to our paper [2], in which we show that a convex quadratic function that achieves close to worst-case convergence behavior for CCD (as shown by [4]) has faster asymptotic convergence behavior for RPCD, similar to the fully-random variant RCD. The function considered in [2] has the form (1.3) with

(1.10)   A = \delta I + (1 - \delta) \mathbf{1} \mathbf{1}^T,

where \delta \in (0, 1) and \mathbf{1} := (1, 1, \dotsc, 1)^T. Salient properties of this matrix include:

  • It has eigenvalue \delta replicated n-1 times, and a single dominant eigenvalue n - (n-1)\delta, and

  • it is invariant under symmetric permutations, that is, P^T A P = A for all permutation matrices P.

The latter property makes the analysis of RPCD much more straightforward than for general A. It follows from (1.5) that T_\sigma = T for every permutation \sigma, where T is obtained by applying the splitting (1.4) and the definition (1.5) to the unpermuted matrix A; that is, T_\sigma is independent of \sigma. For the matrix (1.10), the expression (1.6) thus simplifies to

x^{(\ell+1)n} = P_{\sigma_\ell} T P_{\sigma_\ell}^T x^{\ell n}.

We refer to [2] for a discussion of prior related work on variants of coordinate descent.

1.2. Contributions

Computational experience reported in [5, 6] showed that for most convex quadratic problems (1.3), the convergence behaviors of the cyclic and randomized versions of CD are similar. The most notable class of exceptions was quadratic problems (1.3) in which A has a single dominant eigenvalue. In most (but not all) such cases, the relative convergence behaviors of CCD (on the one hand) and RPCD and RCD (on the other hand) follow the behavior seen for the problem (1.10). That is, CCD converges slowly at a rate near its worst-case bound [4], while RPCD and RCD converge much more rapidly [2]. We therefore consider the following generalization of (1.10), in which all the small eigenvalues are identical, and the large eigenvalue has eigenvector u:

(1.11)   A = \delta I + (1 - \delta) u u^T.

(We assume that u is a vector with elements of size O(1).) This paper focuses on the case in which the components of u are not too different in magnitude. The latter property is key to distinguishing between particular cases of (1.11) in which the randomized methods work well or poorly, and our analysis provides some insight on the reasons for this apparent anomaly. In Section 2, we show that the matrix (1.11) is within a symmetric diagonal scaling (with a diagonal scaling matrix D whose entries are defined in terms of u) of the matrix \bar{A} defined by

(1.12)   \bar{A} := \delta I + (1 - \delta) z z^T, where z := \mathbf{1} + \eta, \eta \in \mathbb{R}^n,
\|\eta\|_\infty \le \hat{\eta}, with \delta \in (0, 1) and \hat{\eta} \in [0, 1).

We show too that the scaling by D does not greatly affect the behavior of coordinate descent, so there is little loss of generality between (1.11) with u = z and (1.12). For matrices of the form (1.12) in (1.3), we demonstrate convergence behavior similar to what we proved in [2] for (1.10): RPCD and RCD have much faster asymptotic convergence behavior than CCD. By the generalization to (1.11), we provide a more complete explanation of the behavior observed in computational experiments.

1.3. Preliminaries

Applying to (1.12) the decomposition (1.4) into triangular and diagonal matrices, we obtain

(1.13)

where

(1.14)

Following (1.5), we have for this case that the iteration matrix is

(1.15)

Following (1.8), we have for this case that

(1.16)

and, as in (1.9), our interest is in the quantity

(1.17)

We adapt other notation from [2], namely, the matrices Q_\ell, defined as follows:

Q_0 := \bar{A}, \qquad Q_\ell := \mathbb{E}_{\sigma_0, \dotsc, \sigma_{\ell-1}} \left[ S_\ell^T \bar{A} S_\ell \right], \quad \text{where } S_\ell := \prod_{t=\ell-1}^{0} P_{\sigma_t} T_{\sigma_t} P_{\sigma_t}^T.

We have the following recursive relationship between successive terms in the sequence Q_\ell, \ell = 0, 1, 2, \dotsc:

(1.18)   Q_{\ell+1} = \mathbb{E}_{\sigma_\ell} \left[ P_{\sigma_\ell} T_{\sigma_\ell}^T P_{\sigma_\ell}^T \, Q_\ell \, P_{\sigma_\ell} T_{\sigma_\ell} P_{\sigma_\ell}^T \right] = \mathbb{E}_{\sigma} \left[ P_{\sigma} T_{\sigma}^T P_{\sigma}^T \, Q_\ell \, P_{\sigma} T_{\sigma} P_{\sigma}^T \right],

where we have dropped the subscript \ell on \sigma_\ell in the second equality, since the permutation matrices for each epoch are i.i.d.

By comparison with (1.8), we see that

(1.19)   \mathbb{E}_{\sigma_0, \dotsc, \sigma_{\ell-1}} \left[ f(x^{\ell n}) \right] = \tfrac{1}{2} (x^0)^T Q_\ell \, x^0.

If the starting vector x^0 is drawn from N(0, I), we have

(1.20)   \mathbb{E} \left[ f(x^{\ell n}) \right] = \tfrac{1}{2} \operatorname{tr}(Q_\ell).
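For completeness, here is the short calculation behind (1.20), as reconstructed above: it uses \mathbb{E}[x^0 (x^0)^T] = I for x^0 \sim N(0, I) together with the cyclic property of the trace.

  \mathbb{E}\, f(x^{\ell n})
    = \tfrac{1}{2}\, \mathbb{E} \left[ (x^0)^T Q_\ell \, x^0 \right]
    = \tfrac{1}{2}\, \mathbb{E}\, \operatorname{tr} \left( Q_\ell \, x^0 (x^0)^T \right)
    = \tfrac{1}{2}\, \operatorname{tr} \left( Q_\ell \, \mathbb{E} \left[ x^0 (x^0)^T \right] \right)
    = \tfrac{1}{2}\, \operatorname{tr}(Q_\ell).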

We can see immediately that \bar{A} defined by (1.12) is “sandwiched” between scalar multiples of two matrices of the form (1.10). We have

(1.21)   \delta I + (1-\delta)(1-\hat{\eta})^2 \mathbf{1}\mathbf{1}^T \;\le\; \bar{A} \;\le\; \delta I + (1-\delta)(1+\hat{\eta})^2 \mathbf{1}\mathbf{1}^T,

where “\le” denotes element-wise inequality. This observation suggests similar behavior for RPCD to that observed for the matrices (1.10) in [2], but there is no obvious way to make use of this sandwich property in the analysis. The distinctiveness of the components of z plays a key role in the convergence analysis, which shows that the effects of \eta persist through the epochs.

1.4. Remainder of the Paper

In Section 2, we explore the relationship between matrices of the form (1.12) and those of the form (1.11). Section 3 defines a sequence of matrices that approximately dominates \{Q_\ell\}, and that can be parametrized compactly. We analyze the relationship between two successive elements of this sequence, and develop an estimate of the asymptotic per-epoch rate of convergence. Section 4 discusses the RCD and CCD variants for the problem (1.3), (1.12), while Section 5 reports computational experience with the three variants.

2. A Class of Quadratics with Similar Behavior to (1.12)

Given \delta \in (0, 1) and u \in \mathbb{R}^n, suppose that u satisfies

(2.1)

Consider the matrix

(2.2)

Defining D := \operatorname{diag}(u), we have

(2.3)

and note that the diagonal elements of the matrix in (2.3) lie in a bounded positive range. Thus we can express this matrix using a diagonal matrix whose elements lie in a small interval, so the matrix in (2.3) has the form (1.12).

We verify in Appendix A that the iterates generated by Algorithm 1, for a given sequence of indices, applied to (1.3) with the matrix from (2.2), with starting point y^0 and exact line search, are isomorphic to the iterates generated by applying the same algorithm with the same index sequence to (1.3) with the matrix from (2.3), with starting point x^0 = D^{-1} y^0. In fact, we have y^k = D x^k for all k, where \{y^k\} is the iteration sequence for (2.2) and \{x^k\} is the sequence for (2.3). Note that the function values coincide at each iteration, that is,

(2.4)

Suppose we are interested in the expected value of this function after \ell epochs, given that y^0 \sim N(0, I). Because of the relationship y = Dx, we can think of the vector x^0 as being drawn from the multivariate distribution N(0, D^{-2}). The expected function value can be derived by the same logic used to derive (1.20) until the final step, where we take the expectation with respect to x^0. We thus have the following expression in place of (1.20):

\mathbb{E} \left[ f(x^{\ell n}) \right] = \tfrac{1}{2} \operatorname{tr} \left( Q_\ell \, D^{-2} \right).

Thus the expected value of (2.4) for this starting distribution is a weighted version of the trace that captures the expected value of f for x^0 \sim N(0, I), where the weights lie between the smallest and largest diagonal elements of D^{-2}. Thus we expect to see similar asymptotic behavior for the quadratic objectives based on matrices (2.2) and (1.12), from starting points with the same distribution. The main contribution of this paper is therefore to extend the convergence analysis from the matrix (1.10) (as performed in [2]) to the case in which the single dominant eigenvector \mathbf{1} is replaced by a more general vector z, provided that the elements of z are not too disparate in magnitude. Since the analysis below allows a fairly wide range for the elements of u, the bounds (2.1) are not too restrictive.
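The covariance of exact-line-search CD under symmetric diagonal scaling that underlies this argument is easy to check numerically. The sketch below is our own: it uses an arbitrary positive definite A and positive diagonal D rather than the specific matrices (2.2)-(2.3), and confirms that iterates correspond via y = Dx and that function values coincide.

  import numpy as np

  def cd_steps(A, x, order):
      # Exact-line-search CD on f(x) = 0.5 x^T A x, coordinates visited in 'order'.
      for i in order:
          x[i] -= A[i, :] @ x / A[i, i]
      return x

  rng = np.random.default_rng(0)
  n = 8
  M = rng.standard_normal((n, n))
  A = M @ M.T + n * np.eye(n)            # arbitrary symmetric positive definite
  d = rng.uniform(0.5, 1.5, size=n)      # positive diagonal scaling
  D = np.diag(d)
  A_scaled = D @ A @ D                   # symmetric diagonal scaling of A

  order = rng.permutation(2 * n) % n     # an arbitrary shared index sequence
  y0 = rng.standard_normal(n)
  y = cd_steps(A, y0.copy(), order)          # iterates for A, start y^0
  x = cd_steps(A_scaled, y0 / d, order)      # iterates for DAD, start D^{-1} y^0

  print(np.allclose(y, d * x))               # iterates correspond via y = D x
  print(np.isclose(0.5 * y @ A @ y, 0.5 * x @ A_scaled @ x))  # f-values coincide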

3. Epoch-Wise Convergence of Expected Function Value

We focus here on the operation (1.18), which defines the relationship between two successive members of the sequence of expected coefficient matrices \{Q_\ell\}. We restrict attention to a regime in which

(3.1)

The first inequality implies that certain factors in the analysis can be replaced by 1. We expect that the main convergence results will continue to apply in a regime with weaker restrictions, which indeed is closer to the matrix studied in [2], but a different style of analysis will be needed, in which the remainder terms in the analysis are handled differently.

3.1. Compact Bound for Q_\ell

We will demonstrate that the matrix sequence \{Q_\ell\} from (1.18) is approximately dominated¹ by another sequence \{\hat{Q}_\ell\} of positive definite matrices that can be represented as a four-term recurrence

(3.2)   \hat{Q}_\ell = a_\ell \, z z^T + b_\ell \left( z \mathbf{1}^T + \mathbf{1} z^T \right) + c_\ell \, \mathbf{1} \mathbf{1}^T + d_\ell \, I,

where z is the vector defined in (1.12) and the coefficients a_\ell, c_\ell, and d_\ell are nonnegative and b_\ell is nonpositive, for all \ell. We set \hat{Q}_0 via the initial coefficients

(3.3)

¹Given two symmetric matrices B and C, we say that B dominates C if B - C is positive semidefinite.

Defining notation

(3.4)

it follows from (3.2) that

(3.5)

By substitution into (3.5), we have for the initial trace that

(3.6)

where the notation is defined in (3.4). In Subsection 3.5, we derive an approximate linear recurrence relation for the sequence of coefficient quadruplets (a_\ell, b_\ell, c_\ell, d_\ell), ultimately showing that for the case in which \mathbf{1}^T \eta = 0 and \|\eta\|_2^2 = n \hat{\eta}^2 / 3 (their expected values if the elements of \eta are drawn uniformly from [-\hat{\eta}, \hat{\eta}]), the spectral radius of this linear recurrence is bounded approximately by

(3.7)

Thus the sequence of expectations of function values is approximately bounded by a sequence that converges linearly to zero at the epoch-wise rate (3.7), provided that a further mild condition holds in addition to (3.1).
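The epoch-wise rate can also be estimated empirically. The following sketch is our own: it uses the form of (1.12) as reconstructed above, and a single sample path as a rough proxy for the expectation over permutations, measuring the asymptotic per-epoch reduction ratio for RPCD for comparison against the bound (3.7).

  import numpy as np

  rng = np.random.default_rng(0)
  n, delta, eta_bar = 100, 1e-3, 0.1
  z = 1.0 + rng.uniform(-eta_bar, eta_bar, size=n)
  A = delta * np.eye(n) + (1 - delta) * np.outer(z, z)

  x = rng.standard_normal(n)
  ratios = []
  for epoch in range(100):
      f0 = 0.5 * x @ A @ x
      for i in rng.permutation(n):          # fresh permutation each epoch
          x[i] -= A[i, :] @ x / A[i, i]     # exact line search
      ratios.append(0.5 * x @ A @ x / f0)
      x /= np.linalg.norm(x)                # harmless rescaling: the epoch map
                                            # is linear, so ratios are unchanged
  print(np.mean(ratios[-20:]))              # estimate of the asymptotic rate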

3.2. Expected Decrease in the First Iteration

We show in the following lemma that the decrease in the very first iteration for all variants of CD can be explained by using an extension of the analysis in [2, Theorem 3.4]. Geometrically, this happens because the function (1.3), (1.12) increases rapidly along just one direction and comparatively gently in all other directions. Thus an exact line search along any coordinate search direction will identify a point near the bottom of this multidimensional “trench.”

Lemma 3.1.

Consider solving (1.3) with the matrix \bar{A} defined in (1.12) using RCD or RPCD with exact line search. Suppose that the elements of x^0 are i.i.d. N(0, 1). Then after a single iteration, we have

where i is the coordinate chosen for updating in the first iteration. In the regime (3.1), we have

(3.8)

For CCD, these same estimates hold, with the expectations with respect to i omitted.

Proof.

Suppose that i is the component chosen for updating in the first iteration; for RPCD and RCD, i is chosen uniformly at random from \{1, 2, \dotsc, n\}. After a single step of CD, we have

Thus from (1.12) we have

(3.9)

Since z = \mathbf{1} + \eta and \|\eta\|_\infty \le \hat{\eta}, we have

Thus by substitution into (3.9), we obtain

(3.10)

By our assumptions on x^0, we have

(3.11a)
(3.11b)

The main result follows by taking expectations with respect to x^0 and i in (3.10) and using the estimates (3.11).

The estimate (3.8) follows immediately from (3.1) when we retain only first-order terms in \delta and \hat{\eta}.

The results for CCD can be obtained similarly, by omitting the expectations with respect to i from (3.11). ∎

By comparing (3.8) and (3.6), we note that the first iteration alone reduces the expected function value dramatically for typical parameter choices.
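This one-step collapse is easy to observe numerically. In the sketch below (our own; the matrix is built from our reconstruction of (1.12), with illustrative parameter values), the expected function value drops by several orders of magnitude after a single exact-line-search step from a random start.

  import numpy as np

  # One-step decrease from a random start (cf. Lemma 3.1), with
  # A = delta*I + (1 - delta) z z^T and z = 1 + eta per our reading of (1.12).
  rng = np.random.default_rng(0)
  n, delta, eta_bar = 200, 1e-3, 0.1
  z = 1.0 + rng.uniform(-eta_bar, eta_bar, size=n)
  A = delta * np.eye(n) + (1 - delta) * np.outer(z, z)
  f = lambda x: 0.5 * x @ A @ x

  before, after = [], []
  for _ in range(1000):
      x = rng.standard_normal(n)        # x^0 with i.i.d. N(0,1) components
      before.append(f(x))
      i = rng.integers(n)               # first coordinate, uniform (RCD/RPCD)
      x[i] -= A[i, :] @ x / A[i, i]     # exact line search along e_i
      after.append(f(x))

  print(np.mean(before))   # roughly n/2, i.e., about 100 here
  print(np.mean(after))    # smaller by orders of magnitude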

3.3. Definitions and Technical Results

We start by defining some useful quantities, drawing on [2], and proving several elementary results.

From (1.12) and (3.4), we have

(3.12)

From the definition of z in (1.12), we have \mathbf{1}^T z = n + \mathbf{1}^T \eta and z^T z = n + 2 \, \mathbf{1}^T \eta + \|\eta\|_2^2. We use \sigma to denote the permutation of \{1, 2, \dotsc, n\} associated with the permutation matrix P_\sigma, so that for any vector v, we have

(3.13)   (P_\sigma^T v)_i = v_{\sigma(i)}.

We can see immediately that

(3.14a)
(3.14b)

A useful conditional probability is as follows:

(3.15)

This claim follows because the vector in question contains n-1 zeros and a single 1, and the 1 cannot appear in the excluded position (because of the conditioning), but may appear in any other position with equal likelihood.

A quantity that appears frequently in the analysis is the matrix E defined by

(3.16)   E := \sum_{i=1}^{n-1} e_i e_{i+1}^T,

that is, the matrix of all zeros except for 1s on the diagonal immediately above the main diagonal. Several identities follow immediately:

(3.17)

We also have

(3.18)   \mathbb{E}_\sigma \left[ P_\sigma^T E P_\sigma \right] = \frac{1}{n} \left( \mathbf{1} \mathbf{1}^T - I \right).

To verify this claim, note that the diagonal elements of P_\sigma^T E P_\sigma are zero for all permutation matrices P_\sigma, while each off-diagonal position holds a nonzero with equal probability. Thus the expected value of the off-diagonal elements is obtained by distributing the n-1 nonzeros in E with equal weight among the n(n-1) off-diagonal positions, giving an expected value of 1/n for each of these elements, as in (3.18).
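This averaging argument can also be checked by direct simulation; the following sketch (our own) approximates the expectation in (3.18) by Monte Carlo over random permutations:

  import numpy as np

  # Monte Carlo check of (3.18): the average of P_sigma^T E P_sigma over
  # random permutations approaches (1/n) (11^T - I).
  rng = np.random.default_rng(0)
  n = 5
  E = np.diag(np.ones(n - 1), k=1)   # 1s immediately above the main diagonal
  acc = np.zeros((n, n))
  trials = 100_000
  for _ in range(trials):
      sigma = rng.permutation(n)
      P = np.eye(n)[:, sigma]        # permutation matrix with P e_j = e_{sigma(j)}
      acc += P.T @ E @ P
  print(np.round(acc / trials, 3))                       # empirical average
  print(np.round((np.ones((n, n)) - np.eye(n)) / n, 3))  # target from (3.18)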

We have the following results about quantities involving E.

Lemma 3.2.
(3.19a)
(3.19b)
Proof.

For (3.19a), we see that the left-hand side is a multiple of \mathbf{1}, and the constant can be verified directly.

For (3.19b), we use \mathbb{E}_i to denote the expectation with respect to an index i uniformly distributed over \{1, 2, \dotsc, n\}, and recall that \sigma denotes the permutation corresponding to P_\sigma. The result follows from a chain of equalities that uses (3.13), (3.17), and (3.15) in turn, as required. ∎

As in [2], we define

(3.20)

We noted in [2] that

so using the notation (3.16) and retaining only the leading-order terms, we have

(3.21)

We further define the epoch-iterative matrix \bar{T} to be the matrix T_\sigma from (1.15) in the special case \eta = 0, that is,

(3.22)

(The notation is appropriate because when \eta = 0, T_\sigma no longer depends on \sigma.) From the explicit expression for \bar{T} in [2] (\bar{T} is denoted by T in that paper), and using the notation (3.16), we obtain

(3.23)

(Note that this is a more refined estimate than what we would obtain by substituting (3.21) and (1.14) into (3.22).)

The following lemma provides a useful estimate of the matrix T_\sigma defined in (1.5).

Lemma 3.3.

Suppose that (3.1) holds. Then for T_\sigma defined by (1.5), we have

(3.24a)
(3.24b)
(3.24c)
Proof.

Starting from (1.15), we have

where we use the definitions (3.22), (3.20), and (3.21), together with assumption (3.1), for the third and subsequent equalities. This proves (3.24a). We obtain (3.24b) and (3.24c) by using the estimates (3.21) and (3.23). ∎

3.4. Single-Epoch Analysis

In this section we analyze the change in each term in the expression (3.2) over a single epoch. We examine in turn the following terms:

  • the z z^T term: Lemma 3.4,

  • the z \mathbf{1}^T + \mathbf{1} z^T term: Lemma 3.5,

  • the \mathbf{1} \mathbf{1}^T and I terms: Lemma 3.6.

Proofs of the following technical results appear in Appendix B.

Lemma 3.4.

Suppose that (3.1) holds. We have

Lemma 3.5.

Suppose that (3.1) holds. We have

Lemma 3.6.

Suppose that (3.1) holds. For any , we have

(3.25)

When , we have

(3.26)

When , we have