Analyzing Random Permutations for Cyclic Coordinate Descent
We consider coordinate descent methods on convex quadratic problems, in which exact line searches are performed at each iteration. (This algorithm is identical to Gauss-Seidel on the equivalent symmetric positive definite linear system.) We describe a class of convex quadratic problems for which the random-permutations version of cyclic coordinate descent (RPCD) outperforms the standard cyclic coordinate descent (CCD) approach. A convergence analysis is developed to support the computational observations.
Key words and phrases: Coordinate descent, Gauss-Seidel, randomization, permutations
2010 Mathematics Subject Classification: Primary 65F10; Secondary 90C25, 68W20
The coordinate descent (CD) approach for solving the problem
where the single nonzero in appears in position . Epochs (indicated by the counter ) encompass cycles of inner iterations (indicated by ). At each iteration , one component of is selected for updating; a steplength parameter is applied to the negative gradient of with respect to this component.
The choice of coordinate to be updated at inner iteration of epoch differs between variants of CD, as follows:
For “cyclic CD” (CCD), we choose .
For “fully randomized CD,” also known as “stochastic CD,” and abbreviated as RCD, we choose uniformly at random from and independently at each iteration.
For “random-permutations CD” (abbreviated as RPCD), we choose at the start of epoch to be a random permutation of the index set , then set to be the th entry in , for .
Note that denotes the value of after epochs.
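These selection rules are easy to state in code. The following sketch (Python; all names are illustrative) applies exact-line-search CD to a strictly convex quadratic f(x) = ½xᵀAx, for which the exact step along coordinate i is x_i ← x_i − (Ax)_i/A_ii, and implements the three index-selection rules above.

```python
import numpy as np

def cd_epoch(A, x, order):
    """One epoch of coordinate descent on f(x) = 0.5 * x^T A x with exact
    line search: the step along coordinate i is x_i <- x_i - (Ax)_i / A_ii."""
    x = x.copy()
    for i in order:
        x[i] -= A[i] @ x / A[i, i]
    return x

def cd(A, x0, variant, epochs, rng):
    """Run `epochs` epochs of CCD, RCD, or RPCD starting from x0."""
    n = len(x0)
    x = x0.copy()
    for _ in range(epochs):
        if variant == "CCD":                 # fixed cyclic order 1, ..., n
            order = np.arange(n)
        elif variant == "RCD":               # n independent uniform draws
            order = rng.integers(0, n, size=n)
        else:                                # RPCD: fresh permutation per epoch
            order = rng.permutation(n)
        x = cd_epoch(A, x, order)
    return x
```

On a strictly convex quadratic, all three variants drive the function value to its minimum of zero; the per-epoch rates at which they do so differ in the ways studied in this paper.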
We consider in this paper problems in which is a strictly convex quadratic, that is
with symmetric positive definite. As we saw in , even this restricted class of functions reveals significant diversity in convergence behavior between the three variants of CD described above. The minimizer of (1.3) is obviously . We assume that the choice of in Algorithm 1 is the exact minimizer of along the chosen coordinate direction. The resulting approach is thus equivalent to the Gauss-Seidel method applied to the linear system . The variants CCD, RCD, RPCD can be interpreted as different cyclic / randomized variants of Gauss-Seidel for .
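The Gauss-Seidel equivalence can be checked directly: one CCD epoch with exact line searches on f(x) = ½xᵀAx reproduces one Gauss-Seidel sweep on the homogeneous system Ax = 0. A minimal sketch (Python; the SPD matrix here is a generic stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # a generic symmetric positive definite matrix
x = rng.standard_normal(n)

# One CCD epoch with exact line search on f(x) = 0.5 x^T A x.
x_cd = x.copy()
for i in range(n):
    x_cd[i] -= A[i] @ x_cd / A[i, i]

# One Gauss-Seidel sweep on A x = 0: x+ = x - (L + D)^{-1} A x, where L is
# the strictly lower triangular part of A and D is its diagonal.
L = np.tril(A, -1)
D = np.diag(np.diag(A))
x_gs = x - np.linalg.solve(L + D, A @ x)

assert np.allclose(x_cd, x_gs)
```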
In the RPCD variant, we can express a single epoch as follows. Letting be the permutation matrix corresponding to the permutation on this epoch, we split the symmetrically permuted Hessian into strictly triangular and diagonal parts as follows:
where is strictly lower triangular and is diagonal. We then define
so that the epoch indexed by can be written as follows:
where denotes the matrix corresponding to permutation . By recursing to the initial point , we obtain after epochs that
yielding a function value of
We analyze convergence in terms of the expected value of after epochs, with the expectation taken over the initial point and the permutations , that is,
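The single-epoch formula can be checked numerically under the splitting described above: permute the Hessian symmetrically, take its strictly lower triangular and diagonal parts, and apply one Gauss-Seidel sweep in the permuted variables. In the sketch below (Python; the matrix and permutation are arbitrary stand-ins), this matrix form matches a direct epoch of exact-line-search CD in the order given by the permutation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # a generic symmetric positive definite matrix
x = rng.standard_normal(n)
perm = rng.permutation(n)            # coordinate update order for this epoch

# Direct epoch: exact line searches in the order given by perm.
x_cd = x.copy()
for i in perm:
    x_cd[i] -= A[i] @ x_cd / A[i, i]

# Matrix form: symmetrically permute the Hessian, split it into strictly lower
# triangular and diagonal parts, and take one Gauss-Seidel sweep in y = P^T x.
P = np.eye(n)[:, perm]               # permutation matrix: (P^T A P)_{ij} = A_{perm[i], perm[j]}
Ap = P.T @ A @ P
L = np.tril(Ap, -1)
D = np.diag(np.diag(Ap))
y = P.T @ x
y_new = y - np.linalg.solve(L + D, Ap @ y)
x_mat = P @ y_new

assert np.allclose(x_cd, x_mat)
```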
1.1. Previous Work
This work is an adjunct to our paper , in which we show that a convex quadratic function that achieves close to worst-case convergence behavior for CCD (as shown by ) has faster asymptotic convergence behavior for RPCD, similar to the fully-random variant RCD. The function considered in  has the form (1.3) with
where . Salient properties of this matrix include:
It has eigenvalue replicated times, and a single dominant eigenvalue , and
it is invariant under symmetric permutations, that is, for all permutation matrices .
The latter property makes the analysis of RPCD much more straightforward than for general . It follows from (1.5) that , where , that is, is independent of . For the matrix (1.10), the expression (1.6) thus simplifies to
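Both stated properties are easy to verify numerically if, consistent with the two properties listed, the matrix (1.10) is taken to be A = δI + (1 − δ)11ᵀ with δ ∈ (0, 1); this specific form is an assumption of the sketch below:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 6, 0.1
A = delta * np.eye(n) + (1 - delta) * np.ones((n, n))

# Property 1: eigenvalue delta with multiplicity n - 1, plus one dominant
# eigenvalue delta + n * (1 - delta).
eigs = np.sort(np.linalg.eigvalsh(A))
assert np.allclose(eigs[:-1], delta)
assert np.isclose(eigs[-1], delta + n * (1 - delta))

# Property 2: invariance under symmetric permutation, P^T A P = A.
perm = rng.permutation(n)
P = np.eye(n)[:, perm]
assert np.allclose(P.T @ A @ P, A)
```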
We refer to  for a discussion of prior related work on variants of coordinate descent.
Computational experience reported in [5, 6] showed that for most convex quadratic problems (1.3), the convergence behaviors of the cyclic and randomized versions of CD are similar. The most notable class of exceptions was quadratic problems (1.3) in which has a single dominant eigenvalue. In most (but not all) such cases, the relative convergence behaviors of CCD (on the one hand) and RPCD and RCD (on the other hand) follow the behavior seen for the problem (1.10). That is, CCD converges slowly at a rate near its worst-case bound  while RPCD and RCD converge much more rapidly . We therefore consider the following generalization of (1.10), in which all the small eigenvalues are identical, and the large eigenvalue has eigenvector :
(We assume that is a vector with elements of size .) This paper focuses on the case in which the components of are not too different in magnitude. This property is key to distinguishing between particular cases of (1.11) in which the randomized methods work well or poorly, and our analysis provides some insight into the reasons for this apparent anomaly. In Section 2, we show that the matrix (1.11) is within a symmetric diagonal scaling (with scaling matrix , where ) of the matrix defined by
(1.12)
where , , , with and .
We show too that the scaling by does not greatly affect the behavior of coordinate descent, so there is little loss of generality between (1.11) with and (1.12). For matrices of the form (1.12) in (1.3), we demonstrate similar convergence behavior to what we proved in  for (1.10): RPCD and RCD have much faster asymptotic convergence behavior than CCD. By the generalization to (1.11), we provide a more complete explanation of the behavior observed in computational experiments.
Following (1.5), we have for that the iteration matrix is
Following (1.8), we have for that
and, as in (1.9), our interest is in the quantity
We adapt other notation from , namely, the matrices , defined as follows:
We have the following recursive relationship between successive terms in the sequence , :
where we have dropped the subscript on in the second equality, since the permutation matrices for each epoch are i.i.d.
By comparison with (1.8), we see that
If the starting vector is drawn from , we have
where and “” denotes element-wise inequality. This observation suggests similar behavior for RPCD to that observed for the matrices (1.10) in , but there is no obvious way to make use of this sandwich property in the analysis. The distinctiveness of the components of plays a key role in the convergence analysis, which shows that effects of persist through the epochs.
1.4. Remainder of the Paper
In Section 2, we explore the relationship between matrices of the form (1.12) and those of the form (1.11). Section 3 defines a sequence of matrices that approximately dominates , and that can be parametrized compactly. We analyze the relationship between two successive elements of the sequence , and develop an estimate of the asymptotic per-epoch rate of convergence. Section 4 discusses RCD and CCD variants for the problem (1.3), (1.12), while Section 5 reports computational experience with the three variants.
2. A Class of Quadratics with Similar Behavior to (1.12)
Given and , suppose that satisfies
Consider the matrix
Defining , we have
We verify in Appendix A that the iterates generated by applying Algorithm 1, with a given sequence of indices and exact line searches, to (1.3) with from (2.2) and starting point , are isomorphic to the iterates generated by applying the same algorithm with the same index sequence to (1.3) with from (2.3), with starting point . In fact, we have for all , where is the iteration sequence for (2.2) and is the sequence for (2.3). Note that the function values coincide at each iteration, that is,
Suppose we are interested in the expected value of this function after epochs, given that . Because of the relationship , we can think of the vector as being drawn from the multivariate distribution . The expected function value can be derived by the same logic used to derive (1.20) until the final step, where we take the expectation with respect to . We thus have the following expression in place of (1.20):
Thus the expected value of (2.4) for is a weighted version of the trace that captures the expected value of for , where the weights are between and . Hence we expect to see similar asymptotic behavior for the quadratic objectives based on the matrices (2.2) and (1.12), from starting points with the same distribution. The main contribution of this paper is therefore to extend the convergence analysis from the matrix (1.10) (as performed in ) to the case in which the single dominant eigenvector is replaced by a more general vector , provided that the elements of are not too disparate in magnitude. Since we allow in the analysis below, the bounds (2.1) are not too restrictive.
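The effect of a symmetric diagonal scaling on exact-line-search CD can also be checked numerically. In the sketch below (Python; the matrix, scaling, and index sequence are arbitrary stand-ins), we take Ā = DAD with D positive diagonal; running the same index sequence from starting points related by the scaling keeps the two iterate sequences related by x_t = Dȳ_t, with identical function values at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)              # a generic symmetric positive definite matrix
d = rng.uniform(0.5, 2.0, size=n)        # positive diagonal scaling factors
D = np.diag(d)
A_bar = D @ A @ D                        # symmetric diagonal scaling of A

f = lambda M, z: 0.5 * z @ M @ z
x = rng.standard_normal(n)               # starting point for the A problem
y = x / d                                # scaled starting point: x = D y

idx = rng.integers(0, n, size=30)        # a shared sequence of coordinate indices
for i in idx:
    x[i] -= A[i] @ x / A[i, i]           # exact CD step on f(A, .)
    y[i] -= A_bar[i] @ y / A_bar[i, i]   # exact CD step on f(A_bar, .)
    assert np.isclose(f(A, x), f(A_bar, y))   # function values coincide
assert np.allclose(x, d * y)             # iterates stay related by the scaling
```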
3. Epoch-Wise Convergence of Expected Function Value
We focus here on the operation (1.18), defining the relationship between two successive members of the sequence of expected coefficient matrices . We focus on a regime in which
The first inequality implies that we can replace factors in the analysis by . We expect that the main convergence results will continue to apply in a regime in which , which indeed is closer to the matrix studied in , but a different style of analysis will be needed, in which the remainder terms in the analysis are handled differently.
3.1. Compact Bound for
We will demonstrate that the matrix sequence from (1.18) is approximately dominated (given two symmetric matrices and , we say that dominates if is positive semidefinite) by another sequence of positive definite matrices that can be represented as a four-term recurrence
where is the vector defined in (1.12) and the coefficients , , and are nonnegative and is nonpositive, for all . We set , with
it follows from (3.2) that
By substitution into (3.5), we have for the initial trace that
where is defined in (3.4). In Subsection 3.5, we derive an approximate linear recurrence relation for the sequence of coefficient quadruplets , ultimately showing that for the case in which and (their expected values if the elements of are drawn uniformly from ), the spectral radius of this linear recurrence is bounded approximately by
Thus the sequence of expectations of function values is approximately bounded by a sequence that converges linearly to zero at the epoch-wise rate (3.7), provided that in addition to (3.1), the condition holds.
3.2. Expected Decrease in the First Iteration
We show in the following lemma that the decrease in the very first iteration for all variants of CD can be explained by using an extension of the analysis in [2, Theorem 3.4]. Geometrically, this happens because the function (1.3), (1.12) increases rapidly along just one direction (the all-ones direction) and comparatively gently along the other directions. Thus an exact line search along any coordinate direction will identify a point near the bottom of this multidimensional “trench.”
where is the coordinate chosen for updating in the first iteration. In the regime (3.1), we have
For CCD, these same estimates hold, with the expectations with respect to omitted.
Suppose that is the component chosen for updating in the first iteration, which is chosen uniformly at random from for RPCD and RCD. After a single step of CD, we have
Thus from (1.12) we have
Since and , we have
Thus by substitution into (3.9), we obtain
By our assumptions on , we have
The results for CCD can be obtained similarly, by omitting the expectations with respect to from (3.11). ∎
3.3. Definitions and Technical Results
We start by defining some useful quantities, drawing on , and proving several elementary results.
From the definition of in (1.12), we have and . We use to denote the permutation of associated with the permutation matrix , so that for any vector , we have
We can see immediately that
A useful conditional probability is as follows:
This claim follows because contains zeros and a single , and the cannot appear in position (because ) but may appear in any other position with equal likelihood.
A quantity that appears frequently in the analysis is the matrix defined by
that is, the matrix of all zeros except for on the diagonal immediately above the main diagonal. Several identities follow immediately:
We also have
To verify this claim, note that the diagonals of are zero for all permutation matrices , while the off-diagonals are with equal probability. Thus the expected value of the off-diagonal elements is obtained by distributing the nonzeros in with equal weight among all off-diagonal elements, giving an expected value of for each of these elements, as in (3.18).
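Assuming the nonzero entries on the superdiagonal of the matrix above are all 1 (so that the n − 1 nonzeros spread over the n(n − 1) off-diagonal positions give an expected value of 1/n each), the averaging claim can be checked by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 50_000
E = np.diag(np.ones(n - 1), k=1)     # ones on the first superdiagonal, zeros elsewhere

acc = np.zeros((n, n))
for _ in range(trials):
    P = np.eye(n)[:, rng.permutation(n)]
    acc += P.T @ E @ P               # symmetric permutation of E
avg = acc / trials

# Expected value: zero diagonal, 1/n in every off-diagonal position.
target = (np.ones((n, n)) - np.eye(n)) / n
assert np.max(np.abs(avg - target)) < 0.01
```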
We have the following results about quantities involving .
For (3.19a), we see that is a multiple of , and that .
As in , we define
We noted in  that
so using notation (3.16) and saving only terms, we have
We further define the epoch-iterative matrix , to be the matrix from (1.15) in the special case , that is
The following lemma provides a useful estimate of the matrix defined in (1.5).
3.4. Single-Epoch Analysis
In this section we analyze the change in each term in the expression (3.2) over a single epoch. We examine in turn the following terms:
Proofs of the following technical results appear in Appendix B.
Suppose that (3.1) holds. We have
Suppose that (3.1) holds. We have
Suppose that (3.1) holds. For any , we have
When , we have
When , we have