The CGA is almost deterministic

The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic

Percy Deift New York University deift@cims.nyu.edu  and  Thomas Trogdon University of California, Irvine ttrogdon@math.uci.edu
Abstract.

We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices $XX^*$, where $X$ is $n \times m$ and $n/m \to d \in (0,1)$. Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean, and converge almost surely.

2010 Mathematics Subject Classification:
Primary: 65F10, 60B20
We are grateful for discussions with Elliot Paquette, Joel Tropp and Roman Vershynin that have greatly improved the paper. This work was supported in part by NSF DMS-1300965 (PD) and NSF DMS-1753185 (TT)

1. Introduction

The conjugate gradient algorithm (CGA) [HS52] is arguably the most effective iterative method from numerical linear algebra. In exact arithmetic, the algorithm requires at most $n$ iterations to solve an $n \times n$ positive-definite linear system and it often requires many fewer iterations to compute a good approximate solution. It is exceedingly simple to implement and there are well-known error bounds available. And, despite the fact that the CGA is sensitive to round-off errors, these error bounds still effectively hold for floating point arithmetic [Gre89]. While we present the algorithm in full below (see Algorithm 2.1), the variational characterization of the method is summarized as follows: Consider the linear system $Ax = b$, where $A$ is positive definite. Given an initial guess $x_0$, find the unique vector $x_k \in x_0 + \mathcal{K}_k(A, r_0)$, where $\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, A r_0, \ldots, A^{k-1} r_0\}$ and $r_0 = b - A x_0$, that satisfies

$\|x - x_k\|_A = \min_{y \in x_0 + \mathcal{K}_k(A, r_0)} \|x - y\|_A, \qquad \|v\|_A := \sqrt{v^* A v}.$

At each step of the iteration one can easily compute the residual $r_k = b - A x_k$, and indeed the algorithm itself computes $x_k$ and $r_k$. One then has to determine a computable stopping criterion and, typically, the algorithm is halted when $\|r_k\|_2 \le \epsilon$ for a chosen error tolerance $\epsilon > 0$.

Here we focus on two main measures of the error $e_k = x - x_k$, $x = A^{-1} b$: the residual norm $\|r_k\|_2$, $r_k = b - A x_k = A e_k$, and the $A$-norm of the error, $\|e_k\|_A = \sqrt{e_k^* A e_k}$, when $x_k \neq x$. The associated halting times are

(1.1)    $t(\epsilon) = \min\{k \ge 1 : \|r_k\|_2 \le \epsilon\}, \qquad \tau(\epsilon) = \min\{k \ge 1 : \|e_k\|_A \le \epsilon\}.$

We emphasize the importance of analyzing both quantities because $\|r_k\|_2$ is what is observed throughout the iteration and, of course, $\|e_k\|_A$ is the true error.

While our results (Theorems 3.1, 3.2, 3.3) also hold for complex Gaussian matrices (we can also easily extend the results to the case of quaternion entries), we state some consequences for real matrices for simplicity. Assume

(1.2)    $W = W_n = \frac{1}{m} X X^T,$

where $X = X_n$ is an $n \times m$ matrix whose entries are iid standard normal random variables. This is the real Wishart distribution. Suppose further that $m = m(n)$ satisfies $n/m \to d$ for $0 < d < 1$. Then if $b$ is a random unit vector, independent of $W$, as $n \to \infty$,

If then as

Furthermore, for almost every choice of , with fixed, we prove that

Therefore, the halting time becomes effectively deterministic. We also present estimates that demonstrate that the probability that the errors deviate from their means decays exponentially with respect to $n$. In the case , a consequence of our results is that for any fixed and ,

Remark 1.1.

It is important to point out that $W$ in (1.2) is not necessarily a near-identity matrix. Indeed, the eigenvalues typically lie in the interval

$\left[ (1 - \sqrt{d})^2, \; (1 + \sqrt{d})^2 \right]$

and have an asymptotic density given by the famed Marchenko–Pastur law, see Definition 2.2.
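As a quick numerical illustration of this remark (a sketch under the normalization in (1.2); the values of $n$ and $d$ below are arbitrary choices), one can sample a real Wishart matrix and compare its extreme eigenvalues and condition number with the Marchenko–Pastur edges $(1 \pm \sqrt{d})^2$ and the limiting condition number $\left(\frac{1+\sqrt{d}}{1-\sqrt{d}}\right)^2$ discussed in Section 1.1.2:

import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 0.2
m = int(np.ceil(n / d))                     # n/m is approximately d < 1

X = rng.standard_normal((n, m))
W = X @ X.T / m                             # real Wishart matrix, cf. (1.2)
eigs = np.linalg.eigvalsh(W)

lam_minus, lam_plus = (1 - np.sqrt(d))**2, (1 + np.sqrt(d))**2
print(eigs.min(), lam_minus)                # smallest eigenvalue vs. (1 - sqrt(d))^2
print(eigs.max(), lam_plus)                 # largest eigenvalue vs. (1 + sqrt(d))^2
print(eigs.max() / eigs.min(),              # condition number vs. its almost sure limit
      ((1 + np.sqrt(d)) / (1 - np.sqrt(d)))**2)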

Our proofs make critical use of the invariance of the Wishart distribution and the relation between Householder bidiagonalization and the Lanczos iteration. This allows one to use classical estimates on chi-distributed random variables in a crucial way. The specific tools and results we incorporate from random matrix theory include global eigenvalue estimates [RV10], the convergence of the empirical spectral measure [BMP07] and the central limit theorem for linear statistics [LP09].

The remainder of the paper is set up as follows. In Section 1.1 we compare our analysis with facts already known about the conjugate gradient algorithm. We also demonstrate our results with numerical examples. In Section 2 we introduce our random matrix ensembles, the basic definitions from random matrix theory and review the Householder bidiagonalization procedure applied to these ensembles. We also review the connections between the conjugate gradient algorithm, the Lanczos iteration and the Householder bidiagonalization procedure. In Section 3 we present our main theorems. In Section 4 we introduce the results from probability and random matrix theory that are required to prove our theorems. In Section 5 we give the proofs of the theorems.

1.1. Comparison and demonstration

We now give a demonstration and discussion of the results. In what follows, a sample average of a random variable refers to the empirical mean over the indicated number of independent samples. We will refer to the matrix

$W = \frac{1}{m} X X^T,$

where $X$ is an $n \times m$ matrix having iid entries $\pm 1$ with equal probability, as the Bernoulli ensemble.

1.1.1. A numerical demonstration

To demonstrate our main results, in Figure 2 we plot the following quantities as a function of for different values of

with error bars that indicate where % of the samples lie. In Figure 1 we plot the same statistics for

Both Figure 2 and Figure 1 demonstrate the concentration of the errors about their means. We demonstrate the limiting behavior of the halting times in Figure 3.

In all of these figures we have included computations for distributions of random matrices that are beyond the class for which our results apply. Nonetheless, it is clear that the behavior persists. This universality will be investigated in future work.
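The following sketch indicates how an experiment of the kind behind Figure 3 can be set up (it is illustrative only: the matrix size, tolerance, sample count, and the option of $\pm 1$ entries for the Bernoulli ensemble are arbitrary choices, not the authors' exact setup). It samples random systems, runs the CGA until the residual norm drops below a tolerance, and records the halting time; the spread of the recorded halting times is observed to be very small.

import numpy as np

def cg_halting_time(A, b, eps, max_it=1000):
    # Run the CGA (Algorithm 1) with x_0 = 0 and return the first k with ||r_k||_2 <= eps.
    x = np.zeros(len(b))
    r = b.copy()
    p = r.copy()
    rr = r @ r
    for k in range(1, max_it + 1):
        Ap = A @ p
        a = rr / (p @ Ap)
        x = x + a * p
        r = r - a * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) <= eps:
            return k
        p = r + (rr_new / rr) * p
        rr = rr_new
    return max_it

rng = np.random.default_rng(1)
n, d, eps, samples = 200, 0.2, 1e-10, 100
m = int(np.ceil(n / d))

times = []
for _ in range(samples):
    X = rng.standard_normal((n, m))   # or rng.choice([-1.0, 1.0], size=(n, m)) for the Bernoulli ensemble
    W = X @ X.T / m
    b = rng.standard_normal(n)
    b /= np.linalg.norm(b)            # random unit vector, independent of W
    times.append(cg_halting_time(W, b, eps))

print(min(times), max(times), np.mean(times))   # the halting times barely fluctuate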


Figure 1. A demonstration that concentrates strongly around its mean, which is nearly equal to (solid). This plot is for . The dots give the sample mean over samples and the error bars give the symmetric interval where of the samples lie. It is clear that this interval shrinks rapidly as increases.

Figure 2. A demonstration that concentrates strongly around its mean, which is nearly equal to (solid). This plot is for . The dots give the sample mean over samples and the error bars give the symmetric interval where of the samples lie. It is clear that this interval shrinks rapidly as increases.

Figure 3. A demonstration that becomes almost deterministic as increases for and using samples. Each panel gives the empirical halting time distribution for the indicated value of . The histogram is extremely concentrated.

1.1.2. Relation to previous work

The classical error estimate for the CGA is [HS52, Gre89]

(1.3)    $\frac{\|e_k\|_A}{\|e_0\|_A} \le 2 \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^k,$

where $\kappa = \kappa(A) = \lambda_{\max}/\lambda_{\min}$ is the condition number of $A$. Here $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $A$. It is a classical result in random matrix theory [BY93] that the condition number of (1.2) converges almost surely to $\left( \frac{1 + \sqrt{d}}{1 - \sqrt{d}} \right)^2$. Roughly, one then obtains

$\frac{\|e_k\|_A}{\|e_0\|_A} \lessapprox 2 \left( \frac{\frac{1 + \sqrt{d}}{1 - \sqrt{d}} - 1}{\frac{1 + \sqrt{d}}{1 - \sqrt{d}} + 1} \right)^k,$

which is often just simplified to

$\frac{\|e_k\|_A}{\|e_0\|_A} \lessapprox 2\, d^{k/2}.$

This overestimates the actual error by just a factor of 2.
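For the reader's convenience, the simplification used here is the elementary computation (inserting the limiting value of the condition number from [BY93] quoted above):

$\sqrt{\kappa} = \frac{1 + \sqrt{d}}{1 - \sqrt{d}} \quad \Longrightarrow \quad \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} = \frac{\frac{1 + \sqrt{d}}{1 - \sqrt{d}} - 1}{\frac{1 + \sqrt{d}}{1 - \sqrt{d}} + 1} = \frac{\frac{2\sqrt{d}}{1 - \sqrt{d}}}{\frac{2}{1 - \sqrt{d}}} = \sqrt{d},$

so that the right-hand side of (1.3) becomes $2\left(\sqrt{d}\right)^k = 2\, d^{k/2}$.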

In [MT16], the authors used (1.3) and tail bounds on the condition number to estimate the halting times (1.1) in the case . A key observation was that the actual number of iterations appears to be of the same asymptotic order as the estimate obtained using (1.1). This is something that will indeed be true if the error estimate used decays exponentially and turns out to be an overestimate by a constant factor.

Remark 1.2.

Of particular interest is the case where depends on and as . For example, was seen in [DMT16, DMOT14] to produce universal fluctuations for the halting times. Similarly, one would want to treat the case as

Remark 1.3.

Our calculations in this work apply only to matrices with Gaussian entries. An important question, one of universality, is whether our results hold if this assumption is relaxed. Indeed, one expects this to be true based on the computations in Figures 2 and 5 and the wealth of theoretical results from random matrix theory [PY14, BKYY16, BMP07].

2. The bidiagonalization of Wishart matrices and invariance

Definition 2.1.

For $0 < d \le 1$, set $m = m(n)$ to be a sequence of positive integers with $n/m \to d$ as $n \to \infty$. Let $X$ be an $n \times m$ matrix of iid standard normal random variables ($\beta = 1$) or $X = \frac{1}{\sqrt{2}}(X_1 + \mathrm{i}\, X_2)$ where $X_1$ and $X_2$ are independent copies of an $n \times m$ matrix of iid standard normal random variables ($\beta = 2$). Then

(2.1)    $W = W_n = \frac{1}{m} X X^*$

has the $\beta$-Wishart distribution. The associated empirical spectral measure (ESM) is given by

$\mu_n = \frac{1}{n} \sum_{j=1}^{n} \delta_{\lambda_j},$

where $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of $W$. Define the averaged ESM (or density of states) by

(2.2)    $\int f \, \mathrm{d}\bar{\mu}_n = \mathbb{E}\left[ \int f \, \mathrm{d}\mu_n \right]$

for every $f \in C_c(\mathbb{R})$ ($C_c(I)$ denotes the continuous functions with compact support in $I$).

Definition 2.2.

The Marchenko–Pastur law on $[\lambda_-, \lambda_+]$, $\lambda_\pm = (1 \pm \sqrt{d})^2$, is given by the density

$\rho_{\mathrm{MP}}(x) = \frac{\sqrt{(\lambda_+ - x)(x - \lambda_-)}}{2 \pi d x}, \qquad \lambda_- \le x \le \lambda_+.$

The relation of the Marchenko–Pastur law to the eigenvalues of a Wishart matrix is given in the following section. But we demonstrate this relationship in Figure 4.
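A computation in the spirit of Figure 4 can be sketched as follows (illustrative parameters, not the authors' script): sample one Wishart matrix, histogram its eigenvalues, and overlay the Marchenko–Pastur density of Definition 2.2.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n, d = 1000, 0.2
m = int(np.ceil(n / d))

X = rng.standard_normal((n, m))
eigs = np.linalg.eigvalsh(X @ X.T / m)       # spectrum of one Wishart sample

lam_m, lam_p = (1 - np.sqrt(d))**2, (1 + np.sqrt(d))**2
x = np.linspace(lam_m, lam_p, 400)
rho = np.sqrt((lam_p - x) * (x - lam_m)) / (2 * np.pi * d * x)   # Marchenko–Pastur density

plt.hist(eigs, bins=50, density=True)
plt.plot(x, rho, 'k')
plt.xlabel('Eigenvalues')
plt.show()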


Figure 4. A demonstration of how the Marchenko–Pastur law relates to the spectrum of a Wishart matrix for . A histogram for the eigenvalues of one sampled matrix closely approximates the Marchenko–Pastur density.

The Wishart distribution is invariant under orthogonal ($\beta = 1$) or unitary ($\beta = 2$) conjugation. This means that if $U$ is a random orthogonal (or unitary) matrix that is independent of $W$ then $U W U^*$ has the same distribution as $W$.

For $X$ as above, the Householder bidiagonalization procedure [TBI97] operates on $X$ on the left and the right with Householder reflections, so that

(2.3)
(2.4)

where all entries are non-negative. Because of invariance, the entries are independent $\chi$-distributed random variables, see [DE02] and the references therein. Specifically,

where all entries are independent. Define the infinite matrix by the entry-wise limit

Therefore

Lastly, define to be the upper-left submatrix of .
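To illustrate the type of bidiagonal model that arises, the following sketch samples a lower bidiagonal matrix with independent chi-distributed entries and checks that it reproduces the spectral behavior of a dense Wishart sample. The degrees of freedom and normalization used below are the standard Dumitriu–Edelman choices for $\beta = 1$ and may differ in inessential details from (2.3)–(2.4).

import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 0.2
m = int(np.ceil(n / d))

def chi(df):
    # Sample chi-distributed variables with the given (array of) degrees of freedom.
    return np.sqrt(rng.chisquare(df))

# Lower bidiagonal matrix with independent chi entries (beta = 1):
# diagonal chi_m, chi_{m-1}, ..., chi_{m-n+1}; subdiagonal chi_{n-1}, ..., chi_1.
B = np.diag(chi(m - np.arange(n))) + np.diag(chi(n - 1 - np.arange(n - 1)), -1)
W_bidiag = (B @ B.T) / m

X = rng.standard_normal((n, m))
W_dense = (X @ X.T) / m

# Both spectra follow the same law; compare the spectral edges of the two samples.
e1, e2 = np.linalg.eigvalsh(W_bidiag), np.linalg.eigvalsh(W_dense)
print(e1.min(), e2.min())
print(e1.max(), e2.max())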

2.1. Householder bidiagonalization, the Lanczos iteration and the CG algorithm

The conjugate gradient algorithm (CGA) for the iterative solution of

$A x = b, \qquad A = A^* > 0,$

is given by

Algorithm 1: Conjugate Gradient Algorithm. $x_0$ is the initial guess. Set $r_0 = b - A x_0$, $p_0 = r_0$. For $k = 0, 1, 2, \ldots$:

  1. Compute $a_k = \langle r_k, r_k \rangle / \langle p_k, A p_k \rangle$.

  2. Set $x_{k+1} = x_k + a_k p_k$.

  3. Set $r_{k+1} = r_k - a_k A p_k$.

  4. Compute $b_k = \langle r_{k+1}, r_{k+1} \rangle / \langle r_k, r_k \rangle$.

  5. Set $p_{k+1} = r_{k+1} + b_k p_k$.
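For concreteness, here is a minimal NumPy transcription of Algorithm 1 (an illustrative sketch, not the authors' code) that also records the two quantities studied in this paper, the residual norms $\|r_k\|_2$ and the error norms $\|e_k\|_A$; the exact solution is computed directly only so that the true error can be reported.

import numpy as np

def conjugate_gradient(A, b, x0, num_iter):
    # Algorithm 1, recording ||r_k||_2 and ||e_k||_A for k = 0, 1, ..., num_iter.
    x_true = np.linalg.solve(A, b)                 # used only to report the true error
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    res_norms = [np.linalg.norm(r)]
    e = x_true - x
    err_norms = [np.sqrt(e @ (A @ e))]             # A-norm of the error
    for _ in range(num_iter):
        Ap = A @ p
        a = (r @ r) / (p @ Ap)
        x = x + a * p
        r_new = r - a * Ap
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
        e = x_true - x
        res_norms.append(np.linalg.norm(r))
        err_norms.append(np.sqrt(e @ (A @ e)))
    return x, res_norms, err_norms

# Example on a sampled Wishart system:
rng = np.random.default_rng(4)
n, d = 300, 0.2
m = int(np.ceil(n / d))
X = rng.standard_normal((n, m))
W = X @ X.T / m
b = rng.standard_normal(n)
b /= np.linalg.norm(b)
_, res, err = conjugate_gradient(W, b, np.zeros(n), 30)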

The error at step $k$ is given by $e_k = x - x_k$. Define the norm $\|v\|_A = \sqrt{v^* A v}$. A variational characterization of the CGA is that

$x_k = \operatorname*{argmin}_{y \in x_0 + \mathcal{K}_k(A, r_0)} \|x - y\|_A,$

where $\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, A r_0, \ldots, A^{k-1} r_0\}$. The unique minimizer in $x_0 + \mathcal{K}_k(A, r_0)$ can be described using the Lanczos algorithm. The Lanczos algorithm is a tridiagonalization algorithm given by

Algorithm 2: Lanczos Iteration. $b$ is the initial vector. Suppose $\|b\|_2 = 1$. Set $q_1 = b$, $q_0 = 0$, $\beta_0 = 0$. For $k = 1, 2, \ldots$:

  1. Compute $\alpha_k = \langle q_k, A q_k \rangle$.

  2. Set $v = A q_k - \beta_{k-1} q_{k-1} - \alpha_k q_k$.

  3. Compute $\beta_k = \|v\|_2$ and if $\beta_k \neq 0$, set $q_{k+1} = v / \beta_k$.

The Lanczos algorithm produces a tridiagonal matrix

$T = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{n-1} \\ & & \beta_{n-1} & \alpha_n \end{pmatrix}$

and the orthonormal vectors $q_1, q_2, \ldots$, with $T = Q^* A Q$, $Q = [\, q_1 ~ q_2 ~ \cdots \,]$, when the iteration runs to completion. We use $T_k$ to denote the upper-left $k \times k$ subblock of $T$. Then, it is well-known that

(2.5)

Then, write and

Now, consider the special case $x_0 = 0$ and $b = e_1$, so that $r_0 = e_1$. We further analyze the relation with the Householder procedure of Section 2. The Lanczos algorithm gives the matrix representation of $A$ in the orthonormal basis found by applying the Gram–Schmidt process to the sequence

$e_1, \; A e_1, \; A^2 e_1, \; \ldots.$

So, the first vector is $e_1$, and so the first column of $Q$ is $e_1$. The main consequence of this is that the first components of the eigenvectors of $T$ are the same as those of $A$ (this is true modulo permutations and normalizations).

Finally, we make the simple observation that the Householder bidiagonalization procedure [TBI97] applied to $X$, where $W = \frac{1}{m} X X^*$ (or Householder tridiagonalization applied to $W$), leaves the eigenvalues of $W$ unchanged and also leaves the first components of the eigenvectors unchanged. So, provided that Lanczos completes ($\beta_k \neq 0$ for $k < n$), the Householder bidiagonalization must produce $T$. This is indeed true because a Jacobi matrix is uniquely defined by its eigenvalues and the first components of its eigenvectors [DLT85].
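The correspondence described above can be checked numerically. The sketch below (illustrative only) compares the Jacobi matrix produced by Algorithm 2 with starting vector $e_1$ to the tridiagonal matrix produced by Householder reduction, here via scipy.linalg.hessenberg, whose output is tridiagonal up to rounding for symmetric input. The two agree up to the signs of the off-diagonal entries, provided the comparison is restricted to early iterations, before floating point loss of orthogonality in the Lanczos vectors sets in.

import numpy as np
from scipy.linalg import hessenberg

def lanczos(A, b, k):
    # Algorithm 2: return the k x k Jacobi (tridiagonal) matrix for (A, b).
    n = len(b)
    Q = np.zeros((n, k + 1))
    Q[:, 0] = b / np.linalg.norm(b)
    alpha, beta = np.zeros(k), np.zeros(k)
    for j in range(k):
        v = A @ Q[:, j]
        if j > 0:
            v -= beta[j - 1] * Q[:, j - 1]
        alpha[j] = Q[:, j] @ v
        v -= alpha[j] * Q[:, j]
        beta[j] = np.linalg.norm(v)
        if beta[j] > 0:
            Q[:, j + 1] = v / beta[j]
    return np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)

rng = np.random.default_rng(5)
n, d, k = 100, 0.2, 20
m = int(np.ceil(n / d))
X = rng.standard_normal((n, m))
W = X @ X.T / m

T_lan = lanczos(W, np.eye(n)[:, 0], k)        # Lanczos with b = e_1
T_hh = hessenberg(W)                          # Householder reduction of W

# Agreement up to the signs of the off-diagonal entries (and rounding):
print(np.max(np.abs(np.abs(T_lan) - np.abs(T_hh[:k, :k]))))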

3. Main results

The proof of our main theorem is given in Section 5. The convention used in this paper is that and are fixed constants. The symbols with an assortment of subscripts will be used to denote constants and their (possible) dependencies. We suppress any dependence of these constants on but include dependence on , with a view to forthcoming work where we will allow to vary.

Theorem 3.1.

Assume the conjugate gradient algorithm is applied to solve where is a (possibly) random vector, independent of and . Let , be the associated error vectors.

  1. For any fixed and there exists a constant such that

  2. Furthermore

    for some constants and a non-decreasing function that satisfies for .

  3. Lastly if and (1) and (2) hold.

Theorem 3.2.

In the setting of Theorem 3.1, for , and define

(3.1)

Then as , . Furthermore,

For , (3.1) holds if and for

Corollary 3.2.1.

In the setting of Theorem 3.1, for and and

If this holds for .

Proof.

The first claim follows from the Borel–Cantelli Lemma. The second follows from the observation

which gives

The last of our main results is almost just a corollary of the above theorems and it concerns halting times (i.e. runtimes or iteration counts, recall (1.1)):

Since converges almost surely to and converges almost surely to we produce the candidate limit halting times

Theorem 3.3.

In the setting of Theorem 3.1, , for suppose that for , , then

If , i.e , for then

Proof.

First assume for , and let . Note that if then

Then

We estimate these two terms individually. First,

For sufficiently large , by Lemma 4.6,

and for such a value of

by Theorem 3.1(2). The estimate for is analogous. ∎

Remark 3.4.

We conjecture that if , i.e , for then

Indeed Figure 5 indicates this is true because

appears to be asymptotically normal with a variance that decays like . We note that this is related to, but not a consequence of, the central limit theorem for linear spectral statistics (CLT for LSS). For the CLT for LSS the variance decays as . In the case at hand, the fluctuations that occur in the random weights (see in (5.1)) and the fluctuations that occur in the random polynomial contribute to the variance on the order of . This conjecture will be resolved in a forthcoming publication.


Figure 5. A demonstration that appears to be asymptotically normal as increases. The top row demonstrates this for the case and the bottom row demonstrates this for the case where with equal probability ( still has iid entries), i.e., the Bernoulli ensemble. Specifically, we plot histograms for against a standard Gaussian density (black).

4. Technical results from random matrix theory

Lemma 4.1.

Let be a chi-distributed random variable with degrees of freedom. Then for any fixed positive integers and there exists such that

Furthermore, for ,

For

Proof.

Because a chi-distributed random variable with $k$ degrees of freedom has the density

$f_k(x) = \frac{2^{1 - k/2}}{\Gamma(k/2)}\, x^{k-1} e^{-x^2/2}, \qquad x \ge 0,$

we are led to analyze

The result then follows by a change of variables and applying the method of steepest descent (Laplace's method) for integrals along with Stirling's approximation. For the second inequality one has to use

The last follows from

and . ∎
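A quick Monte Carlo illustration (not part of the proof) of the kind of concentration being quantified here: a chi-distributed random variable with $k$ degrees of freedom concentrates around $\sqrt{k}$, its standard deviation tending to $1/\sqrt{2}$, so that $\chi_k/\sqrt{k} \to 1$ at rate $k^{-1/2}$.

import numpy as np

rng = np.random.default_rng(6)
samples = 200_000

for k in (10, 100, 1000, 10000):
    chi_k = np.sqrt(rng.chisquare(k, samples))
    dev = chi_k - np.sqrt(k)
    # The mean deviation is O(k^{-1/2}) and the standard deviation tends to 1/sqrt(2).
    print(k, dev.mean(), dev.std())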

A good reference for the next classical result is [Ver18, Section 2.8].

Theorem 4.2 (Bernstein’s inequality for sub-exponential random variables).

Let be a sequence of independent mean zero random variables satisfying

Then for

and .
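For reference, the form of this inequality given in [Ver18, Theorem 2.8.1], transcribed here with its generic absolute constant $c > 0$ and the sub-exponential norm $\|\cdot\|_{\psi_1}$ (the statement above may normalize these quantities differently), is: if $X_1, \ldots, X_N$ are independent, mean-zero, sub-exponential random variables, then for every $t \ge 0$,

$\mathbb{P}\left( \left| \sum_{i=1}^{N} X_i \right| \ge t \right) \le 2 \exp\left( -c \, \min\left( \frac{t^2}{\sum_{i=1}^{N} \|X_i\|_{\psi_1}^2}, \; \frac{t}{\max_i \|X_i\|_{\psi_1}} \right) \right).$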

This theorem then gives the estimate on

(4.1)

We will use three elementary facts that are encapsulated in the following lemma.

Lemma 4.3.

Let be random variables and assume . Then the following inequalities hold

  1. where denotes the positive part and ,

  2. ,    and

  3. .

Lemma 4.4.

Suppose . Let be independent chi-distributed random variables with degrees of freedom. Define weights

Then the Kolmogorov–Smirnov distance

of

satisfies

and the tail estimate

(4.2)

for absolute constants .

Proof.

First, it follows that