The CGA is almost deterministic

# The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deteriministic

Percy Deift New York University  and  Thomas Trogdon University of California, Irvine
###### Abstract.

We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices where is and for . Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean and converge almost surely.

###### 2010 Mathematics Subject Classification:
Primary: 65F10, 60B20
We are grateful for discussions with Elliot Paquette, Joel Tropp and Roman Vershynin that have greatly improved the paper. This work was supported in part by NSF DMS-1300965 (PD) and NSF DMS-1753185 (TT)

## 1. Introduction

The conjugate gradient algorithm (CGA) [HS52] is arguably the most effective iterative method from numerical linear algebra. In exact arithmetic, the algorithm requires at most iterations to solve a positive-definite linear system and it often requires many less iterations to compute a good approximate solution. It is exceedingly simple to implement and there are well-known error bounds available. And, despite the fact that the CGA is sensitive to round-off errors these error bounds still effectively hold for floating point arithmetic [Gre89]. While we present the algorithm in full below (see Algorithm 2.1), the variational characterization of the method is summarized as follows: Consider the linear system , . Given an initial guess , find the unique vector that satisfies

 ∥x−xk∥W=miny∈Xk∥y−x∥W, Xk=x0+span{r0,Wr0,…,Wk−1r0},∥y∥2W=y∗Wy,r0=b−Wx0.

At each step of the iteration one can easily construct and the algorithm itself computes , . One has to then determine a computable stopping criterion, and typically, the algorithm is halted when , for a chosen error tolerance

Here we focus on two main measures of the error, :

 ∥ek∥Wand∥rk∥2=∥ek∥W2,rk(W,b)=rk=b−Axk,

when . The associated halting times are

 (1.1) t(1)ϵ(W,b)=min{k:∥ek∥W<ϵ},t(2)ϵ(W,b)=min{k:∥%rk∥2<ϵ}.

We emphasize the importance of analyzing both quantities because is what is observed throughout the iteration and, of course, is the true error.

While our results (Theorems 3.1, 3.2, 3.3) also hold for complex Gaussian matrices matrices111We can also easily extend the results to the case of quarternion entries., we state some consequences for real matrices for simplicity. Assume

 (1.2) W=XX∗/m,

where is an matrix whose entries are iid standard normal random variables. This is the real Wishart distribution. Suppose further that for . Then if b is a random unit vector, independent of as

 ∥rk(W,b)∥2almost surely⟶dk/2.

If then as

 ∥ek(W,b)∥Walmost surely⟶dk/2√1−d.

Furthermore, for almost every choice of , with fixed, we prove that

 limn→∞ P(t(1)ϵ(W,b)=⌈2logϵ+log(1−d)logd⌉)=1,ϵ<(1−d)−1, limn→∞ P(t(2)ϵ(W,b)=⌈2logϵlogd⌉)=1,ϵ<1.

Therefore, the halting time becomes effectively deterministic. We also present estimates that demonstrate that the probability that the errors deviate from their means decays exponentially with respect . In the case , a consequence of our results is that for any fixed and ,

 limn→∞P(t(2)ϵ(W,b)>k)=1.
###### Remark 1.1.

It is important to point out that in (1.2) is not necessarily a near-identity matrix. Indeed, the eigenvalues typically lie in the interval

 [(1−√d)2,(1+√d)2],

and have an asymptotic density given by the famed Marchenko–Pastur law, see Definition 2.1.

Our proofs make critical use of the invariance of the Wishart distribution and the relation between Householder bidiagonalization and the Lanczos iteration. This allows one to use classical estimate on chi-distributed random variables in a crucial way. The specific tools and results we incorporate from random matrix theory include global eigenvalue estimates [RV10], the convergence of the empirical spectral measure [BMP07] and the central limit theorem for linear statistics [LP09].

The remainder of the paper is setup as follows. In Section 1.1 we compare our analysis with facts already known about the conjugate gradient algorithm. We also demonstrate our results with numerical examples. In Section 2 we introduce our random matrix ensembles, the basic definitions from random matrix theory and review the Householder bidiagonalization procedure applied to these ensembles. We also review the connections between the conjugate gradient algorithm, the Lanczos iteration and the Householder bidiagonalization procedure. In Section 3 we present our main theorems. In Section 4 we introduce the results from probability and random matrix theory that are required to prove our theorems. In Section 5 we give the proofs of the theorems.

### 1.1. Comparison and demonstration

We now give a demonstration and discussion of the results. In what follows denotes the sample average of a random variable using samples. We will refer to the matrix

 W=XX∗/m

where is an matrix, having iid entries, with equal probability, as the Bernoulli ensemble.

#### 1.1.1. A numerical demonstration

To demonstrate our main results, in Figure 2 we plot the following quantities as a function of for different values of

 ⟨∥ek(W,b)∥W⟩anddk/2√1−d

with error bars that indicate where % of the samples lie. In Figure 1 we plot the same statistics for

 ⟨∥rk(W,b)∥2⟩% compared withdk/2.

Both Figure 2 and Figure 1 demonstrate the concentration of the errors about their means. We demonstrate the limiting behavior of the halting times in Figure 3.

In all of these figures we have included computations for distributions of random matrices that are beyond the class for which our results apply. Nonetheless, it is clear that the behavior persists. This universality will be investigated in future work. Figure 1. A demonstration that ∥rk∥2 concentrates strongly around is mean, which is nearly equal to dk/2 (solid). This plot is for d=0.2. The dots give the sample mean over 20,000 samples and the error bars give the symmetric interval where 99.9% of the samples lie. It is clear that this interval shrinks rapidly as n increases. Figure 2. A demonstration that ∥ek∥W concentrates strongly around is mean, which is nearly equal to dk/2/√1−d (solid). This plot is for d=0.2. The dots give the sample mean over 20,000 samples and the error bars give the symmetric interval where 99.9% of the samples lie. It is clear that this interval shrinks rapidly as n increases. Figure 3. A demonstration that t(2)ϵ becomes almost deterministic as n increases for d=0.2 and β=1 using 20,000 samples. Each panel gives the empirical halting time distribution for the indicated value of n. The histogram is extremely concentrated.

#### 1.1.2. Relation to previous work

The classical error estimate for the CGA is [HS52, Gre89]

 (1.3) ∥ek(W,b)∥W≤2⎡⎣(√κ−1√κ+1)k+(√κ−1√κ+1)−k⎤⎦−1∥e0(W,b)∥W,

where is the condition number of . Here are the eigenvalues of . It is a classical result in random matrix theory [BY93] that the condition number of (1.2) converges almost surely to . Roughly, one then obtains

 ∥ek(W,b)∥W≲2[dk/2+d−k/2]−1∥e0(W,b)∥W,

which is often just simplified to

 ∥ek(W,b)∥W≲2dk/2∥e0(W,b)∥W.

This overestimates the actual error by just a factor of 2.

In [MT16], the authors used (1.3) and tail bounds on the condition number to estimate the halting times (1.1) in the case . A key observation was that the actual number of iterations appears to be of the same asymptotic order as the estimate obtained using (1.1). This is something that will indeed be true if the error estimate used decays exponentially and turns out to be an overestimate by a constant factor.

###### Remark 1.2.

Of particular interest is this case where depends on and as . For example, was seen in [DMT16, DMOT14] to produce universal fluctuations for the halting times. Similarly, one would want to treat the case as

###### Remark 1.3.

Our calculations in this work apply only to matrices with Gaussian entries. An important question, one of universality, is if our results hold if this assumption is relaxed. Indeed, one expects this to be true by the computations in Figures 2 and 5 and the wealth of theoretical results from random matrix theory [PY14, BKYY16, BMP07].

## 2. The bidiagonalization of Wishart matrices and invariance

###### Definition 2.1.

For set . Let be an matrix of iid standard normal random variables () or where and are independent copies of an matrix of iid standard normal random variables (). Then

 (2.1) Wn,β,d:=1βmXX∗

has the -Wishart distribution. The associated empirical spectral measure (ESM) is given by

 μn,β,d=1nn∑j=1δλj(n,β,d)

where are the eigenvalues of . Define the averaged EMS (or density of states) by

 (2.2) ∫f(λ)Eμn,β,d(dλ):=E(∫f(λ)μn,β,d(dλ))

for every222 denote continuous functions with compact support in . .

###### Definition 2.2.

Marchenko–Pastur law on is given by the density

 ρMP,d(x)=12πd√|(d+−x)(x−d−)|x\mathbbm1[d−,d+](x),d±=(1±√d)2.

The relation of the Marchenko–Pastur law to the eigenvalues of a Wishart matrix is given in the following section. But we demonstrate this relationship in Figure 4. Figure 4. A demonstration of how the Marchenko–Pastur law relates to the spectrum of a Wishart matrix for n=400. A histogram for the eigenvalues of one sampled matrix closely approximates the Marchenko–Pastur density.

The Wishart matrix the distribution is invariant under orthogonal () or unitary () conjugation. Using , This means that if is a random unitary matrix that is independent of then

 UWn,β,dU∗dist.=Wn,β,d.

For the Householder bidiagionalization procedure [TBI97] operates on on the left and the right with Householder reflections , so that

 (2.3) [Hn,β,d0]: =RnRn−1⋯R1X~R1~R2⋅~Rn (2.4) =⎡⎢ ⎢ ⎢ ⎢ ⎢⎣ζ110⋯0ζ21ζ220⋯0⋱⋱ζn,n−1ζnn0⋯0⎤⎥ ⎥ ⎥ ⎥ ⎥⎦

where all entries are non-negative. Because of invariance, are independent -distributed random variables, see [DE02] and the references therein. Specifically,

 Hn,β,ddist.=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣χβmχβ(n−1)χβ(m−1)χβ(n−2)χβ(m−2)⋱⋱χβχβ(m−n+1)⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦

where all entries are independent. Define the infinite matrix by the entry-wise limit

 (Td)ij:=limn→∞1βmE[(Hn,β,dH∗n,β,d)ij],1≤i,j.

Therefore

 Td =HdH∗d, Hd =⎡⎢ ⎢ ⎢ ⎢ ⎢⎣1√d1√d1⋱⋱⎤⎥ ⎥ ⎥ ⎥ ⎥⎦.

Lastly, define to be the upper-left submatrix of .

### 2.1. Householder bidiagonalization, the Lanczos iteration and the CG algorithm

The conjugate gradient algorithm (CGA) for the iterative solution of

 Wx=b,W>0

is given by

Algorithm 1: Conjugate Gradient Algorithm is the initial guess. Set , . For Compute . Set . Set . Compute . Set .

The error at step is given by . Define the norm . A variational characterization of the CGA is that

 ∥ek∥W=∥p∗k(W)e0∥W=minpk∈P(0)k∥pk(W)e0∥W,

where . The unique minimizer in can be described the the Lanczos algorithm. The Lanczos algorithm is a tridiagonalization algorithm given by

Algorithm 2: Lanczos Iteration is the initial vector. Suppose Set For Compute . Set . Compute and if , set .

The Lanczos algorithm produces a tridiagonal matrix

 T=T(W,y1)=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣α1β1β1α2⋱⋱⋱βn−1βn−1αn⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦

and . We use to denote the upper-left subblock of . Then, it is well-known that

 (2.5) p∗k(λ)=det(Tk(W,r0∥r0∥)−λI)detTk(W,r0∥r0∥).

Then, write and

 ∥ek∥2Wℓ =∥Wℓ/2p∗k(W)e0∥2=∥Λℓ/2p∗k(Λ)U∗e0∥2=∥Λℓ/2p∗k(Λ)Λ−1U∗b0∥2 =n∑j=1λℓ−2jp∗k(λj)2ωj,ω=|U∗b0|2.

Now, consider the special case and , so that . We further analyze the relation with . The Lanczos algorithm gives the matrix representation of in the orthonormal basis found by applying the Gram–Schmidt process to the sequence

 y1,Wy1,…,Wn−1y1.

So, the first vector is , and so the first column of is . The main consequence of this is that that first components of the eigenvectors333This is true modulo permutations and normalizations. of are the same as those of .

Finally, we make a simple observation that the Householder bidiagonalization procedure [TBI97] applied to where (or Householder tridiagonalization applied to ) leaves the eigenvalues of unchanged and also leaves the first components of the eigenvectors unchanged. So, provided that Lanczos completes ( for ), the Householder bidiagonalization must produce . This is indeed true because a Jacobi matrix is uniquely defined by eigenvalues and first-components of eigenvectors [DLT85].

## 3. Main results

The proof of our main theorem is given in Section 5. The convention used in this paper is that and are fixed constants. The symbols with an assortment subscripts will be used to denoted constants and their (possible) dependencies. We suppress any dependence of these constants on but include dependence on , with a view to forthcoming work where we will allow to vary.

###### Theorem 3.1.

Assume the conjugate gradient algorithm is applied to solve where is a (possibly) random vector, independent of and . Let , be the associated error vectors.

1. For any fixed and there exists a constant such that

 E[∣∣∣∥ek∥2Wℓ−∫λℓ−2det(Tk,d−λI)2μMP,d(dλ)∣∣∣]≤Cℓ,d,klogn√n.
2. Furthermore

 P(∣∣∣∥ek∥2Wℓ−∫λℓ−2det(Tk,d−λI)2Eμn,β,d(dλ)∣∣∣>t)≤C′ℓ,d,kne−cℓ,d,knh(t),

for some constants and a non-decreasing function that satisfies for .

3. Lastly if and (1) and (2) hold.

###### Theorem 3.2.

In the setting of Theorem 3.1, for , and define

 (3.1) e2ℓ,k,d:=∫λℓ−2det(Tk,d−λI)2μMP(dλ).

Then as , . Furthermore,

 e21,k,d=dk1−d.

For , (3.1) holds if and for

 e22,k,d =dk, e23,k,d =dk{1+dk≥1,1k=0.
###### Corollary 3.2.1.

In the setting of Theorem 3.1, for and and

 ∥ek∥Wℓa.s.→eℓ,k,d, E[∣∣∥ek∥Wℓ−eℓ,k,d∣∣]≤Cℓ,d,klogn√n.

If this holds for .

###### Proof.

The first claim follows from the Borel–Cantelli Lemma. The second follows from the observation

 ∥ek∥2Wℓ−e2ℓ,k,d=(∥ek∥Wℓ−eℓ,k,d)(∥ek∥Wℓ+eℓ,k,d),

which gives

The last of our main results is almost just a corollary of the above theorems and it concerns halting times (i.e. runtimes or iteration counts, recall (1.1)):

 t(1)ϵ =t(1)ϵ(Wn,β,d,b)=min{k:∥e% k∥Wn,β,d<ϵ}, t(2)ϵ =t(2)ϵ(Wn,β,d,b)=min{k:∥r% k∥2<ϵ}.

Since converges almost surely to and converges almost surely to we produce the candidate limit halting times

 τ(1)ϵ(β,d) =⌈2logϵ+log(1−d)logd⌉,ϵ<(1−d)−1, τ(2)ϵ(β,d) =⌈2logϵlogd⌉,ϵ<1
###### Theorem 3.3.

In the setting of Theorem 3.1, , for suppose that for , , then

 limn→∞P(t(ℓ)ϵ(Wn,β,d,b)=τ(ℓ)ϵ(β,d))=1.

If , i.e , for then

 limn→∞P(t(ℓ)ϵ(Wn,β,d,b)=τ(ℓ)ϵ(β,d)   or   t(ℓ)ϵ(Wn,β,d,b)=τ(ℓ)ϵ(β,d)+1)=1.
###### Proof.

First assume for , and let . Note that if then

 e2ℓ,κ−1,d>ϵ2>e2ℓ,κ,d.

Then

 P(t(ℓ)ϵ≠τ(ℓ)ϵ(β,d))≤P(t(ℓ)ϵ<τ(ℓ)ϵ(β,d))+P(t(ℓ)ϵ>τ(ℓ)ϵ(β,d))

We estimate these two terms individually. First,

 P(t(ℓ)ϵ<τ(ℓ)ϵ(β,d)) =P(∥ek−1∥2Wℓ<ϵ2,k=τ(ℓ)ϵ(β,d))

For sufficiently large , by Lemma 4.6,

 ∣∣∣e2ℓ,k−1,d−∫λℓ−2det(Tk,d−λI)2Eμn,β,d(dλ)∣∣∣≤δ2

and for such a value of

 P(∥ek−1∥2Wℓ

by Theorem 3.1(2). The estimate for is analogous. ∎

###### Remark 3.4.

We conjecture that if , i.e , for then

 limn→∞P(t(ℓ)ϵ(Wn,β,d,b)=τ(ℓ)ϵ(β,d))=12=limn→∞P(t(ℓ)ϵ(Wn,β,d,b)=τ(ℓ)ϵ(β,d)+1).

Indeed Figure 5 indicates this is true because

 ∥rk∥22−dk

appears to be asymptotically normal with a variance that decays like . We note that this is related to, but not a consequence of, the central limit theorem for linear spectral statistics (CLT for LSS). For the CLT for LSS the variance decays as . In the case at hand, the fluctuations that occur in the random weights (see in (5.1)) and the fluctuations that occur in the random polynomial contribute to the variance on the order of . This conjecture will be resolved in a forthcoming publication. Figure 5. A demonstration that gk:=∥rk∥22−dk is appears asymptotically normal as n increases. The top row demonstrates this for case β=1 and the bottom row demonstrates this for the case where X11=±1 with equal probability (X still has iid entries), i.e. the Bernoulli Ensemble. Specifically, we plot histograms for gk/(⟨g2k⟩)1/2 against a standard Gaussian density (black).

## 4. Technical results from random matrix theory

###### Lemma 4.1.

Let be a chi distributed random variable with degrees of freedom. Then for any fixed positive integers and there exists such that

 E∣∣ ∣∣(χk√k)p−1∣∣ ∣∣2q≤Cq,pk−q.

Furthermore, for ,

 P(χk√k−1≥t)≤1+O(k−1)√2π∫∞√kte−x2/2dx.

For

 P(χk√k−1≤t)≤1+O(k−1)ϵ√2π∫∞√kte−x2/2dx.
###### Proof.

Because has a density given by

 xk−1e−x2/22k/2−1Γ(k/2)

we are led to analyze

 12k/2−1Γ(k/2)∫∞0∣∣ ∣∣(x√k)p−1∣∣ ∣∣2qxk−1e−x2/2dλ.

The result the follows by the change of variables and applying the method of steepest descent (Laplace’s method) for integrals along with Stirling’s approximation. For the second inequality one has to use

 (y+1)22−log(y+1)−1/2≥y22.

The last follows from

 12k/2−1Γ(k/2)∫√k(t+1)0xk−1e−x2/2dx ≤12k/2−1Γ(k/2)∫√k(t+1)0xk−1dx ≤1+O(k−1√2πkek/2(t+1)k,

and . ∎

A good reference for the next classical result is [Ver18, Section 2.8].

###### Theorem 4.2 (Bernstein’s inequality for sub-exponential random variables).

Let be a sequence of independent mean zero random variables satisfying

 ∥Xi∥ψ1:=inf{t>0:Eexp(|X|/t)≤2}.

Then for

and .

This theorem then gives the estimate on

 (4.1) P(∣∣∣χ2nβn−1∣∣∣≥t)≤2exp(−cnmin{β2t2K2,tK}).

We will use three elementary facts that are encapsulated in the following lemma.

###### Lemma 4.3.

Let be random variables and assume . The the following inequalities hold

1. where denotes the positive part and ,

2. ,    and

3. .

###### Lemma 4.4.

Suppose . Let be independent chi-distributed random variables with degrees of freedom. Define weights

 ωj=(χ(j)β)2∑k(χ(k)β)2.

Then the Kolmogorov–Smirnov distance

 dKS(μ,ν):=supx∈R|μ((−∞,x])−ν((−∞,x])|

of

 μn =∑jδλj,nωj,νn=1n∑jδλj,n,

satisfies

 E[dKS(μn,νn)] ≤Clogn√n,

and the tail estimate

 (4.2) P(dKS(μn,νn)≥t)≤C1ne−c1nβt2+C2ne−c2nβt

for absolute constants .

###### Proof.

First, it follows that

 dKS(μn,νn) =max