The conjugate gradient algorithm on wellconditioned Wishart matrices is almost deteriministic
Abstract.
We prove that the number of iterations required to solve a random positive definite linear system with the conjugate gradient algorithm is almost deterministic for large matrices. We treat the case of Wishart matrices where is and for . Precisely, we prove that for most choices of error tolerance, as the matrix increases in size, the probability that the iteration count deviates from an explicit deterministic value tends to zero. In addition, for a fixed iteration, we show that the norm of the error vector and the norm of the residual converge exponentially fast in probability, converge in mean and converge almost surely.
2010 Mathematics Subject Classification:
Primary: 65F10, 60B201. Introduction
The conjugate gradient algorithm (CGA) [HS52] is arguably the most effective iterative method from numerical linear algebra. In exact arithmetic, the algorithm requires at most iterations to solve a positivedefinite linear system and it often requires many less iterations to compute a good approximate solution. It is exceedingly simple to implement and there are wellknown error bounds available. And, despite the fact that the CGA is sensitive to roundoff errors these error bounds still effectively hold for floating point arithmetic [Gre89]. While we present the algorithm in full below (see Algorithm 2.1), the variational characterization of the method is summarized as follows: Consider the linear system , . Given an initial guess , find the unique vector that satisfies
At each step of the iteration one can easily construct and the algorithm itself computes , . One has to then determine a computable stopping criterion, and typically, the algorithm is halted when , for a chosen error tolerance
Here we focus on two main measures of the error, :
when . The associated halting times are
(1.1) 
We emphasize the importance of analyzing both quantities because is what is observed throughout the iteration and, of course, is the true error.
While our results (Theorems 3.1, 3.2, 3.3) also hold for complex Gaussian matrices matrices^{1}^{1}1We can also easily extend the results to the case of quarternion entries., we state some consequences for real matrices for simplicity. Assume
(1.2) 
where is an matrix whose entries are iid standard normal random variables. This is the real Wishart distribution. Suppose further that for . Then if b is a random unit vector, independent of as
If then as
Furthermore, for almost every choice of , with fixed, we prove that
Therefore, the halting time becomes effectively deterministic. We also present estimates that demonstrate that the probability that the errors deviate from their means decays exponentially with respect . In the case , a consequence of our results is that for any fixed and ,
Remark 1.1.
Our proofs make critical use of the invariance of the Wishart distribution and the relation between Householder bidiagonalization and the Lanczos iteration. This allows one to use classical estimate on chidistributed random variables in a crucial way. The specific tools and results we incorporate from random matrix theory include global eigenvalue estimates [RV10], the convergence of the empirical spectral measure [BMP07] and the central limit theorem for linear statistics [LP09].
The remainder of the paper is setup as follows. In Section 1.1 we compare our analysis with facts already known about the conjugate gradient algorithm. We also demonstrate our results with numerical examples. In Section 2 we introduce our random matrix ensembles, the basic definitions from random matrix theory and review the Householder bidiagonalization procedure applied to these ensembles. We also review the connections between the conjugate gradient algorithm, the Lanczos iteration and the Householder bidiagonalization procedure. In Section 3 we present our main theorems. In Section 4 we introduce the results from probability and random matrix theory that are required to prove our theorems. In Section 5 we give the proofs of the theorems.
1.1. Comparison and demonstration
We now give a demonstration and discussion of the results. In what follows denotes the sample average of a random variable using samples. We will refer to the matrix
where is an matrix, having iid entries, with equal probability, as the Bernoulli ensemble.
1.1.1. A numerical demonstration
To demonstrate our main results, in Figure 2 we plot the following quantities as a function of for different values of
with error bars that indicate where % of the samples lie. In Figure 1 we plot the same statistics for
Both Figure 2 and Figure 1 demonstrate the concentration of the errors about their means. We demonstrate the limiting behavior of the halting times in Figure 3.
In all of these figures we have included computations for distributions of random matrices that are beyond the class for which our results apply. Nonetheless, it is clear that the behavior persists. This universality will be investigated in future work.
1.1.2. Relation to previous work
The classical error estimate for the CGA is [HS52, Gre89]
(1.3) 
where is the condition number of . Here are the eigenvalues of . It is a classical result in random matrix theory [BY93] that the condition number of (1.2) converges almost surely to . Roughly, one then obtains
which is often just simplified to
This overestimates the actual error by just a factor of 2.
In [MT16], the authors used (1.3) and tail bounds on the condition number to estimate the halting times (1.1) in the case . A key observation was that the actual number of iterations appears to be of the same asymptotic order as the estimate obtained using (1.1). This is something that will indeed be true if the error estimate used decays exponentially and turns out to be an overestimate by a constant factor.
Remark 1.2.
Remark 1.3.
Our calculations in this work apply only to matrices with Gaussian entries. An important question, one of universality, is if our results hold if this assumption is relaxed. Indeed, one expects this to be true by the computations in Figures 2 and 5 and the wealth of theoretical results from random matrix theory [PY14, BKYY16, BMP07].
2. The bidiagonalization of Wishart matrices and invariance
Definition 2.1.
For set . Let be an matrix of iid standard normal random variables () or where and are independent copies of an matrix of iid standard normal random variables (). Then
(2.1) 
has the Wishart distribution. The associated empirical spectral measure (ESM) is given by
where are the eigenvalues of . Define the averaged EMS (or density of states) by
(2.2) 
for every^{2}^{2}2 denote continuous functions with compact support in . .
Definition 2.2.
Marchenko–Pastur law on is given by the density
The relation of the Marchenko–Pastur law to the eigenvalues of a Wishart matrix is given in the following section. But we demonstrate this relationship in Figure 4.
The Wishart matrix the distribution is invariant under orthogonal () or unitary () conjugation. Using , This means that if is a random unitary matrix that is independent of then
For the Householder bidiagionalization procedure [TBI97] operates on on the left and the right with Householder reflections , so that
(2.3)  
(2.4) 
where all entries are nonnegative. Because of invariance, are independent distributed random variables, see [DE02] and the references therein. Specifically,
where all entries are independent. Define the infinite matrix by the entrywise limit
Therefore
Lastly, define to be the upperleft submatrix of .
2.1. Householder bidiagonalization, the Lanczos iteration and the CG algorithm
The conjugate gradient algorithm (CGA) for the iterative solution of
is given by
Algorithm 1: Conjugate Gradient Algorithm is the initial guess. Set , . For Compute . Set . Set . Compute . Set .
The error at step is given by . Define the norm . A variational characterization of the CGA is that
where . The unique minimizer in can be described the the Lanczos algorithm. The Lanczos algorithm is a tridiagonalization algorithm given by
Algorithm 2: Lanczos Iteration is the initial vector. Suppose Set For Compute . Set . Compute and if , set .
The Lanczos algorithm produces a tridiagonal matrix
and . We use to denote the upperleft subblock of . Then, it is wellknown that
(2.5) 
Then, write and
Now, consider the special case and , so that . We further analyze the relation with . The Lanczos algorithm gives the matrix representation of in the orthonormal basis found by applying the Gram–Schmidt process to the sequence
So, the first vector is , and so the first column of is . The main consequence of this is that that first components of the eigenvectors^{3}^{3}3This is true modulo permutations and normalizations. of are the same as those of .
Finally, we make a simple observation that the Householder bidiagonalization procedure [TBI97] applied to where (or Householder tridiagonalization applied to ) leaves the eigenvalues of unchanged and also leaves the first components of the eigenvectors unchanged. So, provided that Lanczos completes ( for ), the Householder bidiagonalization must produce . This is indeed true because a Jacobi matrix is uniquely defined by eigenvalues and firstcomponents of eigenvectors [DLT85].
3. Main results
The proof of our main theorem is given in Section 5. The convention used in this paper is that and are fixed constants. The symbols with an assortment subscripts will be used to denoted constants and their (possible) dependencies. We suppress any dependence of these constants on but include dependence on , with a view to forthcoming work where we will allow to vary.
Theorem 3.1.
Assume the conjugate gradient algorithm is applied to solve where is a (possibly) random vector, independent of and . Let , be the associated error vectors.

For any fixed and there exists a constant such that

Furthermore
for some constants and a nondecreasing function that satisfies for .

Lastly if and (1) and (2) hold.
Theorem 3.2.
Corollary 3.2.1.
Proof.
The first claim follows from the Borel–Cantelli Lemma. The second follows from the observation
which gives
∎
The last of our main results is almost just a corollary of the above theorems and it concerns halting times (i.e. runtimes or iteration counts, recall (1.1)):
Since converges almost surely to and converges almost surely to we produce the candidate limit halting times
Theorem 3.3.
Proof.
Remark 3.4.
We conjecture that if , i.e , for then
Indeed Figure 5 indicates this is true because
appears to be asymptotically normal with a variance that decays like . We note that this is related to, but not a consequence of, the central limit theorem for linear spectral statistics (CLT for LSS). For the CLT for LSS the variance decays as . In the case at hand, the fluctuations that occur in the random weights (see in (5.1)) and the fluctuations that occur in the random polynomial contribute to the variance on the order of . This conjecture will be resolved in a forthcoming publication.
4. Technical results from random matrix theory
Lemma 4.1.
Let be a chi distributed random variable with degrees of freedom. Then for any fixed positive integers and there exists such that
Furthermore, for ,
For
Proof.
Because has a density given by
we are led to analyze
The result the follows by the change of variables and applying the method of steepest descent (Laplace’s method) for integrals along with Stirling’s approximation. For the second inequality one has to use
The last follows from
and . ∎
A good reference for the next classical result is [Ver18, Section 2.8].
Theorem 4.2 (Bernstein’s inequality for subexponential random variables).
Let be a sequence of independent mean zero random variables satisfying
Then for
and .
This theorem then gives the estimate on
(4.1) 
We will use three elementary facts that are encapsulated in the following lemma.
Lemma 4.3.
Let be random variables and assume . The the following inequalities hold

where denotes the positive part and ,

, and

.
Lemma 4.4.
Suppose . Let be independent chidistributed random variables with degrees of freedom. Define weights
Then the Kolmogorov–Smirnov distance
of
satisfies
and the tail estimate
(4.2) 
for absolute constants .
Proof.
First, it follows that