Eigenvectors of some large sample covariance matrix ensembles.
We consider sample covariance matrices where is a real or complex matrix with i.i.d. entries with finite moment and is a positive definite matrix. In addition we assume that the spectral measure of almost surely converges to some limiting probability distribution as and We quantify the relationship between sample and population eigenvectors by studying the asymptotics of functionals of the type where is the identity matrix, is a bounded function and is a complex number. This is then used to compute the asymptotically optimal bias correction for sample eigenvalues, paving the way for a new generation of improved estimators of the covariance matrix and its inverse.
1 Introduction and Overview of the Main Results
1.1 Model and results
Consider independent samples , all of which are real or complex vectors. In this paper, we are interested in the large--limiting spectral properties of the sample covariance matrix
when we assume that the sample size satisfies as for some This framework is known as large-dimensional asymptotics. Throughout the paper, denotes the indicator function of a set, and we make the following assumptions: where
is a matrix of real or complex iid random variables with zero mean, unit variance, and absolute central moment bounded by a constant independent of and ;
the population covariance matrix is a -dimensional random Hermitian positive definite matrix independent of ;
is a system of eigenvalues of , and the empirical spectral distribution (e.s.d.) of the population covariance given by converges a.s. to a nonrandom limit at every point of continuity of . defines a probability distribution function, whose support is included in the compact interval with .
The aim of this paper is to investigate the asymptotic properties of the eigenvectors of such sample covariance matrices. In particular, we will quantify how the eigenvectors of the sample covariance matrix deviate from those of the population covariance matrix under large-dimensional asymptotics. This will enable us to characterize how the sample covariance matrix deviates as a whole (i.e. through its eigenvalues and its eigenvectors) from the population covariance matrix. Specifically, we will introduce bias-correction formulae for the eigenvalues of the sample covariance matrix that can lead, in future research, to improved estimators of the covariance matrix and its inverse. This will be developped in the discussion (Sections 1.2 and 1.3) following our main result Theorem 1.2 stated below.
Before exposing our results, we briefly review some known results about the spectral properties of sample covariance matrices under large-dimensional asymptotics.
In the whole paper we denote by a system of eigenvalues and orthonormal eigenvectors of the sample covariance matrix . Without loss of generality, we assume that the eigenvalues are sorted in decreasing order: . We also denote by a system of orthonormal eigenvectors of . Superscripts will be omitted when no confusion is possible.
First the asymptotic behavior of the eigenvalues is now quite well understood. The “global behavior” of the spectrum of for instance is characterized through the e.s.d., defined as: , . The e.s.d. is usually described through its Stieltjes transform. We recall that the Stieltjes transform of a nondecreasing function is defined by for all in , where The use of the Stieltjes transform is motivated by the following inversion formula: given any nondecreasing function , one has that , which holds if is continuous at and .
The first fundamental result concerning the asymptotic global behavior of the spectrum has been obtained by Marčenko and Pastur in . Their result has been later precised e.g. in [4, 14, 16, 29, 30]. In the next Theorem, we recall their result (which was actually proved in a more general setting than that exposed here) and quote the most recent version as given in .
Let where denotes the identity matrix.
Theorem 1.1 ().
Under Assumptions to , one has that for all
Furthermore, the e.s.d. of the sample covariance matrix given by converges a.s. to the nonrandom limit at all points of continuity of .
In addition,  show that the following limit exists :
They also prove that has a continuous derivative which is given by on .
More precisely, when exists for all , has a continuous derivative on all of , and is identically equal to zero in a neighborhood of .
When , the proportion of sample eigenvalues equal to zero is asymptotically . In this case,
it is convenient to introduce the e.s.d. ,
which is the limit of e.s.d. of the -dimensional matrix .
Then exists for all , has a continuous derivative on all of , and is identically equal to zero in a neighborhood of .
When is exactly equal to one, further complications arise because the density of sample eigenvalues can be unbounded in a neighborhood of zero; for this reason we will sometimes have to rule out the possibility that .
Further studies have complemented the a.s. convergence established by the Marčenko-Pastur theorem (see e.g. [1, 5, 6, 7, 9, 15, 23] and  for more references). The Marčenko-Pastur equation has also generated a considerable amount of interest in statistics [13, 19], finance [18, 17], signal processing , and other disciplines. We refer the interested reader to the recent book by Bai and Silverstein  for a throrough survey of this fast-growing field of research.
As we can gather from this brief review of the literature, the Marčenko-Pastur equation reveals much of the behavior of the eigenvalues of sample covariance matrices under large-dimensional asymptotics.
It is also of utmost interest to describe the asymptotic behavior of the eigenvectors. Such an issue is fundamental to statistics (for instance both eigenvalues and eigenvectors are of interest in Principal Components Analysis), communication theory (see e.g. ), wireless communication, finance. The reader is referred to , Section 1 for more detail and to  for a statistical approach to the problem and a detailed exposition of statistical applications.
Actually much less is known about eigenvectors of sample covariance matrices. In the special case where and the are i.i.d. standard (real or complex) Gaussian random variables, it is well-known that the matrix of sample eigenvectors is Haar distributed (on the orthogonal or unitary group). To our knowledge, these are the only ensembles for which the distribution of the eigenvectors is explicitly known. It has been conjectured that for a wide class of non Gaussian ensembles, the matrix of sample eigenvectors should be “asymptotically Haar distributed”, provided . Note that the notion “asymptotically Haar distributed” needs to be defined. This question has been investigated by , ,  followed by  and . Therein a random matrix is said to be asymptotically Haar distributed if is asymptotically uniformly distributed on the unit sphere for any non random unit vector .  and  are then able to prove the conjecture under various sets of assumptions on the ’s.
In the case where , much less is known (see  and ). One expects that the distribution of the eigenvectors is far from being rotation-invariant. This is precisely the aspect in which this paper is concerned.
In this paper, we present another approach to study eigenvectors of sample covariance matrices. Roughly speaking, we study “functionals” of the type
where is any real-valued univariate function satisfying suitable regularity conditions. By convention, is the matrix with the same eigenvectors as and with eigenvalues . These functionals are generalizations of the Stieltjes transform used in the Marčenko-Pastur equation. Indeed, one can rewrite the Stieltjes transform of the e.s.d. as:
The constant that appears at the end of Equation (4) can be interpreted as a weighting scheme placed on the population eigenvectors: specifically, it represents a flat weighting scheme. The generalization we here introduce puts the spotlight on how the sample covariance matrix relates to the population covariance matrix, or even any function of the population covariance matrix.
Our main result is given in the following Theorem.
Assume that conditions are satisfied. Let be a (real-valued) bounded function defined on with finitely many points of discontinuity. Then there exists a nonrandom function defined over such that converges a.s. to for all . Furthermore, is given by:
One can first observe that as we move from a flat weighting scheme of to any arbitrary weighting scheme , the integration kernel remains unchanged. Therefore, our Equation (5) generalizes Marčenko and Pastur’s foundational result. Actually the proof of Theorem 1.2 follows from some of the arguments used in  to derive the Marchenko-Pastur equation. This proof is postponed until Section 2.
The generalization of the Marčenko-Pastur equation we propose allows to consider a few unsolved problems regarding the overall relationship between sample and population covariance matrices.
Let us consider two of these problems, which are investigated in more detail in the two next subsections.
The first of these questions is: how do the eigenvectors of the sample covariance matrix deviate from those of the population covariance matrix? By injecting functions of the form into Equation (5), we quantify the asymptotic relationship between sample and population eigenvectors. This is developed in more detail in Section 1.2.
Another question is: how does the sample covariance matrix deviate from the population covariance matrix as a whole, and how can we modify it to bring it closer to the population covariance matrix? This is an important question in Statistics, where a covariance matrix estimator that improves upon the sample covariance matrix is sought. By injecting the function into Equation (5), we find the optimal asymptotic bias correction for the eigenvalues of the sample covariance matrix in Section 1.3. We also perform the same calculation for the inverse covariance matrix (an object of great interest in Econometrics and Finance), this time by taking .
This list is not intended to be exhaustive. Other applications may hopefully be extracted from our generalized Marčenko-Pastur equation.
1.2 Sample vs. Population Eigenvectors
As will be made more apparent in Equation (8) below, it is possible to quantify the asymptotic behavior of sample eigenvectors in the general case by selecting a function of the form in Equation (5). Let us briefly explain why.
First of all, note that each sample eigenvector lies in a space whose dimension is growing towards infinity. Therefore, the only way to know “where” it lies is to project it onto a known orthonormal basis that will serve as a reference grid. Given the nature of the problem, the most meaningful choice for this reference grid is the orthonormal basis formed by the population eigenvectors . Thus we are faced with the task of characterizing the asymptotic behavior of for all , i.e. the projection of the sample eigenvectors onto the population eigenvectors. Yet as every eigenvector is identified up to multiplication by a scalar of modulus one, the argument (angle) of is devoid of mathematical relevance. Therefore, we can focus instead on its square modulus without loss of information.
Another issue that arises is that of scaling. Indeed as
we study instead, so that its limit does not vanish under large- asymptotics.
The indexing of the eigenvectors also demands special attention as the dimension goes to infinity. We choose to use an indexation system where “eigenvalues serve as labels for eigenvectors”, that is is the eigenvector associated to the largest eigenvalue .
All these considerations lead us to introduce the following key object:
This bivariate function is right continuous with left-hand limits and nondecreasing in each of its arguments. It also verifies and . Therefore, it satisfies the properties of a bivariate cumulative distribution function.
Our function can be compared with the object introduced in : where is a sequence of nonrandom unit vectors satisfying the non-trivial condition This condition is specified so that projecting the sample eigenvectors onto effectively wipes out any signature of non-rotation-invariant behavior. The main difference is that projects the sample eigenvectors onto the population eigenvectors instead.
From we can extract precise information about the sample eigenvectors. The average of the quantities of interest over the sample (resp. population) eigenvectors associated with the sample (resp. population) eigenvalues lying in the interval (resp. ) is equal to:
whenever the denominator is strictly positive. Since and (resp. and ) can be chosen arbitrarily close to each other (as long as the average in Equation (7) exists), our goal of characterizing the behavior of sample eigenvectors would be achieved in principle by determining the asymptotic behavior of . This can be deduced from Theorem 1.2 thanks to the inversion formula for the Stieltjes transform: for all such that is continuous at
which holds in the special case where . We are now ready to state our second main result.
Assume that conditions hold true and let be defined by (6). Then there exists a nonrandom bivariate function such that at all points of continuity of . Furthermore, when , the function can be expressed as: where
and (resp. ) is the real (resp. imaginary) part of .
Equation (9) quantifies how the eigenvectors of the sample covariance matrix deviate from those of the population covariance matrix under large-dimensional asymptotics. The result is explicit as a function of .
To illustrate Theorem 1.3, we can pick any eigenvector of our choosing, for example the one that corresponds to the first (i.e. largest) eigenvalue, and plot how it projects onto the population eigenvectors (indexed by their corresponding eigenvalues). The resulting graph is shown in Figure 1.
This is a plot of as a function of , for fixed equal to the supremum of . It is the asymptotic equivalent to plotting as a function of . It looks like a density because, by construction, it must integrate to one. As soon as the sample size starts to drop below times the number of variables, we can see that the first sample eigenvector starts deviating quite strongly from the first population eigenvectors. This should have precautionary implications for Principal Component Analysis (PCA), where the number of variables is often so large that it is difficult to make the sample size more than ten times bigger.
Obviously, Equation (9) would enable us to draw a similar graph for any sample eigenvector (not just the first one), and for any and verifying the assumptions of Theorem 1.3. Preliminary investigations reveal some unexpected patterns. For example: one might have thought that the sample eigenvector associated with the median sample eigenvalue would be closest to the population eigenvector associated with the median population eigenvalue; but in general this is not true.
1.3 Asymptotically Optimal Bias Correction for the Sample Eigenvalues
We now bring the two preceding results together to quantify the relationship between the sample covariance matrix and the population covariance matrix as a whole. As will be made clear in Equation (12) below, this is achieved by selecting the function in Equation (5). The objective is to see how the sample covariance matrix deviates from the population covariance matrix, and how we can modify it to bring it closer to the population covariance matrix. The main problem with the sample covariance matrix is that its eigenvalues are too dispersed: the smallest ones are biased downwards, and the largest ones upwards. This is most easily visualized when the population covariance matrix is the identity, in which case the limiting spectral e.s.d. is known in closed form (see Figure 2).
We can see that the smallest and the largest sample eigenvalues are biased away from one, and that the bias decreases in . Therefore, a key concern in multivariate statistics is to find the asymptotically optimal bias correction for the eigenvalues of the sample covariance matrix. As this correction will tend to reduce the dispersion of the eigenvalues, it is often called a shrinkage formula.
Ledoit and Wolf  made some progress along this direction by finding the optimal linear shrinkage formula for the sample eigenvalues (projecting on the two-dimensional subspace spanned by and ). However, shrinking the eigenvalues is a highly nonlinear problem (as Figure 3 below will illustrate). Therefore, there is strong reason to believe that finding the optimal nonlinear shrinkage formula for the sample eigenvalues would lead to a covariance matrix estimator that further improves upon the Ledoit-Wolf estimator. Theorem 1.2 paves the way for such a development.
To see how, let us think of the problem of estimating in general terms. In order to construct an estimator of , we must in turn consider what the eigenvectors and the eigenvalues of this estimator should be. Let us consider the eigenvectors first. In the general case where we have no prior information about the orientation of the population eigenvectors, it is reasonable to require that the estimation procedure be invariant with respect to rotation by any -dimensional orthogonal matrix . If we rotate the variables by , then we would ask our estimator to also rotate by the same orthogonal matrix . The class of orthogonally invariant estimators of the covariance matrix is constituted of all the estimators that have the same eigenvectors as the sample covariance matrix (see , Lemma 5.3). Every rotation-invariant estimator of is thus of the form:
and where is the matrix whose column is the sample eigenvector . This is the class that we consider.
Our objective is to find the matrix in this class that is closest to the population covariance matrix. In order to measure distance, we choose the Frobenius norm, defined as: for any matrix . Thus we end up with the following optimization problem: . Elementary matrix algebra shows that its solution is:
The interpretation of is that it captures how the sample eigenvector relates to the population covariance matrix as a whole.
While does not constitute a bona fide estimator (because it depends on the unobservable ), new estimators that seek to improve upon the existing ones will need to get as close to as possible. This is exactly the path that led Ledoit and Wolf  to their improved covariance matrix estimator. Therefore, it is important, in the interest of developing a new and improved estimator, to characterize the asymptotic behavior of . The key object that will enable us to achieve this goal is the nondecreasing function defined by:
When all the sample eigenvalues are distinct, it is straightforward to recover the ’s from :
The asymptotic behavior of can be deduced from Theorem 1.2 in the special case where : for all such that continuous at
We are now ready to state our third main result.
Assume that conditions hold true and let be defined as in (10). There exists a nonrandom function defined over such that converges a.s. to for all . If in addition , then can be expressed as: , where
By Equation (11) the asymptotic quantity that corresponds to is , provided that corresponds to . Therefore, the way to get closest to the population covariance matrix (according to the Frobenius norm) would be to divide each sample eigenvalue by the correction factor . This is what we call the optimal nonlinear shrinkage formula or asymptotically optimal bias correction.111This approach cannot possibly generate a consistent estimator of the population covariance matrix according to the Frobenius norm when is finite. At best, it could generate a consistent estimator of the projection of the population covariance matrix onto the space of matrices that have the same eigenvectors as the sample covariance matrix. Figure 3 shows how much it differs from Ledoit and Wolf’s  optimal linear shrinkage formula.
In addition, when , the sample eigenvalues equal to zero need to be replaced by .
In a statistical context of estimation, and are not known, so they need to be replaced by and respectively, where is some estimator of the limiting p.d.f. of sample eigenvalues. Research is currently underway to prove that a covariance matrix estimator constructed in this manner has desirable properties under large-dimensional asymptotics.
A recent paper  introduced an algorithm for deducing the population eigenvalues from the sample eigenvalues using the Marčenko-Pastur equation. But our objective is quite different, as it is not the population eigenvalues that we seek, but instead the quantities , which represent the diagonal entries of the orthogonal projection (according to the Frobenius norm) of the population covariance matrix onto the space of matrices that have the same eigenvectors as the sample covariance matrix. Therefore the algorithm in  is better suited for estimating the population eigenvalues themselves, whereas our approach is better suited for estimating the population covariance matrix as a whole.
Monte-Carlo simulations indicate that applying this bias correction is highly beneficial, even in small samples. We ran 10,000 simulations based on the distribution of population eigenvalues that places mass at , mass at and mass at . We kept constant at 2 while increasing the number of variables from 5 to 100. For each set of simulations, we computed the Percentage Relative Improvement in Average Loss (PRIAL). The PRIAL of an estimator of is defined as
By construction, the PRIAL of the sample covariance matrix (resp. of ) is (resp. ), meaning no improvement (resp. meaning maximum attainable improvement). For each of the 10,000 Monte-Carlo simulations, we consider , which is the matrix obtained from the sample covariance matrix by keeping its eigenvectors and dividing its eigenvalue by the correction factor . The expected loss is estimated by computing its average across the 10,000 Monte-Carlo simulations. Figure 4 plots the PRIAL obtained in this way, that is by applying the optimal nonlinear shrinkage formula to the sample eigenvalues. We can see that, even with a modest sample size like , we already get of the maximum possible improvement.
A similar formula can be obtained for the purpose of estimating the inverse of the population covariance matrix. To this aim, we set in Equation (5) and define
Assume that conditions are satisfied. There exists a nonrandom function defined over , such that converges a.s. to for all . If in addition , then can be expressed as: , where
Therefore, the way to get closest to the inverse of the population covariance matrix (according to the Frobenius norm) would be to multiply the inverse of each sample eigenvalue by the correction factor . This represents the optimal nonlinear shrinkage formula (or asymptotically optimal bias correction) for the purpose of estimating the inverse covariance matrix. Again, in a statistical context of estimation, the unknown needs to be replaced by , where is some estimator of the limiting p.d.f. of sample eigenvalues. This question is investigated in some work under progress.
2 Proof of Theorem 1.2
The proof of Theorem 1.2 follows from an extension of the usual proof of the Marčenko-Pastur theorem (see e.g.  and ). The latter is based on the Stieltjes transform and, essentially, on a recursion formula. First, we slightly modify this proof to consider more general functionals for some polynomial functions . Then we use a standard approximation scheme to extend Theorem 1.2 to more general functions .
First we need to adapt a Lemma from Bai and Silverstein .
Let be a random vector with i.i.d. entries satisfying:
where the constant does not depend on . Let also be a given matrix. Then there exists a constant independent of , and such that:
Proof of Lemma 2.1
Next, we need to introduce some notation. We set and define for all and integer . Thus, if we take . In particular, . To avoid confusion, the dependency of most of the variables on will occasionally be dropped from the notation. All convergence statements will be as . Conditions are assumed to hold throughout.
One has that where:
Proof of Lemma 2.2
In the first part of the proof, we show that
Using the a.s. convergence of the Stieltjes transform , it is then easy to deduce the equation satisfied by in Lemma 2.2. Our proof closely follows some of the ideas of  and . Therein the convergence of the Stieltjes transform is investigated.
Let us define where is the column of . Then Using the identity , one deduces that
Define now for any integer
By the resolvent identity we deduce that
which finally gives that
Plugging the latter formula into (15), one can write that
as Furthermore, using Lemma 2.6 in Silverstein and Bai (1995), one also has that
where the error term is given by with
We will now use (18) and (17) to show that a.s. converges to as It is known that converges a.s. to the distribution given by the Marčenko-Pastur equation (and has no subsequence vaguely convergent to 0). It is proven in Silverstein and Bai (1995) that under these assumptions, there exists such that In particular, there exists such that
From this, we deduce that
Using (18) we also get that
We first consider . Thus one has that
We now turn to . Using the a.s. convergence (17), it is not hard to deduce that
This completes the proof of Lemma 2.2.
For every the limit exists and satisfies the recursion equation
Proof of Lemma 2.3
The proof is inductive, so we assume that formula (21) holds for any integer smaller than or equal to for some given integer . We start from the formula
Using once more the resolvent identity, one gets that
which yields that