Number of relevant directions in Principal Component Analysis and Wishart random matrices.
We compute analytically, for large $N$, the probability that an $N\times N$ Wishart random matrix has $N_+$ eigenvalues exceeding a threshold $\zeta N$, including its large deviation tails. This probability plays a benchmark role when performing the Principal Component Analysis of a large empirical dataset. We find that $P(N_+ = \kappa N, N) \approx \exp[-\beta N^2 \psi_\zeta(\kappa)]$, where $\beta$ is the Dyson index of the ensemble and $\psi_\zeta(\kappa)$ is a rate function that we compute explicitly in the full range $0 \le \kappa \le 1$ and for any $\zeta$. The rate function displays a quadratic behavior modulated by a logarithmic singularity close to its minimum $\kappa^*(\zeta)$. This is shown to be a consequence of a phase transition in an associated Coulomb gas problem. The variance of the number of relevant components is also shown to grow universally (independent of $\zeta$) as $\Delta^2(N) \simeq \ln N/(\beta \pi^2)$ for large $N$.
PACS numbers: 02.50.-r, 02.10.Yn, 24.60.-k
Large datasets, such as financial or climate data, typically contain a mixture of relevant fluctuations responsible for hidden inherent patterns and irrelevant minor fluctuations. Compressing the data by filtering out the irrelevant fluctuations, while retaining the relevant ones, is crucial for many practical applications. The most commonly used technique for such data compression is "Principal Component Analysis" (PCA), with an impressive list of applications including image processing [, , ], biological microarrays [, ], population genetics [, ], finance [, ], meteorology and oceanography []. The basic idea of PCA is rather simple. Consider for instance an experiment where a set of $N$ observables is recorded $M$ times. The collected data can be arranged into a rectangular $M\times N$ matrix $X$ and adjusted to have zero mean. For example, $X_{ij}$ might represent the examination mark of the $i$-th student ($i = 1, \dots, M$) in the $j$-th subject (physics, mathematics, chemistry, etc.), or the recorded temperature of the $j$-th largest American city on the $i$-th measurement day. The product $N\times N$ symmetric square matrix $W = X^\dagger X$ represents the empirical (unnormalized) covariance matrix of the data that encodes all correlations.
In PCA, one collects the eigenvalues and eigenvectors of $W$. The eigenvector (principal component) associated with the largest eigenvalue gives the direction along which the data are maximally scattered and correlated. The scatter and correlations progressively reduce as one considers lower and lower eigenvalues. Next, one retains only the top $K$ eigenvalues (out of $N$) and their corresponding eigenvectors, while discarding all the lower eigenvalues and eigenvectors. Subsequently, the data can be re-expressed in a new basis with the following desirable features: (i) the dimension of the new basis is smaller (i.e. redundant information has been wiped out), and (ii) the total variance of the data captured along the new eigendirections is larger (i.e. only the most important patterns present in the initial data have been retained).
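As an aside, the recipe just described is easy to sketch numerically. The snippet below is a minimal illustration (not taken from this Letter; the synthetic data, the sizes and the choice $K = 2$ are arbitrary): it builds $W = X^T X$ for a real dataset with a planted low-rank structure, diagonalizes it, and retains the top $K$ components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset: M = 500 records of N = 6 observables, built so that a
# rank-2 structure dominates over small independent noise.
M, N = 500, 6
base = rng.standard_normal((M, 2))
X = np.hstack([base @ rng.standard_normal((2, N - 2)),
               0.1 * rng.standard_normal((M, 2))])
X -= X.mean(axis=0)               # adjust the data to have zero mean

W = X.T @ X                       # empirical (unnormalized) covariance matrix
evals, evecs = np.linalg.eigh(W)  # eigh: W is symmetric
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort by decreasing eigenvalue

K = 2                             # retain the top K principal components
Y = X @ evecs[:, :K]              # data re-expressed in the reduced basis

# Fraction of the total variance captured by the top K eigendirections.
captured = evals[:K].sum() / evals.sum()
```

With the planted rank-2 structure, the two leading eigendirections capture almost all of the variance, which is exactly the situation in which truncating the basis is justified.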
While this procedure is well defined in principle, one immediately faces a practical problem, namely: what is the optimal number $K$ of eigenvalues to retain in a given data set? Unfortunately, a sound prescription for the choice of $K$ is generally lacking, and one has to resort to one of the ad hoc stopping criteria available in the literature []. For instance, in the commonly used Kaiser-Guttman type rule [] one keeps the eigenvalues greater than a threshold value $\zeta$, where the threshold may represent, e.g., the average variance of the empirical dataset.
However, this simple operational rule has attracted some criticism [], since randomly generated data (with no underlying pattern whatsoever) may also produce eigenvalues exceeding any empirical threshold, just by statistical fluctuations. It is thus compelling to define a statistical criterion able to discriminate whether the eigenvalues observed above a chosen threshold capture inherent patterns of the data or merely reflect random noise.
A Bayesian approach provides a possible answer to this problem []. Suppose we observe a number $K$ of eigenvalues of $W$ exceeding a fixed threshold decided by some stopping rule. We need to estimate the probability that the data are extracted from a certain model $\mathcal{M}$ (out of a set of possible models), given the observed event $K$. According to Bayes' rule:

$P(\mathcal{M} \mid K) = \dfrac{P(K \mid \mathcal{M})\, P(\mathcal{M})}{\sum_{\mathcal{M}'} P(K \mid \mathcal{M}')\, P(\mathcal{M}')} \qquad (1)$
where $P(\mathcal{M})$ is the prior, i.e. the probability that the data come from model $\mathcal{M}$ without the knowledge of the observed event $K$, while $P(K \mid \mathcal{M})$ is called the likelihood of drawing the value $K$ if the data are taken from the model $\mathcal{M}$. Assuming that one has some a priori knowledge of the priors, the most crucial ingredient in Bayesian analysis is the estimate of the likelihood $P(K \mid \mathcal{M})$ for a given model $\mathcal{M}$. The most natural 'null' model to compare data with, in our context, is the random model where the 'data' $X$ are replaced by pure noise, say from a Gaussian distribution with zero mean and unit variance. In that case, the corresponding covariance matrix $W = X^\dagger X$ is called the Wishart matrix []. Thus, the natural and important question that one faces is: what is the likelihood of $K$ associated with the Wishart matrix model?
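For intuition, this null likelihood can be estimated by brute-force sampling even before any analytics. The sketch below is illustrative only (the sizes $N = M = 10$, the threshold $\zeta = 1$ and the sample count are arbitrary choices): it builds real ($\beta = 1$) Wishart matrices from pure noise and histograms the number $K$ of eigenvalues exceeding $\zeta N$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Null model: the M x N data matrix X is pure Gaussian noise, so that
# W = X^T X is a real (beta = 1) Wishart matrix.
M = N = 10
zeta = 1.0
samples = 5000

counts = np.zeros(N + 1)
for _ in range(samples):
    X = rng.standard_normal((M, N))
    K = int(np.sum(np.linalg.eigvalsh(X.T @ X) > zeta * N))
    counts[K] += 1

likelihood = counts / samples          # empirical estimate of P(K | noise)
mean_K = np.dot(np.arange(N + 1), likelihood)
```

Even at this small size, the empirical distribution is sharply peaked around a mean of roughly $0.4\,N$, foreshadowing the large-$N$ result derived below.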
In this Letter, by suitably adapting a Coulomb gas method recently used [, ] to compute the distribution of the number of positive eigenvalues of Gaussian random matrices, we are able to compute analytically this likelihood for arbitrary $K$ and large $N$. To make the $N$ dependence explicit, henceforth we use the notation $N_+$ in place of $K$; $P(N_+, N)$ then denotes the full probability distribution of the number $N_+$ of eigenvalues of an $N\times N$ Wishart matrix exceeding a fixed threshold. In addition to serving as a crucial benchmark for the Bayesian analysis of the number of retained components in PCA, we show that the distribution $P(N_+, N)$ also has a very rich and beautiful structure.
Let us first summarize our main results. We consider Wishart matrices $W = X^\dagger X$, where $X$ is in general a rectangular $M\times N$ matrix ($M \ge N$) with independent entries (real, complex or quaternion, labelled by the Dyson index $\beta = 1, 2, 4$ respectively) drawn from a standard Gaussian distribution of zero mean and unit variance. We focus, for convenience, on the case when $M - N \sim O(1)$. The matrix $W$ has $N$ non-negative random eigenvalues. We compute the distribution $P(N_+, N)$ for large $N$, where $N_+$ is the number of eigenvalues of $W$ exceeding the threshold $\zeta N$. Setting $N_+ = \kappa N$, we show that $P(\kappa N, N)$ behaves for large $N$ as []

$P(N_+ = \kappa N, N) \approx \exp\left[-\beta N^2 \psi_\zeta(\kappa)\right] \qquad (2)$
where the rate function $\psi_\zeta(\kappa)$ (independent of $N$) is computed analytically for arbitrary $0 \le \kappa \le 1$ (see Eq. (18)) and plotted in Fig. 2 for a representative value of $\zeta$. The rate function has a minimum (zero) at the critical value $\kappa = \kappa^*(\zeta)$, the fraction of the Marčenko-Pastur density lying beyond $\zeta$ (see below). Thus the distribution is peaked around $N_+ = \kappa^*(\zeta) N$, which is precisely its mean value $\langle N_+ \rangle$. Around this critical value, the rate function is non-analytic and displays a quadratic behavior modulated by a universal ($\zeta$-independent) logarithmic singularity

$\psi_\zeta(\kappa) \simeq \dfrac{\pi^2}{2}\, \dfrac{(\kappa - \kappa^*)^2}{\ln\left[1/|\kappa - \kappa^*|\right]} \qquad (3)$
The physical origin of this non-analytic behavior is traced back to a phase transition in the associated Coulomb gas problem when $\kappa$ crosses the critical value $\kappa^*(\zeta)$. Inserting this expression in (2), one finds that, for small fluctuations of $N_+$ around its mean on a scale $\sqrt{\ln N}$, the distribution has a Gaussian form, $P(N_+, N) \sim \exp\left[-(N_+ - \langle N_+ \rangle)^2 / 2\Delta^2(N)\right]$, where the variance

$\Delta^2(N) \simeq \dfrac{\ln N}{\beta \pi^2} \qquad (4)$

This leading behavior is also universal, i.e., independent of $\zeta$, and is the same as the variance of the number of positive eigenvalues in the Gaussian ensembles [, ]. We thus conclude that on a scale $|N_+ - \langle N_+ \rangle| \sim O(\sqrt{\ln N})$, the distribution is Gaussian with mean $\langle N_+ \rangle = \kappa^*(\zeta) N$ and variance in (4), but with non-Gaussian large deviation tails that are described by the general formulae (2) and (18).
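Both statements (mean $\approx \kappa^*(\zeta) N$ and an anomalously small, $O(\ln N)$ variance) can be probed by direct sampling. The sketch below is illustrative ($\beta = 1$, $\zeta = 1$, and all sizes and sample counts are arbitrary choices); it compares the measured variance of $N_+$ with the much larger value $\approx N\kappa^*(1-\kappa^*)$ that $N$ independent eigenvalues would produce.

```python
import numpy as np

rng = np.random.default_rng(2)

# Real (beta = 1) Wishart matrices W = X^T X with M = N, threshold zeta = 1.
N = 100
zeta = 1.0
samples = 400

# kappa*(1) = Int_1^4 f_MP(x) dx = 2/3 - sqrt(3)/(2 pi), computed from the
# Marcenko-Pastur density f_MP(x) = sqrt((4 - x)/x)/(2 pi).
kappa_star = 2.0 / 3.0 - np.sqrt(3.0) / (2.0 * np.pi)

n_plus = np.empty(samples)
for s in range(samples):
    X = rng.standard_normal((N, N))
    n_plus[s] = np.sum(np.linalg.eigvalsh(X.T @ X) > zeta * N)

mean_frac = n_plus.mean() / N      # should be close to kappa*(1) ~ 0.391
var_measured = n_plus.var()        # O(ln N): strongly suppressed fluctuations
var_binomial = N * kappa_star * (1.0 - kappa_star)   # ~ 24: what independent
                                                     # eigenvalues would give
```

The suppression (a measured variance of order one against a binomial value of order $N$) is the fingerprint of the logarithmic law: eigenvalue repulsion makes the spectrum far more rigid than a collection of independent points.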
We start by recalling some well known spectral properties of Wishart matrices (also known as the Laguerre or chiral ensemble) defined earlier. The non-negative eigenvalues $\lambda_1, \dots, \lambda_N$ of the Wishart matrix $W$ are distributed via the joint probability density law []

$P(\lambda_1, \dots, \lambda_N) = B_N \prod_i \lambda_i^{\frac{\beta}{2}(1 + M - N) - 1} e^{-\frac{\beta}{2}\lambda_i} \prod_{j<k} |\lambda_j - \lambda_k|^{\beta} \qquad (5)$

which can be cast in the Boltzmann form

$P(\lambda_1, \dots, \lambda_N) = B_N \exp\left[-\tfrac{\beta}{2} E[\{\lambda_i\}]\right] \qquad (6)$

where the energy of a configuration is

$E[\{\lambda_i\}] = \sum_i \lambda_i - \alpha \sum_i \ln \lambda_i - \sum_{j \neq k} \ln|\lambda_j - \lambda_k|, \qquad \alpha = (1 + M - N) - \tfrac{2}{\beta} \qquad (7)$
Here, $\beta$ stands for the inverse temperature and for simplicity we focus here on almost-square matrices with $M - N \sim O(1)$ (and thus $\alpha \sim O(1)$) for large $N$. This thermodynamical analogy, originally due to Dyson [], allows one to treat the system of eigenvalues as a Coulomb gas, i.e., a fluid of charged particles confined to the positive half-line by two competing interactions: the external linear + logarithmic potential (the first two terms) in (7) tends to push the charges towards the origin, while the third term, representing mutual logarithmic repulsion between any pair of charges, tends to spread them apart. It is useful first to estimate the typical scale of an eigenvalue for large $N$. The first two terms typically scale for large $N$ as $N\lambda$ and $N\ln\lambda$, respectively. In contrast, the pairwise Coulomb repulsion (the third term) typically scales as $N^2$ for large $N$. Balancing the first and the third term indicates $\lambda \sim O(N)$ for large $N$. Consequently, as long as $M - N \sim O(1)$, the second term becomes smaller ($O(N\ln N)$) compared to the other two terms ($O(N^2)$) and hence can be neglected. One then expects the average spectral density (normalized to unity) to have the scaling form for large $N$: $\rho_N(\lambda) = \frac{1}{N} f_{MP}(\lambda/N)$. The scaling function $f_{MP}(x) = \frac{1}{2\pi}\sqrt{\frac{4-x}{x}}$ is the celebrated Marčenko-Pastur (MP) distribution on the compact support $x \in [0, 4]$ [].
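Both the $\lambda \sim O(N)$ scaling and the MP form of the density are easy to verify on a single large sample. The sketch below is illustrative ($N = 400$ and $\beta = 1$ are arbitrary choices): it compares the empirical cumulative distribution of the rescaled eigenvalues with the integral of $f_{MP}$.

```python
import numpy as np

rng = np.random.default_rng(3)

# One large real (beta = 1) Wishart matrix with M = N: the rescaled
# eigenvalues x = lambda / N should follow the Marcenko-Pastur density
# f_MP(x) = sqrt((4 - x)/x) / (2 pi) supported on [0, 4].
N = 400
X = rng.standard_normal((N, N))
x = np.linalg.eigvalsh(X.T @ X) / N

mean_x = x.mean()   # first moment of f_MP is 1 (since E[tr W] = N^2)
x_max = x.max()     # should not exceed the edge at x = 4 by much

# Closed-form MP cumulative distribution: with theta = arcsin(sqrt(t)/2),
# Int_0^t f_MP(x) dx = (2 theta + sin(2 theta)) / pi.
t = np.linspace(0.2, 3.8, 50)
th = np.arcsin(np.sqrt(t) / 2.0)
cdf_mp = (2.0 * th + np.sin(2.0 * th)) / np.pi
cdf_emp = np.searchsorted(np.sort(x), t) / N
max_err = np.abs(cdf_emp - cdf_mp).max()
```

Already at $N = 400$ the empirical counting function tracks the MP law to within a couple of percent across the whole support.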
Given the joint distribution in (5), our goal is to compute the statistics of $N_+$, denoting the number of eigenvalues greater than a fixed threshold value $\zeta N$, i.e., $N_+ = \sum_{i=1}^N \theta(\lambda_i - \zeta N)$, where $\theta(z)$ is the Heaviside step function. The average value of $N_+$ can be easily estimated from the MP spectral density, $\langle N_+ \rangle = \kappa^*(\zeta) N$ where $\kappa^*(\zeta) = \int_\zeta^4 f_{MP}(x)\,dx$. The full probability distribution of $N_+$ can be written as the $N$-fold integral

$P(N_+, N) = B_N \int_0^\infty \!\!\cdots\!\int_0^\infty \prod_i d\lambda_i\; \delta\Big(N_+ - \sum_i \theta(\lambda_i - \zeta N)\Big)\, e^{-\frac{\beta}{2} E[\{\lambda_i\}]} \qquad (8)$
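The mean fraction $\kappa^*(\zeta)$ has a simple closed form, obtained from the substitution $x = 4\sin^2\theta$ in the MP integral. The sketch below cross-checks the closed form against brute-force quadrature (the specific evaluation points are arbitrary choices):

```python
import numpy as np

# Mean fraction of eigenvalues above the threshold zeta * N under the
# Marcenko-Pastur density f_MP(x) = sqrt((4 - x)/x)/(2 pi) on [0, 4].
# The substitution x = 4 sin^2(theta) gives the closed form
#   kappa*(zeta) = 1 - [2 theta0 + sin(2 theta0)] / pi,
#   theta0 = arcsin(sqrt(zeta) / 2).
def kappa_star(zeta):
    th0 = np.arcsin(np.sqrt(zeta) / 2.0)
    return 1.0 - (2.0 * th0 + np.sin(2.0 * th0)) / np.pi

# Brute-force trapezoidal quadrature on [zeta, 4] (the integrand is regular
# there for zeta > 0), as an independent check.
def kappa_star_numeric(zeta, n=200001):
    x = np.linspace(zeta, 4.0, n)
    f = np.sqrt(np.clip(4.0 - x, 0.0, None) / x) / (2.0 * np.pi)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * (x[1] - x[0])))

k1 = kappa_star(1.0)    # = 2/3 - sqrt(3)/(2 pi) ~ 0.391 for zeta = 1
err = abs(k1 - kappa_star_numeric(1.0))
```

At the Kaiser-Guttman-like threshold $\zeta = 1$ (the average variance), only about 39% of the eigenvalues of a pure-noise matrix lie above threshold on average.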
which we evaluate next for large $N$ by extending the Coulomb gas analogy mentioned above.
The evaluation of the $N$-fold integral (8) in the large $N$ limit consists of the following steps: first, we introduce an integral representation for the delta function, $\delta(z) = \int \frac{dp}{2\pi}\, e^{i p z}$. Then, we cast again the integrand in the Boltzmann form $\exp\left[-\frac{\beta}{2} E'[\{\lambda_i\}; p]\right]$, where $p$ can be interpreted as a Lagrange multiplier. Written in this form, the integral has again a natural thermodynamical interpretation as the canonical partition function of a Coulomb gas in equilibrium at inverse temperature $\beta$. This time, however, in addition to the linear confinement and the logarithmic repulsion, the fluid particles (eigenvalues) are subjected to another (discontinuous) external potential. This external term involving $p$ in the new energy function $E'$ has the effect of constraining a fraction $\kappa = N_+/N$ of the fluid charges to the right of the point $\zeta N$.
In the large $N$ limit, the Coulomb gas with discrete charges becomes a continuous gas which can be described by a continuum (normalized to unity) density function $\rho(\lambda) = \frac{1}{N}\sum_i \delta(\lambda - \lambda_i)$. Consequently, one can replace the original multidimensional integral in (8) by a functional integral over the space of $\rho$'s. This procedure, originally introduced by Dyson [], has recently been used successfully in a number of different contexts (see [, , , , ] and references therein). Noting further that the density is expected to have the scaling form $\rho(\lambda) = \frac{1}{N} f(\lambda/N)$ for large $N$, each term in the energy is of the same order $O(N^2)$ and is expressed as an integral over the function $f$ (see eq. (10) below). The scaled density satisfies the overall normalization condition $\int_0^\infty f(x)\,dx = 1$ and the constraint $\int_\zeta^\infty f(x)\,dx = \kappa$ arising from the theta-function term in $E'$.
The probability (8) can now be rewritten as $P(N_+ = \kappa N, N) \simeq Z(\kappa)/Z$, where the numerator is the following functional integral over $f$, supplemented by two additional integrals over auxiliary variables $A_1$ and $A_2$ enforcing the two constraints mentioned above:

$Z(\kappa) \propto \int \mathcal{D}[f]\, dA_1\, dA_2\; e^{-\beta N^2 S[f; A_1, A_2]} \qquad (9)$

The action $S[f; A_1, A_2]$ is given by:

$S = \frac{1}{2}\int_0^\infty x f(x)\,dx - \frac{1}{2}\iint f(x) f(x') \ln|x - x'|\,dx\,dx' + \frac{A_1}{2}\left[\int_0^\infty f(x)\,dx - 1\right] + \frac{A_2}{2}\left[\int_\zeta^\infty f(x)\,dx - \kappa\right] \qquad (10)$
Eq. (9) is indeed amenable to a saddle point analysis:

$Z(\kappa) \approx \exp\left(-\beta N^2 S[f^*]\right) \qquad (11)$

where $f^* = f^*(x; \kappa, \zeta)$ and is obtained from the stationarity condition $\delta S/\delta f = 0$. This gives

$x + A_1 + A_2\,\theta(x - \zeta) = 2\int f^*(x') \ln|x - x'|\,dx' \qquad (12)$

for all $x \in \sigma(f^*)$, where $\sigma(f^*)$ denotes the region where charges exist, i.e., where $f^*$ is positive. Varying the action with respect to $A_1$ and $A_2$ just reproduces the two constraints. The function $f^*(x)$ obtained in this way is the rescaled equilibrium density of a Coulomb fluid where a fraction $\kappa$ of particles are constrained to lie to the right of a barrier at $x = \zeta$. It is easy to show [] that $f^*$ reduces to the unconstrained MP density $f_{MP}$ when $\kappa = \kappa^*(\zeta)$.
Thus, for large $N$, the saddle point analysis precisely predicts the result in (2) with the rate function given by

$\psi_\zeta(\kappa) = S[f^*(x; \kappa, \zeta)] - S[f^*(x; \kappa^*(\zeta), \zeta)] \qquad (13)$
It then remains to solve the saddle-point equation (12). Taking one more derivative with respect to $x$ ($x \neq \zeta$):

$\mathrm{Pr}\!\int \frac{f^*(x')}{x - x'}\,dx' = \frac{1}{2} \qquad (14)$

where $\mathrm{Pr}$ denotes Cauchy's principal value, supplemented with the constraints $\int_0^\infty f^*(x)\,dx = 1$ and $\int_\zeta^\infty f^*(x)\,dx = \kappa$.
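As a consistency check, at $\kappa = \kappa^*(\zeta)$ the unconstrained MP density must itself solve (14). This is easy to verify numerically: the substitution $x' = 2 - 2\cos\phi$ turns $f_{MP}(x')\,dx'$ into $(1+\cos\phi)\,d\phi/\pi$, and on a midpoint grid with the pole placed exactly halfway between two nodes the odd part of the singularity cancels pairwise, so no special principal-value machinery is needed (an illustrative sketch; the grid size and the point $x = 1$ are arbitrary choices):

```python
import numpy as np

# Verify that f_MP(x) = sqrt((4 - x)/x)/(2 pi) satisfies the saddle-point
# condition  Pr Int_0^4 f_MP(x')/(x - x') dx' = 1/2  on its support.
# With x' = 2 - 2 cos(phi) one has f_MP(x') dx' = (1 + cos(phi)) dphi / pi,
# and for x = 2 - 2 cos(theta) the principal value becomes
#   (1/(2 pi)) Pr Int_0^pi (1 + cos(phi)) / (cos(phi) - cos(theta)) dphi.
n = 30000
h = np.pi / n
phi = (np.arange(n) + 0.5) * h      # midpoint grid
theta = (n // 3) * h                # theta = pi/3, i.e. x = 1; the pole at
                                    # phi = theta sits exactly halfway between
                                    # two midpoints, so its odd part cancels
pv = (h / (2.0 * np.pi)) * np.sum((1.0 + np.cos(phi))
                                  / (np.cos(phi) - np.cos(theta)))
```

The analytic value is exactly $1/2$, as follows from the standard principal-value identity $\mathrm{Pr}\!\int_0^\pi \cos(n\phi)/(\cos\phi - \cos\theta)\,d\phi = \pi \sin(n\theta)/\sin\theta$.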
To invert the singular integral equation (14) is a nontrivial challenge. One often needs to guess the solution and verify it a posteriori. To get a feeling for how the solution may look, we first performed a Monte Carlo simulation of the Coulomb gas, which brought out the following very interesting features. For $\kappa \neq \kappa^*(\zeta)$, the equilibrium charge density (i) generally consists of two disconnected supports: a blob of eigenvalues to the left of $\zeta$ separated from a second blob of eigenvalues to the right of $\zeta$ (see Fig. 1), and (ii) the actual shapes of the two blobs depend on whether $\kappa$ lies below or above $\kappa^*(\zeta)$ (top and bottom panels of Fig. 1), i.e. the fluid undergoes a phase transition as $\kappa$ is varied across the critical point $\kappa^*(\zeta)$. When $\kappa = \kappa^*(\zeta)$, the two blobs merge into a single support solution which is precisely the MP spectral density $f_{MP}$.
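Such a constrained-gas Monte Carlo is simple to set up. The sketch below shows one possible scheme (not necessarily the one used for Fig. 1; $\beta = 1$, the sizes, the step width and the choice $\kappa = 0.6$ are arbitrary, and the $O(1)$ single-log term of the energy is dropped, as in the text): Metropolis moves that would push a charge off the half-line, or across the barrier and thereby change the constrained count, are simply rejected.

```python
import numpy as np

rng = np.random.default_rng(4)

# Metropolis sampling of the constrained Coulomb gas with weight
# exp[-(beta/2) E],  E = sum_i lambda_i - sum_{i != j} ln|lambda_i - lambda_j|,
# for beta = 1; exactly n_above = kappa * N charges are kept to the right of
# the barrier at zeta * N.
beta = 1.0
N = 50
zeta = 1.0
kappa = 0.6                         # > kappa*(1) ~ 0.39: an atypical fraction
n_above = int(kappa * N)

# Initial configuration satisfying the constraint.
lam = np.concatenate([
    np.linspace(0.05, 0.90, N - n_above) * zeta * N,     # below the barrier
    zeta * N + np.linspace(0.1, 2.5, n_above) * N,       # above the barrier
])

def energy_diff(lam, i, new):
    """Change in E when charge i moves from lam[i] to new."""
    others = np.delete(lam, i)
    return (new - lam[i]) - 2.0 * np.sum(np.log(np.abs(new - others))
                                         - np.log(np.abs(lam[i] - others)))

steps, accepted = 20000, 0
for _ in range(steps):
    i = rng.integers(N)
    new = lam[i] + rng.normal(0.0, 0.5 * np.sqrt(N))
    # Reject moves off the half-line, across the barrier (they would change
    # the constrained count), or exactly onto another charge.
    if new <= 0.0 or (lam[i] > zeta * N) != (new > zeta * N):
        continue
    if np.min(np.abs(new - np.delete(lam, i))) < 1e-12:
        continue
    dE = energy_diff(lam, i, new)
    if dE <= 0.0 or rng.random() < np.exp(-0.5 * beta * dE):
        lam[i] = new
        accepted += 1

count_above = int(np.sum(lam > zeta * N))
acc_rate = accepted / steps
```

A histogram of `lam / N` accumulated over many such sweeps exhibits the two disconnected blobs on either side of the barrier described in the text.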
Extracting this two-support solution for a generic pair $(\kappa, \zeta)$ from the singular integral equation (14) poses the main technical challenge that we have succeeded in solving. There exists a well known Riemann-Hilbert method [] for solving such singular integral equations when the solution has a single support []. For two disjoint supports, this method [] was recently generalized in a number of different problems [, , , , ]. Adapting this generalized method to our problem (for details see []), we find that the two-support solution of (14) satisfying the two constraints is given explicitly by the expression
The edge point $L(\kappa, \zeta)$, depending on $\kappa$ and $\zeta$, is to be determined as the unique solution of the constraint $\int_\zeta^\infty f^*(x)\,dx = \kappa$, which can be rewritten in terms of generalized hypergeometric functions []. The density is supported on the union of two disconnected intervals separated by the barrier at $\zeta$ [see, e.g., the insets of Fig. 2]. For $\kappa = \kappa^*(\zeta)$, the two intervals merge into a single one and $f^* = f_{MP}$ as expected. In the extreme cases $\kappa = 0$ and $\kappa = 1$, the density also has a single support and its analytical form is known from earlier works []. The function $f^*(x)$ is plotted in Fig. 1 for a specific choice of $\kappa$ and $\zeta$, together with the result of Monte Carlo simulations of the Coulomb gas, showing excellent agreement.
Next we substitute this explicit solution in (10) to evaluate the saddle-point action $S[f^*]$. For this, we need to compute the single and double integrals in (10). The double integral can be written in terms of a simple integral after multiplying (12) by $f^*(x)$ and integrating over $x$. This leads to:

$\iint f^*(x) f^*(x') \ln|x - x'|\,dx\,dx' = \frac{1}{2}\left[m_1 + A_1 + A_2 \kappa\right] \qquad (16)$

where $m_k = \int x^k f^*(x)\,dx$ is the $k$-th moment of the density. The first moment $m_1$ can be computed explicitly. It remains to fix the two Lagrange multipliers $A_1$ and $A_2$, by assigning suitable values of $x$ in Eq. (12). Skipping details [], it turns out that $A_1$ and $A_2$ can be expressed in terms of a pair of auxiliary functions, which can be shown to be essentially the moment generating functions of $f^*$ [].
In summary, using a Coulomb gas approach we computed analytically the likelihood of retaining $N_+$ principal components if a completely random (pure noise) prior (Wishart random matrix) is assumed for the underlying data. This likelihood is an essential ingredient in Bayes' formula (1), which in turn gives the probability that a certain number of significant eigenvalues is observed if the data represent pure noise. Thus the rate function computed in (18) serves indeed as a benchmark to gauge the significance of results obtained for empirical data. We also found an interesting phase transition in the associated Coulomb gas problem, where two separated blobs of Coulomb charges merge into a single one as the critical value $\kappa^*(\zeta)$ is approached. This phase transition is responsible for the appearance of a logarithmic singularity in the rate function close to its minimum, which in turn leads to a highly universal logarithmic growth of the variance of $N_+$ with $N$.
The present work can be developed in several directions: on one hand, it would be interesting to analyze the case $M - N \sim O(N)$ and check whether the universality found here still persists. On the other hand, improved stopping criteria based on null random matrix results are very much called for. The Coulomb gas method developed here for null random matrices may indeed be useful in that direction.
S.N.M. acknowledges the support of ANR grant 2011-BS04-013-01 WALKMAT.
- S. S. Wilks, Mathematical Statistics (John Wiley & Sons, New York, 1962).
- K. Fukunaga, Introduction to Statistical Pattern Recognition (Elsevier, New York, 1990).
- L. I. Smith, A tutorial on Principal Components Analysis (2002).
- N. Holter et al., Proc. Natl. Acad. Sci. USA 97, 8409 (2000).
- O. Alter et al., Proc. Natl. Acad. Sci. USA 97, 10101 (2000).
- L.-L. Cavalli-Sforza, P. Menozzi, and A. Piazza, The History and Geography of Human Genes (Princeton Univ. Press, 1994).
- J. Novembre and M. Stephens, Nature Genetics 40, 646 (2008).
- J.-P. Bouchaud and M. Potters, Theory of Financial Risks (Cambridge University Press, Cambridge, 2001).
- Z. Burda and J. Jurkiewicz, Physica A 344, 67 (2004); Z. Burda, J. Jurkiewicz, and B. Wacław, Acta Phys. Pol. B 36, 2641 (2005) and references therein.
- R. W. Preisendorfer, Principal Component Analysis in Meteorology and Oceanography (Elsevier, New York, 1988).
- D. A. Jackson, Ecology 74, 2204 (1993).
- L. Guttman, Psychometrika 19, 149 (1954); N. Cliff, Psychological Bulletin 103, 276 (1988).
- G. D. Grossman and D. M. Nickerson, Ecology 72, 341 (1991).
- J. O. Berger, Statistical Decision Theory and Bayesian Analysis (Springer Series in Statistics, 2nd ed., Springer-Verlag, 1985).
- J. Wishart, Biometrika 20, 32 (1928).
- S. N. Majumdar, C. Nadal, A. Scardicchio, and P. Vivo, Phys. Rev. Lett. 103, 220603 (2009).
- S. N. Majumdar, C. Nadal, A. Scardicchio, and P. Vivo, Phys. Rev. E 83, 041105 (2011).
- The symbol $\approx$ stands for the precise logarithmic law $\lim_{N\to\infty} -\ln P(\kappa N, N)/(\beta N^2) = \psi_\zeta(\kappa)$.
- A.T. James, Ann. Math. Statistics 35, 475 (1964).
- M. L. Mehta, Random Matrices (Academic Press, Boston, 1991).
- P. Vivo, S. N. Majumdar, and O. Bohigas, J. Phys. A: Math. Theor. 40, 4317 (2007).
- F. J. Dyson, J. Math. Phys. 3, 140 (1962).
- V. A. Marčenko and L. A. Pastur, Math. USSR-Sb 1, 457 (1967).
- D. S. Dean and S. N. Majumdar, Phys. Rev. Lett. 97, 160201 (2006).
- D. S. Dean and S. N. Majumdar, Phys. Rev. E 77, 041108 (2008).
- See Supplementary Material.
- E. Brezin, C. Itzykson, G. Parisi, and J. B. Zuber, Commun. Math. Phys. 59, 35 (1978).
- F. G. Tricomi, Integral Equations (Pure Appl. Math V, Interscience, London, 1957).
- P. Vivo, S. N. Majumdar, and O. Bohigas, Phys. Rev. Lett. 101, 216809 (2008).
- P. Vivo, S. N. Majumdar, and O. Bohigas, Phys. Rev. B 81, 104202 (2010).
- K. Damle, S. N. Majumdar, V. Tripathi, and P. Vivo, Phys. Rev. Lett. 107, 177206 (2011).