Large deviations of spread measures for Gaussian matrices

Large deviations of spread measures for Gaussian matrices

Abstract

For a large Gaussian matrix, we compute the joint statistics, including large deviation tails, of generalized and total variance - the scaled log-determinant and trace of the corresponding covariance matrix. Using a Coulomb gas technique, we find that the Laplace transform of their joint distribution decays for large (with fixed) as , where is the Dyson index of the ensemble and is a -independent large deviation function, which we compute exactly for any . The corresponding large deviation functions in real space are worked out and checked with extensive numerical simulations. The results are complemented with a finite treatment based on the Laguerre-Selberg integral. The statistics of atypically small log-determinants is shown to be driven by the split-off of the smallest eigenvalue, leading to an abrupt change in the large deviation speed.

1 Introduction

The standard deviation of an array of data is the simplest measure of how spread these numbers are around their average value . Suppose that the ’s represent the final ‘Physics’ marks of students of a high-school. Most worrisome scenarios for the headmaster would be a low and/or a high , signaling an overall poor and/or highly non-uniform performance.

What if ‘Physics’ and ‘Arts’ marks are collected together? Detecting performance issues now immediately becomes a much harder task, as data may fluctuate together and in different directions. A two-dimensional scatter plot may help, though. The “centre” of the cloud gives a rough indication of how well the students perform on average in both subjects. But how to tell in which subject the gap between excellent and mediocre students is more pronounced, or whether outstanding students in one subject also excel in the other?

In Fig. 1 (Bottom) we sketch two scatter plots of marks adjusted to have zero mean. A meaningful spread indicator seems to be the shape of the ellipse enclosing each cloud. For example, an almost circular cloud - like School - represents a rather uninformative situation, where your ‘Arts’ marks tell nothing about your ‘Physics’ skills, and vice versa. Conversely, a rather elongated shape - like School - highlights correlations between each student’s marks in different subjects.

Figure 1: (color online) Top: Sketch of the probability density of the likelihood ratio of a Gaussian iid data set. In yellow, the typical region around the mean of order . Larger fluctuations are referred to as atypical large deviations. Bottom: Sketch of two multivariate data sets with and . Each point represents a student, for two different schools, and his/her marks in Arts and Physics. The two datasets have same generalized variance , but different total variance . The likelihood ratio of School 1 is compatible with the iid hypothesis, while the value of for School is atypically far from the average .

For a bunch of many scattered points it would be desirable to summarize the overall spread around the mean just by a single scalar quantity, like the perimeter or area of the enclosing ellipse. Not surprisingly, however, these indicators (taken individually) have evident shortcomings [30]. Surely a wiser choice is to combine more than a single spread measure (like perimeter or area alone), to obtain a more revealing indicator. These issues arise naturally in multivariate statistics, and more mathematical tools and techniques are required compared to the univariate setting.

In this work, we compute the joint statistics of “perimeter” and “area” enclosing clouds of random high-dimensional data. Why this is a crucial (and so far unavailable) ingredient for an accurate data analysis will become clearer very shortly.

In the more general setting of subjects and students, their marks can be arranged in a matrix , adjusted to have zero-mean rows. We then construct the normalized covariance data matrix , with non-negative eigenvalues , which is precisely the multi-dimensional analogue of the variance for a single array. The surface and volume (“perimeter” and “area” in the two-dimensional example) of the enclosing ellipsoid are related to the scaled trace and determinant of :

(1)

In statistics, these objects are called total and generalized variance respectively [2]. As discussed before, blending both estimators together would be preferable, like in the widely used positive scalar combination

(2)

called likelihood ratio [2], where is the log-determinant of

(3)

Values of for different shapes of the data cloud are sketched in Fig. 1 (Bottom).

Now, suppose that we wish to test the hypothesis that the data (yielding a certain empirical ) are independent and identically distributed. What if an atypically high or low (with respect to a null i.i.d. model) comes out from the data? We would be tempted to reject the test hypothesis outright. However, this might lead to a misjudgment, as atypical values of for the null model can (and do) occur (just very rarely). What is the probability of this rare event? Here we provide a solution to this problem, computing the joint statistics of total and generalized variance for a large Gaussian dataset.

The derivation of these results relies on techniques borrowed from statistical mechanics and random matrix theory (RMT). We express the large deviation functions of spread indicators as excess free energies of an associated 2D Coulomb gas, whose thermodynamic limit is analyzed in the mean-field approximation valid for with fixed. This approach is complemented with a finite analysis based on the “Laguerre” version of the celebrated Selberg integral. The marriage between these two techniques provides an elegant solution to a challenging problem. In addition, our unifying framework recovers and extends some partial results earned by statisticians via other techniques.

The article is organized as follows. In Section 2 we introduce the notation and we summarize our main results. We then elaborate at length on its consequences. Finally, we briefly discuss the relation of our findings with earlier works. Section 3 contains the derivations. First we summarize the “Coulomb gas method” and we present a quite general algorithm to find the large deviation functions of linear statistics on random matrices (Subsection 3.1). Then, in Subsection 3.2 we turn to the actual proof. In Section 4 we discuss two issues that are not captured by the Coulomb gas method. Finally we conclude with a summary and some open questions in Section 5.

2 Setting and formulation of the results

We consider an ensemble of matrices whose entries are real, complex or quaternion independent standard Gaussian variables 1, labeled by Dyson’s index and respectively, and we form the (real, complex or quaternion) sample covariance matrix

(4)

This ensemble of random covariance matrices (positive semi-definite by construction) is known as the Wishart ensemble [55] with rectangularity parameter . Remarkably, in the Gaussian case, the joint probability density of the positive eigenvalues of is known explicitly [25, 2]

(5)

where the energy function contains the external potential

(6)

The normalization constant is also known for any finite from the celebrated Selberg integral [45, 26, 3]. The joint law of the eigenvalues (5) is the Gibbs-Boltzmann canonical distribution of a 2D Coulomb gas (logarithmic repulsion) constrained to stay on the positive half-line and subject to the external potential at inverse temperature (we adopt the usual physical convention that probabilities are zero in regions of infinite energy). As we shall see, the derivation of our result is independent of the restriction or . Therefore, from now on we shall consider non-quantized values2 .

We consider the scaled log-determinant and trace of the covariance matrix . In terms of the eigenvalues , they read

(7)

Their joint probability law and Laplace transform are denoted respectively by

(8)

where the average is taken with respect to the canonical distribution of the eigenvalues (5). Here we are interested in the large behavior of and at logarithmic scales. More precisely, we show that for large

(9)

where stands for as .

The functions and are called rate function and cumulant generating function (GF) respectively [21, 51]. It is a standard result in large deviation theory that the functions and in (9) are related via a Legendre-Fenchel transformation.

Here we compute explicitly, for all and , the cumulant GF

(10)

From now on we shall denote . The main results of the paper are as follows.

A: Joint large deviation function of Generalized and Total Variances.  

Let be distributed according to (5)-(6) and let and as in (7). Their joint cumulant generating function defined by (10) exists for and and is given by

(11)

where and are the individual GF of cumulants of and . They are given explicitly by

(12)
(13)

with for .

We first discuss some consequences this result. The derivation is postponed to Section 3.

Remark 1: .

The large deviation functions , and are independent of . This property is standard for 2D Coulomb gas systems.

Remark 2: .

The joint cumulant GF is not the sum of the single generating functions: ( and are not independent for large ).

Remark 3: .

For the GF is analytic at and the joint cumulants of and are obtained by evaluating the derivatives of at . More precisely, to leading order in for

(14)

Note that . Extracting the first cumulants, we obtain to leading order in

(15)
(16)

where we set . The correlation coefficient , independent of , is positive for all values of (if the “area” increases, typically so does the “perimeter”). Notice that the expression of does not cover the case (square data matrices). This case will be treated separately in Section 4.

The decay of the higher order mixed cumulants for is given to leading order in by

(17)
Remark 4: .

The marginal probability densities and behave as

(18)

where and are the individual rate functions of and . These individual rate functions should also be in principle computable as inverse Legendre-Fenchel transform of (12)-(13). However, for the scaled log-determinant this is only possible for “not too small” values (); this point will be discussed in more details in Section 4. The expression of in the full range is instead remarkably simple

(19)

This analytic function is strictly convex and positive and it attains its unique minimum (zero) at (the asymptotic mean value of , see (15)). This rate function provides information on the large full probability density of . We can identify three regimes:

  • typical fluctuations of order about the average are described by the quadratic behavior of around its minimum at , corresponding to asymptotically Gaussian fluctuations with mean and variance as in (15)-(16);

  • large deviations for (atypically large “perimeters”) exhibit an exponential decay (independent of the rectangularity parameter );

  • for (atypically small “perimeters”) we find a -dependent power law.

Summarizing:

(20)

These predictions have been confirmed by extensive numerical simulations. A sample size of about spectra of complex () Wishart matrices has been efficiently generated using a tridiagonal construction [47]. The data are plotted in Fig. 2 and show a very good agreement with the behavior in (18).

Figure 2: Numerical simulations (black circles) of complex () Wishart matrices of size with . Here the sample size is . Left: The numerical values (black circles) for the total variance . The Gaussian approximation (orange line) with average  (15) and standard deviation (16) fits well the data for fluctuations of order but deviates strongly for atypical fluctuations. The global behavior is captured instead by the large deviation function (blue line) of (19). Right: Numerical values (black circles) for the log-determinant . Again, the Gaussian approximation (green line) with average  (15) and standard deviation (16) describes well the data for fluctuations of order but deviates for larger fluctuations. The large deviation function (red line) of (45) provides a global description of the data. The critical point (below which the large deviations change speed from to ) is not visible in the picture (for and the critical point is at to the left of ).

Once the joint large- behavior of generalized and total variances is known, one may easily derive a large deviation principle for any continuous function of them. For instance, from , it is easy to compute the Laplace transform of the likelihood ratio as . Hence we have the following result.

B: Large deviations of the likelihood ratio.  

The likelihood ratio cumulant GF is given by

(21)

with as in (11). With the same notation as above, the cumulants of at leading order in follow by differentiations

(22)

( is the Heaviside step function) for . This corresponds to typical fluctuations on a region around the mean

(23)

with variance3

(24)

Note that, since and are not independent, the cumulants (22) of involve the extra term .

From Result A, extracting the asymptotics of the first moments of and for we recover classical results in multivariate analysis, valid when the sample size is much larger than the number of variates .

C: Classical statistics.  

In the regime , and become asymptotically Gaussian. More precisely, as

(25)

in distribution, where denotes a standard Gaussian variable.

To conclude this introductory section, we remark that our findings reproduce some known results for the typical fluctuations (mean and variance) of and separately, in the real case () [2, 31, 4, 32, 29]. Moreover, the variances and covariances (16) and (24) can be computed for generic using covariance formulae valid for one-cut -ensembles of random matrices [6, 12].

A precious tool in classical statistics is the Barlett decomposition [5], which is useful to transform functions of strongly correlated eigenvalues of Wishart matrices (see (5)) into functions of independent (but not identical) chi-squared random variables. In the asymptotic regime this decomposition becomes sufficiently manageable to derive some interesting results. For real matrices, the limits (25) agree with classical theorems based on the Barlett decomposition (see e.g. [40]). From the results on , the statistical behavior of the scaled determinant can be easily derived. For statistics of determinants of random matrices (more general than the sample covariance matrices considered here), see [10, 43, 33, 49]. For more details on the classical methods in multivariate analysis we refer to the classical books [40, 30] and the excellent review [31] on the applications of RMT in multivariate statistics.

3 Derivation

We now turn to the derivation of Result A. Results B and C follow as corollaries and hence their proof will be omitted. In Subsection 3.1 we set up the variational problem in the framework of the 2D Coulomb gas thermodynamics. The Coulomb gas analogy for spectra of random matrix ensembles goes back to the seminal works by Wigner [54] and Dyson [20]. In particular, it was Dyson who first used this analogy to compute large random matrix statistics. This idea has been developed later and used in several areas of physics [8, 9, 17, 53, 22, 18, 34, 35, 23, 50, 36, 37, 28, 13, 14, 11, 16, 15]. In Subsection 3.2 we solve the saddle-point equations and we compute explicitly , thus proving Result A.

3.1 2D Coulomb gas problem

The Coulomb gas calculation goes as follows. First we observe from (5)-(6) that the joint Laplace transform (8) is finite for and . From (5) and (7)-(8), this Laplace transform can be written as the ratio of two partition functions

(26)
(27)

is the partition function of a constrained Coulomb gas, where the energy function contains now the additional single-particle potential

(28)

Note that is the partition function of the unconstrained gas and therefore it coincides with the normalization constant in (5).

Hence, the computation of the joint cumulant GF amounts to evaluating the leading order in of the partition function . More precisely, from (26)-(27), may be expressed as the excess free energy

(29)

of the Coulomb gas in the effective potential with respect to the unperturbed Coulomb gas (). This effective potential is bounded from below for and (the domain of existence of the Laplace transform , of course). For any finite , the excess free energy (29) can be computed exactly in terms of a Laguerre-Selberg integral [45] (see Section 4). How to deal with the limit of large ? We show now how the task of computing can be reduced to a variational problem.

First, we introduce the normalized density of the gas particles , in terms of which any sum function on the eigenvalues can be easily expressed. For instance, the log-determinant and trace (7), both linear statistics on , are conveniently expressed as linear functionals on as

(30)

Second, for large , the energy function of the 2D Coulomb gas can be converted into a mean-field energy functional , where

(31)

The mean-field functional (31) has been intensely studied in several fields. We refer to [44, 46, 16] for a detailed exposition and collection of known results. In particular, it is known that for large the partition function is dominated by , the unique minimizer of the mean-field energy functional in the space of normalized densities:

(32)

The meaning of the saddle-point density is the following: is the typical configuration of the eigenvalues yielding a prescribed value of log-determinant and trace

(33)

Hence, a possible route to evaluate consists of finding for all the saddle-point density and inserting it back into the energy functional (31) to evaluate the leading order of as in (32). This technique has been exploited in the last decade in many physical problems, mainly to compute the large deviations of single observables. However, this route entails the explicit computation of the mean-field energy (31) at the saddle-point density, which is not necessarily an easy task. The situation gets even worse in the case of joint statistics.

In certain situations one can use a shortcut (see [16]) based on a thermodynamic identity that has been stated rigorously in the language of large deviation theory [51] by Gärtner [27] and Ellis [21]. It is known that, if a cumulant GF is differentiable in the interior of its domain, then the rate function is the Legendre-Fenchel transform of the cumulant GF (and hence, is the inverse Legendre-Fenchel transform of ). This relation between rate function and cumulant GF can be exploited in our problem as follows (for a general mathematical discussion we refer to [16]). We assume first that is differentiable. Therefore, the Gärtner-Ellis theorem ensures that is also smooth and given from by the Legendre-Fenchel transformation

(34)

The identity (34) can be written in the (almost) symmetric form

(35)

This equation should be interpreted with care. Indeed, in (35), there are only two independent variables, for instance and or and . The relation between the conjugate variables and is provided by

(36)

where and are given in (33) and , are the corresponding inverse maps. Hence, we can write the differential relations

(37)
(38)

supplemented with the normalization condition (and hence ). The expressions (37) and (38) can be interpreted as Maxwell relations among thermodynamic potentials, in our case the Helmholtz free energy and the enthalpy. However it is somewhat astonishing that these relations have not been applied in the Coulomb gas computations until very recently (for applications of (37)-(38) in physical models see [13, 14] and also [28]).

Using (37)-(38) one can use the following shortcut to compute the large deviations functions (for a detailed exposition we refer to [16]). For instance, in order to compute we only need to find the saddle-point density and compute and from (33). Then, follows from integration of (37)

(39)

This shortened route of the Coulomb gas method provides an effective tool to evaluate large deviations functions. A large amount of unnecessary computations can be avoided and the task of computing joint large deviations becomes feasible. In the next subsection, we will use this strategy to derive our main result.

3.2 Saddle-point equation and large deviation functions

The first problem to overcome is to find the saddle-point density of the mean-field functional . From (31), the stationarity condition of reads

(40)

where denote the support of (for the left hand side is greater than or equal to the same constant). The physical meaning of (40) is clear: at equilibrium, the 2D Coulomb gas arranges itself in such a way that each particle has equal electrostatic energy (the left hand side of (40)).

Taking one further derivative with respect to , the resulting singular integral equation can be solved for using a theorem due to Tricomi ([52, Sec. 4.3]), and the result reads

(41)

where the edges of the support depend on and as

(42)

For the density of the unperturbed gas coincides with the Marčenko-Pastur distribution [38] with edges . For the saddle-point density is bounded while at the lower edge reaches the origin and acquires an inverse square root divergence there.

From the corresponding values of scaled log-determinant and trace are

(43)

with for (hence is defined for ). Combining (43) and (39) we obtain the cumulant GF as in Result A.

In principle, one may also compute the rate function in the same way. However, it is not possible to write down and (the inverse maps of (43)) in terms of elementary functions. For simplicity, however, we show how to carry out the explicit computation for trace and log-determinant separately and establish the large decay as in Remark 4. Setting we find from (43), and we immediately get the rate function of the scaled traces from (39) by integrating (the inverse of )

(44)

This proves (19). Similarly, for the log-determinant we have

(45)

valid for (see discussion in the next Section), where is the inverse of .

4 Further results and discussion

The treatment in the previous section does not cover the following two issues:

  • The case (square data matrices), for which the leading term of the variance of (computed from the approach described above) is not defined (see Eq. (16)). What is the origin of this hitch?

  • The origin of the condition for the validity of the rate function in (45) seems mysterious. What is the mechanism governing the statistics of “anomalously small” log-determinants, then?

We discuss these two issues in detail here.

4.1 The case (square data matrices)

As already disclosed, if is a Wishart matrix with () the limiting variance of is not described by our large deviations result (see Eq. (16)). The origin of this hitch is as follows. Recall that the cumulant GF is defined for . Hence, for (i.e. ) is non-analytic in and the cumulants cannot be obtained by differentiation.

A way to circumvent this problem is to first compute for finite , and then evaluate its large asymptotics. The joint Laplace transform of and can be indeed evaluated exactly also at finite , using the Laguerre-Selberg integral [45, 26, 3]:

(46)

Using this identity, one may evaluate the Laplace transform as

(47)

with as in (6). We have verified that our large formulae reproduce with good accuracy the finite result even for moderate values of .

From (47) it is possible to extract the large deviation functions for the scaled trace (this corresponds to ). On the other hand, the asymptotic in the variable is not trivial.

However we can use this exact result to deduce for symmetric data matrices (), the case that was not covered by our Result A. Setting and in (47), we denote by the Laplace transform of at finite . We can compute the derivatives of as

(48)
(49)

where is the -Polygamma function [1]. In principle one can compute higher derivatives recursively, and evaluate the asymptotic values of the cumulants of . For instance, average and variance of are related to the derivatives and at . Using the normalization we get

(50)

Using the Euler-Maclaurin summation formula [1] (with ), and the classical asymptotic , valid for , with and , we obtain for large the limit value , according to (15) for . A similar analysis of the Laguerre-Selberg integral (for ) was performed in [10], but it was restricted to the computation of at leading order in . Here we tackle the problem of the variance of for square data matrices. From (49) we obtain

(51)

After a somewhat lengthy calculation, we managed to extract the large asymptotics

(52)

with a constant given by

(53)

Some special values are . This result has been verified numerically, see Fig. 3. Note the logarithmic growth of (52) with , in contrast to the limiting behavior of for . Such a logarithmic divergent variance is customary for discontinuous spectral linear statistics in RMT, the paradigmatic example being the number variance [20, 34, 36, 39]. Notice that the function is indeed discontinuous at . However, as long as the support of the equilibrium measure does not contain the origin and this singularity is ineffective; only for we have , and at that point the singularity of starts being felt. The central limit theorem with logarithmically divergent variance for has been proved in [42, 41] for . The subleading corrections to in (52) are instead a new result.

Figure 3: Rescaled variance of the log-determinant for . Each point is produced sampling Wishart matrices of size for and . The error for each point is of order , not visible in the picture. The solid lines are the exact result (52).

4.2 The statistics of atypically small log-determinants

We have claimed earlier that the rate function of can be computed as the inverse Legendre-Fenchel transform of only for . Why is this the case? As a matter of fact, the Gärtner-Ellis theorem has two hypotheses: first, the cumulant GF is required to be differentiable in the interior of its domain; second, the derivatives of the cumulant GF should diverge on the boundaries of the domain (a condition known as steepness [21, 51]). In our case, is differentiable for all but the left derivative attains a finite value at the boundary point : as . Hence, only a local version of the Gärtner-Ellis theorem holds and with being the branch of the Legendre-Fenchel transform of for .

The 2D Coulomb gas analogy provides a rather intuitive physical picture of this obstruction. We have seen that the saddle-point density is bounded as long as (see (41)-(42)). When , the lower edge of the saddle-point density reaches the origin and acquires an inverse square-root singularity there. For , the logarithmic part of the effective potential becomes attractive, giving rise to an electrostatic instability of the gas. As already discussed, is the typical distribution of the eigenvalues of yielding a prescribed value of . Setting to simplify the discussion, we see that as long as and the critical value corresponds to the critical density obeying