Parameter inference with estimated covariance matrices

# Parameter inference with estimated covariance matrices

## Abstract

When inferring parameters from a Gaussian-distributed data set by computing a likelihood, a covariance matrix is needed that describes the data errors and their correlations. If the covariance matrix is not known a priori, it may be estimated and thereby becomes a random object with some intrinsic uncertainty itself. We show how to infer parameters in the presence of such an estimated covariance matrix, by marginalising over the true covariance matrix, conditioned on its estimated value. This leads to a likelihood function that is no longer Gaussian, but rather an adapted version of a multivariate -distribution, which has the same numerical complexity as the multivariate Gaussian. As expected, marginalisation over the true covariance matrix improves inference when compared with Hartlap et al.’s method, which uses an unbiased estimate of the inverse covariance matrix but still assumes that the likelihood is Gaussian.

###### keywords:
methods: data analysis – methods: statistical – cosmology: observations
12

## 1 Introduction

A very common problem in statistical inference concerns data that are Gaussian-distributed. The likelihood of the observed data is a multivariate Gaussian, characterised only by a mean data vector and a covariance matrix :

 G(Xo|μ,Σ)=1√|2πΣ|exp[−12(Xo−μ)TΣ−1(Xo−μ)]. (1)

The posterior probability of the parameters is proportional to the likelihood, now treated as a function of the parameters (through the dependence of the mean and the covariance matrix), multiplied by a suitable prior. Ideally one has analytic expressions for the mean and covariance in terms of the model parameters, but in many cases these are not available, and one or both may need to be estimated from simulated data which mimic the experiment that is to be analysed (e.g., Semboloni et al. (2006); Heymans & et al. (2013)), or from the data themselves (e.g., Budavári & et al. (2003)). However, although an unbiased simulated covariance matrix can be constructed, its inverse is not an unbiased estimator of the inverse (or precision) matrix , which is what is needed in the likelihood Eq. (1). One can construct an unbiased estimator of by a rescaling of (Anderson, 2003), as advocated by Hartlap et al. (2007). This widens up the credible intervals. If simulations are computationally cheap, then one can generate a large number of simulated datasets and obtain an accurate estimate of the covariance matrix. This asymptotic regime occurs only when far exceeds the size of the data vector, . In many practical cases this is not possible, and the number of simulated datasets is small, with the consequence that statistical noise in the precision matrix propagates into errors in the parameters (Taylor et al., 2013; Dodelson & Schneider, 2013; Hamimeche & Lewis, 2009). However, there is a more fundamental difficulty with the approach adopted, as it assumes that the likelihood is still Gaussian, albeit with a different precision matrix, whereas in fact it is not.

A principled way to tackle the problem is to recognise that the simulated data provide samples of the covariance matrix, so is itself a random object, based on a number of simulations. For Gaussian data, we have the advantage that the sample distribution of is known, for a given true covariance matrix , and we can exploit this, with a suitable prior, by constructing the probability of conditional on the sample , and then marginalising over the unknown covariance matrix . This can be done analytically for our preferred choice of Jeffreys prior for . As a consequence, we properly propagate the uncertainty in the covariance matrix into the final inference, computing the quantity we want, i.e., the likelihood given the simulated covariance matrix and the number of samples on which is it based: . This object, where we keep the dependence on the number of simulated datasets explicit to emphasize its importance, is the main result of this paper. It is not Gaussian, but rather follows a modified version of the multivariate -distribution. In practical terms, it is no more expensive to compute than the Hartlap-scaled Gaussian likelihood, but statistically sound, and can be retrospectively applied to many analyses that have used a different likelihood function by appropriate re-weighting of points, provided that the chains adequately sample the parameter space that the -distribution favours.

## 2 Replacing a true covariance matrix by an estimator

When inferring cosmological model parameters from a data set, we usually have just one p-dimensional observed data vector , which is a single realization of a statistical process which we assume to be a multivariate Gaussian of which the mean , and the covariance matrix may depend on the parameters

 Xo∼Np[μ(θ),Σ(θ)]. (2)

In the following, we suppress this dependence on the parameters but it is still implied.

If were known precisely, the likelihood would be the Gaussian, Eq. (1). However, if is unknown, and all we have is an estimator , then the likelihood must be replaced by another likelihood of which we will show that it is not a Gaussian.

One method - viable for Frequentists - of estimating the covariance matrix, is to draw further independent data vectors from the distribution of and to calculate their sample covariance. Typically, such repeated independent measurements are however impossible in cosmology. Nonetheless, if we can simulate the observation, then we are able to generate further samples , that are statistically equivalent to the single observation . The covariance matrix can then be estimated from these simulations, and the likelihood that we require is the probability of the data, given and the number of simulations on which it is based, i.e., .

If we run independent simulations, then is the average, and an unbiased estimator of is

 S=1N−1N∑i=1(Xi−¯X)(Xi−¯X)T. (3)

In the following subsection, we derive an analytical replacement for the Gaussian likelihood, Eq. (1), conditioned on an estimate , and from Sect. 4.1 onwards we study the effects of this replacement on parameter inference.

### 2.1 Derivation of the multivariate t-distribution

We now derive the likelihood that depends on an estimator instead of the true covariance .

Any matrix of the type is by construction a Wishart matrix (Wishart, 1928; Mardia et al., 1979; Anderson, 2003), if is drawn from a multivariate Gaussian. When estimating a covariance matrix by averaging over random samples drawn from simulations, the estimated covariance matrix is a Wishart matrix, and has a Wishart distribution (Anderson, 2003),

 W(S|Σ/n,n)=|S|n−p−12exp[−12nTr(Σ−1S)]2pn2|Σ/n|n2Γp(n2) (4)

where we call the degrees of freedom and is the -dimensional Gamma function. By the central limit theorem, this distribution is also asymptotically appropriate if the sampling distribution of is non-Gaussian.

We can invert this distribution to yield the distribution of the true covariance matrix conditioned on the estimator , by using Bayes’ Theorem

 P(Σ|S,N)π(S)=W(S|Σ/n,n)π(Σ) (5)

and adopting priors . Since the determinant of the positive-definite covariance matrix is strictly positive, it is a scaling parameter, and we therefore assume the independence-Jeffreys prior (Jeffreys, 1961; Sun & Berger, 2006)

 π(Σ)∝|Σ|−p+12. (6)

This is by construction invariant under reparametrizations, and can therefore be regarded as uninformative, independent of the choice of parameters.3 We then have

 P(Σ|S,N) ∝ W(S|Σ/n,n)π(Σ) (7) ∝ |Σ|−n+p+12exp[−12nTr(Σ−1S)] ∝ W−1(Σ|nS,n)

showing that the uncertainty of the unknown true can be described by an inverse Wishart distribution, conditioned on the sample estimate,

 W−1(Σ|C,n)=|C|n2|Σ|−n+p+12exp(−12Tr(Σ−1C))2np2Γp(n2) (8)

where we used . Increasing the estimates, , of the covariance matrix, will make this distribution more sharply peaked, reflecting the improvement of the estimation.

Given the distribution Eq. (8), we can now marginalize the Gaussian likelihood over the unknown covariance, to find what we are after, which is the likelihood of the data , given a mean and an estimate of the covariance matrix from simulations:

 P(Xo|μ,S,N)=∫dΣG(Xo|μ,Σ)P(Σ|S,N) (9) ∝ ∫dΣ|Σ|−N+p+12exp[−12Tr(Σ−1Q)]

where we have defined . The last line is structurally the integration over an unnormalized inverted Wishart distribution , so the result is the normalization constant as in Eq. (8), leading to

 P(Xo|μ,S,N)∝|Q|−N2. (10)

Resubstituting , using the matrix identity

 |A+bbT|=|A|(1+bTA−1b) (11)

and normalizing, we arrive at the likelihood for the -dimensional dataset , conditioned on the mean and a sample of the covariance matrix from simulations:

 P(Xo|μ,S,N)=¯cp|S|−1/2[1+(Xo−μ)TS−1(Xo−μ)N−1]N2 (12)

This is a cosmologist’s version of a multivariate -distribution. It is not the standard (Frequentist) multivariate -distribution, which jointly estimates the mean and its covariance from a data set of data vectors. In contrast, we have assumed exactly one data vector that determines where the likelihood will peak - and simulated vectors from which we estimate the covariance. The normalization is

 ¯cp=Γ(N2)[π(N−1)]p/2Γ(N−p2), (13)

where is the usual Gamma function and we require . For expensive simulations, when a feasible is still comparable to , the differences between a Gaussian and the -distribution become important. So if a covariance matrix must be replaced by an estimator from simulations, the modified -distribution Eq. (12) replaces the multivariate Gaussian Eq. (1). This is the main result of the paper.

## 3 Attempting to debias a Gaussian likelihood

Instead of using the -distribution Eq. (12) it has become standard in cosmology to follow a procedure outlined by Hartlap et al. (2007), where the authors propose to stick with a Gaussian likelihood, and only to replace the true inverse covariance matrix by a scaled inverse sample covariance matrix

 Σ−1→αS−1 (14)

with

 α=N−p−2N−1. (15)

This is motivated by the fact that follows an inverse Wishart distribution, which has a biased expectation value as shown in Anderson (2003). Here, the angular brackets denote averaging over the inverse Wishart distribution.

Hartlap et al. (2007) argue that this debiased inverse covariance matrix will remove all biases from parameter inference. However, the situation is more complex. In a Bayesian anlaysis one would not necessarily define an estimator , but if one does, the bias is where the angular brackets now denote the average over the likelihood of the parameters. Adopting the wrong sampling distribution will yield incorrect posterior distributions, with biased parameter estimates (should they be made) and incorrect errors, even if the inverse covariance matrix itself has been debiased.

We compare univariate examples of the likelihoods and the modified -distribution Eq. (12) in Fig. 1: the Hartlap-scaled and the unscaled Gaussian only differ in width, whereas the -distribution has a more sharply peaked central region but broader extreme wings than a Gaussian, allowing for more scatter away from the peak.

Additionally, the scaling in Eq. (14) implies a sharp mapping between the estimator and , which does not account for the randomness of , due to the finite width of the inverse Wishart distribution. Therefore, applied to a single given should not be interpreted as a reliable ‘debiasing’ but rather a scaling that widens up the Gaussian likelihood Eq. (1) in an essentially random way. This randomness will propagate through the parameter inference and introduce a scatter of the likelihood contours of which we show a simple example in Fig. 2. This scatter can only be reduced by estimating the inverse covariance matrix more precisely, see also (Taylor et al., 2013; Dodelson & Schneider, 2013).

## 4 Comparison of the distributions

### 4.1 Illustrative univariate example

We illustrate with a univariate frequentist example that the Hartlap-scaled Gaussian introduces errors into the parameter inference, whereas Eq. (12) does not. We choose a true mean , which we want to estimate in the following. We then produce 10,000 Gaussian data sets with this mean, and produce 150 estimates of the covariance matrix from further samples (where determines ). For each data set and each covariance matrix, we then calculate the Hartlap-scaled likelihood and the modified -distribution. Both the Hartlap-scaled Gaussian and the -distribution of make quantitative predictions such as stating that will fall 5% of the time into the lower 5% tail of the likelihood, or 68% of the time into the 68% likelihood contour, given some data sets. But since the two likelihoods differ in shape, their lower-tail probabilities and likelihood contours will also differ, and only one will make the correct quantitative predictions. Since we know the true mean, we can test this. Likelihood contours and tail-probabilities can be converted into each other, so it is sufficient to test only one. We choose the lower -percent tail probability, i.e. the cumulative probability function and check whether the -percent cumulative distribution does indeed cover the true mean percent of the times. In Fig. 3, we find that only the -distribution correctly reproduces the cumulative distribution - the line is straight with a slope of unity. The Hartlap-scaled Gaussian does not capture the scatter around the peak correctly, which will lead to a misestimate of the parameter errors, even on average. As expected, the discrepancy decreases as more simulations are included in the estimation of (i.e., as ).

### 4.2 Assessment of confidence in higher dimensions

The issue at hand can be studied in higher dimensions by investigating the distribution of the following quantities:

 χ2=(Xo−μ)TΣ−1(Xo−μ) (16)

which is the true ; the same quantity but with the estimated replacing ,

 T2=(Xo−μ)TS−1(Xo−μ); (17)

and the Hartlap-scaled version

 H2=(Xo−μ)TαS−1(Xo−μ). (18)

By construction, we have , meaning the Hartlap-scaling does indeed debias the expectation value. It does however underestimate statistical scatter, as we shall show in the following.

follows the -distribution, which only arises if the covariance is precisely known and indeed the correct covariance of . The quantity will not follow the -distribution, because it contains not only a random vector , but additionally the random estimate of the covariance matrix that follows the Wishart distribution . therefore follows

 T2(n−p+1)pn∼Fp,n−p+1 (19)

where , and the is the F-distribution of and degrees of freedom (Anderson, 2003). Consequently, a change of variables shows that,

 T2∼Γ(n+12)Γ(p/2)Γ[(n−p+1)/2]n−p/2(T2)p/2−1(T2/n+1)n+12. (20)

instead of , see Fig. 4. Only for will the Wishart distribution tend towards a delta-function, and the distribution of will then tend towards a -distribution.

The distribution of the Hartlap-scaled is more sharply peaked than that of , thereby suggesting that the experiment has less statistical scatter than the distribution on average. This is impossible since the distribution is subject to scatter of the random vector only.

The cumulative probabilities or give our confidence that the mean of the multivariate vector is enclosed within an ellipsoid bounded by or . The more slowly rising cumulative distribution function of therefore shows that we need in order to achieve the same confidence that the mean is captured within the confidence contours. In parameter space, this will lead to an increase of the Bayesian confidence intervals.

### 4.3 Reweighting an MCMC chain that sampled from a Gaussian likelihood.

We have shown above that , and follow different distributions, which will affect parameter inference. Often, the error of confusing a with a or can retrospectively be undone with very little numerical effort by reweighting an existing MCMC chain.

In Fig 5 we plot weights for reweighting a chain that sampled from . If a Hartlap-scaling has been applied, it would additionally need to be removed.

We note that the maximum of the -distribution in the full parameter space coincides with the maximum of (and also of ), but once any parameters are marginalised over, the resulting parameter posteriors will not in general peak in the same place.

## 5 Conclusions

We have studied how statistical uncertainties in an estimated covariance matrix affect parameter inference. We summarize our findings as follows.

For data drawn from a multivariate Gaussian, the likelihood will be Gaussian if the data covariance is exactly known. If however the covariance is estimated from simulations, , the estimator is unbiased, but is not an unbiased estimator of . An unbiased estimator is where (Anderson, 2003). An earlier proposal, by Hartlap et al. (2007), uses the unbiased estimate of the inverse covariance matrix, but keeping a Gaussian likelihood. The statistical scatter of the estimator is not fully accounted for, and this yields posteriors that are on average simultaneously too broad in their centres, yet not broad enough in the extremes.

The principled approach is to recognise that we have a sample of the covariance matrix , and compute the likelihood by marginalising over the inverse-Wishart distribution of the true covariance matrix , conditioned on . This gives a modified multivariate -distribution , given by Eq. (12). This is what we require for parameter inference and is the main result of this paper.

For parameter inference in the presence of a covariance matrix estimated from a finite number of simulations, our results imply that MCMC chains should evaluate the modified -distribution Eq. (12) at each sample point, instead of a Gaussian distribution. The numerical complexity will not be increased by this. It stays constant since both distributions must evaluate the quantity . Consequently, a reweighting of existing MCMC chains is possible without much effort if the chains record .

## 6 Acknowledgements

ES acknowledges financial support from the RTG Particle Physics beyond the Standard Model, through the DFG fund 1940 and the transregional collaborative research centre TR 33 ’The Dark Universe’ of the DFG. We thank Andrew Jaffe, Justin Alsing and Ewan Cameron for useful discussions and comments.

### Footnotes

1. pagerange: Parameter inference with estimated covariance matricesReferences
2. pubyear: 2015
3. The power also leads to degrees of freedom in the inverse Wishart distribution, which is an intuitive result. Another power would only change the degrees of freedom, showing that the influence of the prior can be lessened by increasing the number of simulations .

### References

1. Anderson T. W., 2003, An Introduction to Multivariate Statistical Analysis. Wiley Series in Probability and Statistics, 3rd ed.
2. Budavári T., et al. 2003, \apj, 595, 59
3. Dodelson S., Schneider M. D., 2013, \prd, 88, 063537
4. Hamimeche S., Lewis A., 2009, \prd, 79, 083012
5. Hartlap J., Simon P., Schneider P., 2007, A,A, 464, 399
6. Heymans C., et al. 2013, MNRAS, 432, 2433
7. Jeffreys H., 1961, Theory of Probability (OUP)
8. Mardia K. V., Kent J. T., Bibby J. M., 1979, Multivariate analysis. Probability and Mathematical Statistics, London: Academic Press
9. Semboloni E., Mellier Y., van Waerbeke L., Hoekstra H., Tereno I., Benabed K., Gwyn S. D. J., Fu L., Hudson M. J., Maoli R., Parker L. C., 2006, \aap, 452, 51
10. Sun D., Berger J., 2006, Proc. Valencia/ISBA 8th World Meeting on Bayesian Statistics
11. Taylor A., Joachimi B., Kitching T., 2013, \mnras, 432, 1928
12. Wishart J., 1928, Biometrika, 20, 32
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minumum 40 characters