A Complementary Motivation for Matching Moments

# GLASS: A General Likelihood Approximate Solution Scheme

## Abstract

We present a technique for constructing suitable posterior probability distributions in situations for which the sampling distribution of the data is not known. This is very useful for modern scientific data analysis in the era of “big data”, for which exact likelihoods are commonly either unknown, computationally prohibitively expensive or inapplicable because of systematic effects in the data. The scheme involves implicitly computing the changes in an approximate sampling distribution as model parameters are changed via explicitly-computed moments of statistics constructed from the data.

## 1 Introduction

Bayesian inference [1] is now commonly used and understood as being the correct way to learn about models from data. Posteriors for model parameters are related via Bayes’ Theorem to the product of priors for the said parameters and the likelihood function, which is the probability for the data given the model, considered as a function of the model parameters:

 p(q|x)=p(x|q)p(q)p(x). (1)

A difficulty presents itself in making accurate inferences if the true likelihood is not known or is unfeasible to repeatedly calculate. Reasons might include systematics in the data rendering an idealized likelihood unusable, or simply computational cost. One might still have an idea for a good choice of statistics to represent the data. This could be inspired say by “robustness” to systematics, by analogy to analysis procedures for idealized cases, or empirically by investigation of simulations. If certain quantities, such as means and variances, are calculable (or estimatible via simulations) as functions of the model parameters in a reasonable amount of time, one might hope to be able to make some plausible inferences. Whilst in the past people have been able to build approximations heuristically (see e.g. [2, 3, 4] in a cosmological context), here we present a general scheme for constructing suitable likelihoods in such situations. The scheme should be relevant specifically for cosmic microwave background analysis, galaxy redshift surveys and the like, but is of general applicability.

This paper is organised as follows. First, the method is introduced and the main result derived in Sec. 2. Next, examples are presented in Sec. 3. Conclusions and further work are given in Sec. 4. Appendices A, B and C discuss various technical issues regarding the approximation including its derivation, practical use and validity.

## 2 Basic Procedure

The underlying idea is to use the principle of maximum entropy to construct the broadest (i.e. least presumptive) sampling distribution consistent with a) what one assumes and b) with what one has managed to calculate about the statistics of the samples in the context of a model [1]. This, evaluated for the data, is used as the likelihood, which, when multiplied by the prior, gives the approximate posterior for the model given the data as in Eq. (1). (See Appendix A for a complementary motivation for our approach.)

Say, for example, one can calculate the mean and the variance of some statistic for a model parametrized by a parameter . (We use to denote moments and to denote cumulants of the indicated quantities.) Before even calculating anything, one might also have a “prior” on , such as it being positive for example. One then needs to maximise the entropy,

 H(p)=−∫dxp(x)lnp(x)p0(x), (2)

using Lagrange multipliers to impose the desired constraints on the distribution along with normalisation of . This yields

 p(x)=p0(x)exp(−λxx−λxxx2)∫dxp0(x)exp(−λxx−λxxx2). (3)

Generally, the lagrange multipliers must be solved for numerically, repeatedly evaluating and as a function of and until and are obtained. The appropriate multipliers may be denoted and . Substituting these into Eq. (3) gives our approximate sampling distribution for for the model with parameter . If is then found to have some value, our approximate likelihood for is then given by evaluating Eq. (3) for that . By analogy with conventional notation in statistical dynamics we denote the denominator of Eq. (3) by , and we can introduce the “action” (after classical/quantum mechanics) as :

 S(x,q)=−logp0(x)+λx(q)x+λxx(q)x2+logZ(λ(q)). (4)

Multiplying by a desired prior on the parameter , one then has a suitable approximate (unnormalized-) posterior for in light of the data , appropriate for use for inference.

In principle, this procedure is easily extensible to multi-dimensional data , , described by multi-dimensional model parameters , , and to use higher moments. The Lagrange multipliers become labelled by indices, , , and so on and will be implicit functions of the . The entropy becomes a multidimensional integral, and the action is

 Missing or unrecognized delimiter for \left (5)

with summation implied over repeated indices. Here parentheses around indices indicate their symmetrization, e.g. , and we take all Lagrange multipliers to be symmetric since any non-symmetric part would not contribute to (5)1. Now

 Z(λ(qa))=∫dnxp0(x)e−λixi−λ(ij)xixj−λ(ijk)xixjxk−⋯. (6)

One varies the Lagrange multipliers until all of the desired multi-dimensional moments are matched2 and the probability for the distribution becomes

 p(x|q)dnx=e−S(x,q)dnx. (7)

In practice, the procedure rapidly becomes difficult to perform as the dimension increases, because of the increasing difficulty of evaluating the multidimensional numerical integrals required to solve explicitly for the Lagrange multipliers.

However, within a given class of models, we can get away without having to solve explicitly for the Lagrange multipliers as follows. Let us introduce the vector to denote . We may indicate a specific component of with a superscript . (Nb. this component could contain one, two or more powers of the .) Similarly, we can introduce to denote the vector of associated Lagrange multipliers, with components . Now we derive a set of relations between moments from the distribution given by Eq. (7). From the form of (7), we have:

 ⟨XI⟩(λ) = −∂logZ∂λI, (8) ⟨⟨XIXJ⟩⟩(λ) = ∂2logZ∂λJ∂λI. (9)

If we have managed to find such that the desired are obtained in a neighbourhood of , then we can consider differentiating Eq. (8) with respect to (sometimes denoting by the shorthand ):

 ⟨XI⟩,a = −∂2logZ∂λJ∂λI∂λJ∂qa (10) = −⟨⟨XIXJ⟩⟩λJ,a.

Meanwhile, differentiating the action (5) with respect to gives

 S,a=(XI−⟨XI⟩)λI,a (11)

with the term coming from the via Eq. (8). It is worthwhile noting that the “prior” on the data has disappeared explicitly. Now, we should be able to invert Eq. (10) to solve for the in terms of the derivatives of the first moments and the second cumulants :

 λ,a=−⟨⟨XXT⟩⟩−1⟨X⟩,a (12)

(adopting a matrix notation). Substituting into Eq. (11) we obtain our main result:

 S,a=−(X−⟨X⟩)T⟨⟨XXT⟩⟩−1⟨X⟩,a, (13)

in which the Lagrange multipliers do not appear.

The scheme is then to obtain the desired moments of the , their derivatives with respect to the and their second cumulants from the theory in question, ideally by calculation or potentially also by simulation. One then uses them in Eq. (13) to obtain the gradient. This gradient is then integrated between two points in parameter space to obtain the difference in between them.

One option for a likelihood would be to integrate up from a fiducial choice of to the values in question. (Note that as the gradient generally varies along the path and the integration takes this into account, such a likelihood would not generally be a linear expansion in parameter shifts around a fiducial model.) Alternatively one might integrate in steps between two nearby models under consideration in an MCMC chain for example. When models vary smoothly with their parameters, the Romberg integration method (see e.g. [5]) has been found to work well and to converge quickly. In multi-dimensional situations one can choose a path in parameter space and perform a line integral, expressing the one-dimensional gradient along the path in terms of the partial derivatives of and the rate of change of parameters along the path using the chain rule. (Alternatively, one can use directly in a sampling method that only uses the gradient of the likelihood.)

Equation (13) is framed in terms of the means of the , their derivatives with respect to parameters, and their covariance. With the being powers of the , such objects are expressible in terms of cumulants of the latter and their derivatives with respect to parameters. This allows one to formulate a version of Eq. (13) “reduced down” from the to the , which may be easier to handle if the theory more directly gives cumulants of the rather than of the . This is discussed further in Appendix B.

There is an approximation that is not at first sight obvious in this procedure. Equation (13) applies for the distribution defined by Eq. (7). While agrees by construction between the underlying sampling distribution and its approximation in (7), not all of the moments required in need necessarily match. So, to use as calculated from the underlying sampling distribution in Eq. (7) constitutes to using an exact result as an approximation to a term in the approximation. In multiple dimensions this can lead to a breakdown in analyticity, causing the change in to become path-dependent in parameter space. Appendix C further discusses these consistency issues and potential mitigation strategies. In practice, one should include enough terms in to well-describe the parameter-dependent part of the sampling distribution. Then Eq. (7) should yield a good approximate likelihood. Reassurance in results obtained with the scheme might come through checking for stability under variation of the number of moments constrained. The quality of the likelihood might also be judged empirically by testing its performance on suitable simulations of the data.

## 3 Examples

Here we examine the theory for some test cases in which the true sampling distribution is actually known. The cases are motivated from cosmic microwave background (CMB) analysis, in which one computes power spectra of spherical harmonic coefficients of (assumed-Gaussian) fields on the sky (see e.g. [6, 7]).

### 3.1 Auto Power Spectrum Example

Imagine one has independent Gaussianly-distributed variables with zero mean and the same variance that we are wishing to learn about. Then the sampling distribution for the is just a Gaussian,

 p(y|C)d2l+1y=d2l+1y(2πC)(2l+1)/2e−∑iyi22C (14)

and we can see that a sufficient statistic ,

 ^C≡12l+1∑iyi2, (15)

exists with sampling distribution

 p(^C|C)d^C∝^Cl−1/2d^CCl+1/2e−(l+1/2)^CC. (16)

The associated minus-log-likelihood, normalized to zero at its minimum, is

 Strue=(l+1/2)(^CC+logC^C−1), (17)

where we have dropped a which will not affect the change in with respect to . Let us apply our method to this problem. We shall use for our , and choose to work to the lowest order possible, only constraining . From Eq. (14) we can calculate the mean and variance of :

 ⟨^C⟩ = C, (18) ⟨⟨^C2⟩⟩ = 22l+1C2. (19)

Substituting in to Eq. (13), we have

 ∂S∂C=−(^C−C)2l+12C2 (20)

and integrating this up actually reproduces the exact result (17)!

It is informative to repeat this exercise to the next level in approximation, considering constraining both and , requiring knowledge of up to the fourth cumulant of . One again recovers the exact result, the additional terms cancelling in the formula for .

The form (16) of the sampling distribution for is linear in in the exponent. Hence it is expressible exactly with only a finite number of terms in the form (5) that the scheme assumes, explaining why the procedure does so well in this case.

### 3.2 Correlated Power Spectra Example

The next case to consider is when (14) is generalised for to become a vector of components, between which cross-correlations may be present. Then the sampling distribution becomes

 Extra open brace or missing close brace (21)

with  being the covariance matrix for the components of . The components of the power spectra:

 Extra open brace or missing close brace (22)

are again seen to be sufficient statistics for inferences about .

Again, from knowledge of and as a function of the model our procedure recovers the true form of the likelihood:

 Strue=(l+1/2)⎛⎝trC−1^C+log|C|∣∣^C∣∣−1⎞⎠. (23)

Again, we only need the linear constraint; repeating the procedure to quadratic order yields the same result. One can check explicitly that the moments of  satisfy Eq. (50) of Appendix C, as needed to get both the first and second moments exactly right with only the linear constraint.

### 3.3 Cross-Spectrum Example

Our final example is more challenging and might be considered a non-trivial test of the scheme. Taking the case above, for a two-component vector, we may write  as

 C=(C+N11CCC+N22) (24)

and assume we are interested in making inferences about . (For example, we may imagine the two components to be measurements of the same underlying field contaminated with independent Gaussian noise.) We may not know the noise levels and well enough to trust using them in a full likelihood using all the components of . Instead, we may try and build a likelihood using the cross-spectrum alone. Such a likelihood will hopefully be less at risk of bias in inferences about . Actually, an analytic expression for the sampling distribution for is known (see [4] for a recent use in the context of the CMB) which we can use to compare our approximate likelihoods to.

In Fig. 1 we show the distribution of for 10,000 realizations and compare this to the aforementioned analytic result.

Given the Gaussianity of the ’s, we can compute cumulants of the cross spectrum:

 ⟨^C12⟩ = C, (25) (2l+1)⟨⟨^C212⟩⟩ = C2+(C+N11)(C+N22), (26) (2l+1)2⟨⟨^C312⟩⟩ = 2C3+6C(C+N11)(C+N22), (27) (2l+1)3⟨⟨^C412⟩⟩ = 6(C4+(C+N11)2(C+N22)2 (28) +6C2(C+N11)(C+N22))

and so on. Using such cumulants we can numerically integrate up (13) for a selection of degrees of approximation (linear to quartic, requiring from up to quadratic to up to 8th order cumulants) and for a variety of instructive data “realizations”. Indeed, one does well to remember that some data point could be well into the tail of the sampling distribution, particularly for multi-dimensional data. Therefore it is important to check the validity of a likelihood approximation for reasonable models when some of the data is rare. Shown in Figs. 2, 3 and 4 are posteriors for (assuming a uniform prior on ) for and respectively. It is interesting to note how well the approximations work, even for very rare data values. For the low tail value, the basic linear approximation behaves qualitatively correctly for plausible models, and as the degree of the approximation increases, the approximation approaches the true posterior. For the high tail value and particularly the middle value, even the linear approximation works relatively well.

## 4 Conclusion and Further Work

The technique presented here has some particular strengths:

Theoretically-underpinned

The principle of maximum entropy ensures that the procedure uses the information it is given and makes minimal assumptions beyond that.

Calculation-based

The approximation nowhere requires the use of simulations, rather it requires the calculation of cumulants (though one could indeed numerically estimate some of these, assuming a sufficient number of simulations are available, to use in the scheme if desired).

Extensible

By looking for any change in the distribution as one adds in further constraints, one can build up a feel of when the approximation is “good enough”. (Tests of the scheme against a limited number of realistic simulations can empirically build confidence in the approximation also.)

For multi-dimensional problems, it would be useful to understand the error in the log-likelihood approximation in more detail coming from the potential path-dependence of the result in the parameters plane. If this error could be estimated to be small it might be possible then, for example, to safely use a linear approximation instead of a quadratic one (even though the argument given in Appendix C suggests that the latter should be more generally applicable). A complementary step would be to develop options for manipulating higher cumulants in order to improve the analyticity of the approximation.

Applications to CMB analysis with multi-dimensional data and tests against simulations will be presented in [8].

## Acknowledgements

I thank Anthony Challinor, George Efstathiou and Antony Lewis for many helpful comments and discussions over the development of this work, and Terry Iles, Barry Nix and Andrew Pontzen for useful comments on a draft version of this paper.

## Appendix A Complementary Motivation for Matching Moments

Section 2 derives an approximate sampling distribution using maximum entropy, using Lagrange multipliers to enforce the matching of the moments of this approximate distribution with those calculated for the underlying one. Here we present a complementary motivation for matching moments. The starting point is the Kullback-Leiber divergence

 DKL(p,q)=−∫dnxp(x)logq(x)p(x) (29)

which quantifies how different the probability distribution is from . can be thought of as the mean of the difference in minus log probability between the approximation and the true with the average taken over , and is minimized for (see also the discussion in [9]). Now, imagine we wish to approximate with a form for compatible with Eq. (5), i.e.

 q(x)=p0(x)e−α−λixi−λijxixj−⋯ (30)

with a finite polynomial in the in the exponent. Here we think of and the , …  as parameters that we shall vary to minimize subject to the constraint that is normalized. Substituting into (29), we have:

 DKL(p,q)=∫dnxp(x)(α+λixi+λijxixj+⋯+logp(x)p0(x)). (31)

We can impose the normalization constraint with a Lagrange multiplier . Varying with respect to

 ∂∂α(DKL(p,q)−β∫dnxq(x)) = 0 (32) ⟹∫dnxp(x)−β∫dnxq(x) = 0 (33)

shows we must take . Varying with respect to and substituting in then tells us that

 ∫dnxp(x)xi…xj−∫dnxq(x)xi…xj. (34)

Thus after minimization the moments of the that appear in the exponent in (30) computed for the approximate distribution must match those computed for the underlying distribution. (Note that it is not necessary for the underlying distribution to be explicitly given, only that its appropriate moments be known.)

So, finding the broadest probability distribution consistent with constraints on certain moments yields the same distribution as that coming from minimizing the Kullback-Leibler divergence of the associated functional form from the unknown underlying distribution.

## Appendix B Solving for the Lagrange Multiplier Derivatives

Moments/cumulants of the and their derivatives are derivable from moments/cumulants of the and their derivatives. Indeed, cumulants of the are typically the things that are most straightforwardly obtainable from parametric models (or simulations). Hence it is useful to be able to relate -based quantities to -based ones. By inspection of Eq. (7), we see that

 ∂Z∂λij = −∂2Z∂λi∂λj, (35) ∂Z∂λijk = ∂3Z∂λi∂λj∂λk, (36)

or generally:

 −∂∂λi…j…−∂∂λp…qZ = −∂∂λi…−∂∂λj…−∂∂λp…−∂∂λqZ. (37)

This allows us to obtain general moments of the in terms of the moments of the , giving us a straightforward route to obtaining the cumulants of the needed for Eq. (13).

The relation (37) above suggests an alternative way of getting at the derivatives of the Lagrange multipliers with respect to the parameters: we may start with the cumulants of the , and then differentiate them with respect to the parameters . For example, with distributed according to Eq. (7), one has

 ∂⟨⟨xixj⟩⟩∂qa = ∂∂λr(∂2logZ∂λj∂λi)∂λr∂qa+ (38) ∂∂λrs(∂2logZ∂λj∂λi)∂λrs∂qa+ ….

Using commutativity of partial derivatives, the derivative in the second term for example may then be rewritten as:

 ∂2∂λj∂λi∂logZ∂λrs (39)

and then we may use

 ∂logZ∂λrs = 1Z∂Z∂λrs=−1Z∂2Z∂λs∂λr (40) = −1Z∂∂λs(Z∂logZ∂λr) (41) = −∂logZ∂λs∂logZ∂λr−∂2logZ∂λs∂λr (42)

to express the coefficient of in terms of cumulants of . With denoting derived from a distribution of the form in Eq. (7), we have:

 ⎛⎜ ⎜⎝κiκ(ij)⋮⎞⎟ ⎟⎠,a=−⎛⎜ ⎜ ⎜⎝κipκipq+2κi(pκq)⋯κijpκijpq+2κij(pκq)+2κi(pκq)j⋯⋮⋮⋱⎞⎟ ⎟ ⎟⎠⋅⎛⎜ ⎜⎝λpλ(pq)⋮⎞⎟ ⎟⎠,a. (43)

If we let denote the vector of cumulants of the and denote the vector of Lagrange multipliers, then we may write

 κ,a=−Mλ,a (44)

defining to be the big matrix on the right hand side of Eq. (43). Assuming is invertible, we thus have

 λ,a=−M−1κ,a, (45)

to be compared with Eq. (12).

### b.1 Cumulants as Parameters

It is often natural to choose some of the constrained cumulants as the parameters themselves. For example, one might imagine that some underlying theory determines all of the cumulants of the in terms of a small set of parameters. One might wish to compare different underlying models with different fundamental parameterizations against the same data set. In this case one might first construct a generic likelihood in which the cumulants of the are set directly. For example, if one uses unbiased estimators of parameters as the , then by construction their first moments are the parameters. The higher cumulants might then be functions of the same parameters.

### b.2 Expanding around a Gaussian

One can take the second derivative of Eq. (45) to find

 λ,ab=M−1M,bM−1κ,a−M−1κ,ab (46)

and hence expand the action to second order around a fiducial model. A natural choice for a fiducial model is a gaussian. Then, in conjunction with the suggestions above about using cumulants as the parameters themselves, we can compute the change in the action to second order in the higher cumulants and . The matrix is upper-diagonal for a gaussian and its inverse can be analytically computed. This expansion may be compared to an Edgeworth expansion.

## Appendix C Consistency of Approximation

Eq. (43) allows one to begin to see for which circumstances a mooted approximation is possible or not. Our procedure consists of setting some of the cumulants as functions of the parameters as desired, and then hoping we can find a set of corresponding and other cumulants such that Eq. (43) can be consistent. (There is some freedom in the cumulants corresponding to varying the “prior” term .)

Imagine for example we want a situation in which the dimension of the data matches the number of model parameters, and we think that a likelihood constructed only from constraints on the means of the should suffice. Then only the first (block-)column of the “big” matrix in Eq. (43) is relevant:

 ⎛⎜ ⎜⎝κiκ(ij)⋮⎞⎟ ⎟⎠,a=−⎛⎜ ⎜⎝κipκijp⋮⎞⎟ ⎟⎠⋅(λp),a. (47)

Given that we can compute covariances, the first (block-)row of Eq. (47) then allows us to solve for some putative . However, the second (block-)row of Eq. (47) also needs to be satisfied, and then the third and so on. One consistent solution, for example, occurs when the covariance is independent of the model and cumulants higher than second order vanish. In one dimension, for arbitrary mean and variance, we can actually successively determine higher and higher cumulants to formally solve all rows of Eq. (47).

A counting argument suggests a general solution however is impossible in dimensions greater than one. First, note that the cumulant with factors, each one of variables, has independent terms. Then the th block-row involves numbers on the left, the possible derivatives of each of the terms of the th order cumulant. But the th block-row of the big matrix has only numbers to vary, coming from the th order cumulant. This is not enough (for ), being a factor of too small. So, unless appropriate functional relations exist between the cumulants of the model, this form of desired likelihood is unattainable.

By a similar counting argument, allowing the approximate likelihood to involve quadratic constraints does not allow for solutions either. The second (block-)column introduces the th power cumulant, with its numbers, into play,

 ⎛⎜ ⎜⎝κiκ(ij)⋮⎞⎟ ⎟⎠,a=−⎛⎜ ⎜⎝κipκipq+2κi(pκq)κijpκijpq+2κij(pκq)+2κi(pκq)j⋮⋮⎞⎟ ⎟⎠⋅(λpλ(pq)),a. (48)

This is a factor relative to that needed to match the left hand side, for the th (block-)row. So, for sufficiently large , , there is not enough freedom available in the cumulant. Turning this around though, we might only expect difficulties to become acute below some dimensionality up to a given order in the cumulant. As the quadratic approximation needs up to the fourth cumulant, one might suspect the scheme has a chance of working reasonably well for .

For an alternative perspective, consider Eq. (10) again, If the right hand side is indeed to be the derivatives of analytic functions, we need:

 (λ,a),b = (λ,b),a i.e. (⟨⟨XXT⟩⟩−1⟨X⟩,a),b = (⟨⟨XXT⟩⟩−1⟨X⟩,b),a (49)

or

 ⟨⟨XXT⟩⟩,b⟨⟨XXT⟩⟩−1⟨X⟩,a=⟨⟨XXT⟩⟩,a⟨⟨XXT⟩⟩−1⟨X⟩,b (50)

to hold. Some of the higher cumulants might then better be chosen in such a way as to satisfy Eq. (50), rather than to be equal to those calculated from the underlying theory.

### Footnotes

1. Alternatively one could demand, for example, that only multipliers with non-decreasing indices are potentially nonzero.
2. We typically consider matching all moments up to a given order, but in certain circumstances one might wish to match only a subset of the moments. In a two-dimensional problem for example, with variables and , one might be able to calculate all the first and second moments, but only the “auto” cubic moments and and not the “cross” ones such as . In that case only the Lagrange multipliers corresponding to considered terms should be varied and the other ones, such   in this example, should be ignored/taken to be zero.

### References

1. E. T. Jaynes. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
2. Samira Hamimeche and Antony Lewis. Likelihood Analysis of CMB Temperature and Polarization Power Spectra. Phys. Rev., D77:103013, 2008.
3. P. A. R. Ade et al. Planck 2013 results. XV. CMB power spectra and likelihood. Astron. Astrophys., 571:A15, 2014.
4. A. Mangilli, S. Plaszczynski, and M. Tristram. Large-scale cosmic microwave background temperature and polarization cross-spectra likelihoods. Mon. Not. Roy. Astron. Soc., 453(3):3174–3189, 2015.
5. W.H. Press, S.A. Teukolsky, W.T. Vetterling, and B.P. Flannery. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, 1986.
6. P. A. R. Ade et al. Planck 2015 results. XIII. Cosmological parameters. Astron. Astrophys., 594:A13, 2016.
7. N. Aghanim et al. Planck 2015 results. XI. CMB power spectra, likelihoods, and robustness of parameters. Astron. Astrophys., 594:A11, 2016.
8. George Efstathiou and Steven Gratton. In preparation.
9. R. H. Leike and T. A. Enßlin. Optimal Belief Approximation. arXiv:1610.09018, 2016.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minumum 40 characters