Bias in Discrete Surveys

The Bias of the Log Power Spectrum for Discrete Surveys

Abstract

A primary goal of galaxy surveys is to tighten constraints on cosmological parameters, and the power spectrum $P(k)$ is the standard means of doing so. However, at translinear scales $P(k)$ is blind to much of these surveys’ information – information which the log density power spectrum $P_A(k)$ recovers. For discrete fields (such as the galaxy density), $A^*$ denotes the statistic analogous to the log density: $A^*$ is a ‘sufficient statistic’ in that its power spectrum (and mean) capture virtually all of a discrete survey’s information. However, the power spectrum of $A^*$ is biased with respect to the corresponding log spectrum for continuous fields, and to use $P_{A^*}(k)$ to constrain the values of cosmological parameters, we require some means of predicting this bias. Here we present a prescription for doing so; for Euclid-like surveys (with cubical cells 16 Mpc across) our bias prescription’s error is less than 3 per cent. This prediction will facilitate optimal utilization of the information in future galaxy surveys.

keywords:
cosmology: theory – cosmological parameters – cosmology: miscellaneous

1 Introduction

Cosmology seeks to determine the precise values of cosmological parameters, and this endeavor demands full utilization of the information in galaxy surveys. The power spectrum $P(k)$ of the overdensity field $\delta$ is the standard summary statistic for this purpose. However, pushing surveys to increasingly small scales does not proportionately increase the Fisher information in $P(k)$ (Carron et al. 2015), since much of the survey’s information escapes from $P(k)$ at translinear wavenumbers ().

To recover this information, Carron & Szapudi (2013) introduce the concept of sufficient statistics, namely, observables which capture all of a field’s information. They show that the log density $A = \ln(1+\delta)$ closely approximates a sufficient statistic, so that its power spectrum $P_A(k)$ (Neyrinck et al. 2009) and mean $\bar{A}$ embody essentially all of the survey’s information. Repp & Szapudi (2017b) provide a simple fit for $P_A(k)$, and Repp & Szapudi (2017a) provide a related prescription for $\bar{A}$; they also show that a Generalized Extreme Value (GEV) distribution describes $P(A)$ well. Other characterizations of the continuous density field include those of Uhlemann et al. (2016), Shin et al. (2017), and Klypin et al. (2017).

However, the $A$-statistic does not directly apply to discrete fields like the galaxy density, and thus Carron & Szapudi (2014) investigate $A^*$, the optimal observable for discrete fields. Wolk et al. (2015) show that the power spectrum $P_{A^*}(k)$ of this statistic is biased with respect to the (continuous) log spectrum $P_A(k)$. Since our ultimate purpose is to use $P_{A^*}(k)$ to constrain the values of cosmological parameters, we require a means of predicting this power spectrum for any reasonable parameter set – and thus we require an accurate description of this bias.

This Letter presents an a priori means of predicting this $A^*$-bias for surveys. Using multiple realizations of the Millennium Simulation dark matter distribution (Springel et al. 2005), we show that our prescription is accurate (within 5–6 per cent) for pixel number densities 1. For Euclid-like surveys, the accuracy is better than 3 per cent for survey cells with side length 16 Mpc.

The structure of this Letter is as follows: Section 2 characterizes the sufficient statistic $A^*$, and Section 3 provides our bias prescription; we quantify its accuracy in Section 4 and summarize in Section 5.

2 The Discrete Sufficient Statistic A∗

Galaxy counts represent a discrete sampling of the underlying dark matter distribution. Constructing an optimal observable for such fields requires knowledge of two distributions. First, one must characterize the underlying continuous probability distribution $P(\delta)$ – or, in log space, $P(A)$. Second, one must specify the conditional probability of observing $N$ objects in a survey cell given an underlying value of $\delta$ (or $A$): that is, one requires an expression for $P(N|A)$. The most widespread of such schemes is local Poisson sampling, where for a given dark matter log density $A$,

 $$P(N|A) = \frac{1}{N!}\left(\bar{N}e^{A}\right)^{N}\exp\left(-\bar{N}e^{A}\right), \tag{1}$$

$\bar{N}$ being the mean number of galaxies per survey cell. Following Carron & Szapudi (2014), we define $A^*$ as the value of $A$ that maximizes $P(A)P(N|A)$, so that for an observed number $N$ of objects in a cell, $A^*(N)$ is the Bayesian reconstruction of the dark matter log density in that survey pixel. Carron & Szapudi (2014) show that $A^*$ closely approximates a sufficient statistic for galaxy surveys, and thus the power spectrum $P_{A^*}(k)$ of the $A^*$-field (together with its mean $\overline{A^*}$) contains essentially all cosmological information present in the set of galaxy counts.

Carron & Szapudi (2014) specifically consider the case of a lognormal matter distribution with local Poisson sampling. They show that in this case, $A^*$ is the solution of the equation

 $$e^{A^*} + \frac{A^*}{\bar{N}\sigma_A^2} = \frac{N - 1/2}{\bar{N}}. \tag{2}$$
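
As a rough illustration of how Equation 2 can be solved in practice, the sketch below finds $A^*(N)$ by bisection, relying on the fact that the left-hand side increases monotonically with $A^*$. The function name `a_star_lognormal` and the parameter values ($\bar{N} = 1$, $\sigma_A^2 = 0.25$) are illustrative assumptions, not values taken from the Letter.

```python
import math


def a_star_lognormal(n, nbar, sigma2):
    """Solve Equation 2, e^A + A/(nbar*sigma2) = (n - 1/2)/nbar, for A.

    n      : observed count in the cell
    nbar   : mean count per cell
    sigma2 : variance of the log density A
    """
    target = (n - 0.5) / nbar

    def lhs(a):
        return math.exp(a) + a / (nbar * sigma2)

    # lhs is strictly increasing and unbounded in both directions,
    # so a bracketing interval can always be found by doubling.
    lo, hi = -1.0, 1.0
    while lhs(lo) > target:
        lo *= 2.0
    while lhs(hi) < target:
        hi *= 2.0

    # Plain bisection; 200 halvings is far more than double precision needs.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical example values: nbar = 1 object per cell, sigma_A^2 = 0.25.
    for count in range(5):
        print(count, a_star_lognormal(count, nbar=1.0, sigma2=0.25))
```

Even an empty cell ($N = 0$) maps to a finite $A^*$, which is what allows the statistic to be defined where the logarithm of the raw counts would diverge.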

Although the lognormal distribution is a good description of the projected (two-dimensional) matter distribution, for the three-dimensional field a Generalized Extreme Value (GEV) distribution (Repp & Szapudi 2017a) fits better:

 $$P(A) = \frac{1}{\sigma_G}\, t(A)^{1+\xi}\, e^{-t(A)}, \tag{3}$$

where

 $$t(A) = \left(1 + \frac{A - \mu_G}{\sigma_G}\,\xi\right)^{-1/\xi}. \tag{4}$$

Here, $\mu_G$, $\sigma_G$, and $\xi$ are, respectively, location, scale, and shape parameters which depend on the mean $\bar{A}$, variance $\sigma_A^2$, and skewness $\gamma_1$ of $A$ as follows:

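 $$\gamma_1 = -\frac{\Gamma(1-3\xi) - 3\Gamma(1-\xi)\Gamma(1-2\xi) + 2\Gamma^3(1-\xi)}{\left(\Gamma(1-2\xi) - \Gamma^2(1-\xi)\right)^{3/2}} \tag{5}$$
 $$\sigma_G = \sigma_A\,\xi \cdot \left(\Gamma(1-2\xi) - \Gamma^2(1-\xi)\right)^{-1/2} \tag{6}$$
 $$\mu_G = \bar{A} - \sigma_G\,\frac{\Gamma(1-\xi) - 1}{\xi} \tag{7}$$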
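
To make Equations 3 and 4 concrete, the following sketch evaluates the GEV density for given $(\mu_G, \sigma_G, \xi)$ and checks its normalization numerically. The parameter values are hypothetical and chosen only for illustration; they are not the Repp & Szapudi (2017a) fits, and the helper names are likewise assumptions.

```python
import math


def gev_pdf(a, mu_g, sigma_g, xi):
    """GEV density of Equations 3 and 4, evaluated in log space for stability."""
    u = 1.0 + xi * (a - mu_g) / sigma_g
    if u <= 0.0:
        return 0.0                      # outside the support of the distribution
    log_t = -math.log(u) / xi           # ln t(A), from Equation 4
    if log_t > 700.0:
        return 0.0                      # exp(-t) underflows to zero here
    t = math.exp(log_t)
    return math.exp((1.0 + xi) * log_t - t) / sigma_g


def trapezoid(func, lo, hi, n_steps):
    """Composite trapezoid rule for func on [lo, hi]."""
    h = (hi - lo) / n_steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, n_steps):
        total += func(lo + i * h)
    return total * h


if __name__ == "__main__":
    # Hypothetical parameters (assumed, not fitted values).
    mu_g, sigma_g, xi = -0.3, 0.5, -0.1
    upper = mu_g - sigma_g / xi         # for xi < 0 the support ends here
    norm = trapezoid(lambda a: gev_pdf(a, mu_g, sigma_g, xi), -4.0, upper, 4000)
    print("integral of P(A):", norm)
```

For a negative shape parameter the distribution has a hard upper limit at $A = \mu_G - \sigma_G/\xi$, which any numerical treatment of $P(A)$ needs to respect.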

For Poisson sampling of a GEV distribution, we take Equations 1 and 3 as $P(N|A)$ and $P(A)$. To calculate $A^*$ we can thus maximize the logarithm of their product, namely the expression

 $$-\ln\sigma_G + (1+\xi)\ln t(A) - t(A) - \ln N! - \bar{N}e^{A} + N\left(\ln\bar{N} + A\right); \tag{8}$$

requiring the derivative with respect to $A$ to vanish, we obtain the following equation for $A^*$ in the GEV case:

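 $$\frac{1}{\sigma_G}\left(1 + \frac{A^* - \mu_G}{\sigma_G}\,\xi\right)^{-1-\frac{1}{\xi}} + N = \frac{1+\xi}{\sigma_G + (A^* - \mu_G)\,\xi} + \bar{N}e^{A^*}. \tag{9}$$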

It is Equation 9 which we use, together with the fits in Repp & Szapudi (2017a), to calculate $A^*$ throughout this Letter; nevertheless, the bias formulas we derive in Section 3 are independent of the particular choice of $P(A)$ or $P(N|A)$.
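
One possible way to solve Equation 9 numerically is sketched below. It assumes $-1 < \xi < 0$, in which case Equation 8 is concave in $A$ and the residual of Equation 9 changes sign exactly once below the upper edge of the GEV support, so bisection suffices. The function names and the parameter values in the usage lines are illustrative assumptions rather than the Letter's own implementation or fits.

```python
import math


def eq8_objective(a, n, nbar, mu_g, sigma_g, xi):
    """Equation 8: ln[P(A) P(N|A)] for a GEV prior with Poisson sampling."""
    u = 1.0 + xi * (a - mu_g) / sigma_g
    log_t = -math.log(u) / xi
    return (-math.log(sigma_g) + (1.0 + xi) * log_t - math.exp(log_t)
            - math.lgamma(n + 1) - nbar * math.exp(a)
            + n * (math.log(nbar) + a))


def eq9_residual(a, n, nbar, mu_g, sigma_g, xi):
    """Left side minus right side of Equation 9 (the derivative of Equation 8)."""
    u = 1.0 + xi * (a - mu_g) / sigma_g
    left = u ** (-1.0 - 1.0 / xi) / sigma_g + n
    right = (1.0 + xi) / (sigma_g + (a - mu_g) * xi) + nbar * math.exp(a)
    return left - right


def a_star_gev(n, nbar, mu_g, sigma_g, xi):
    """Bisection solve of Equation 9; this sketch assumes -1 < xi < 0."""
    if not -1.0 < xi < 0.0:
        raise ValueError("this sketch assumes -1 < xi < 0")
    upper = mu_g - sigma_g / xi          # upper edge of the GEV support
    hi = upper - 1e-9 * sigma_g          # residual is negative just below the edge
    lo = mu_g
    while eq9_residual(lo, n, nbar, mu_g, sigma_g, xi) <= 0.0:
        lo -= sigma_g                    # step down until the residual is positive
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if eq9_residual(mid, n, nbar, mu_g, sigma_g, xi) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical parameters (assumed, not fitted values).
    params = dict(nbar=1.0, mu_g=-0.3, sigma_g=0.5, xi=-0.1)
    for count in range(6):
        print(count, a_star_gev(count, **params))
```

Because $A^*$ depends only on the integer $N$ once the distribution parameters and $\bar{N}$ are fixed, a survey pipeline would typically tabulate $A^*(N)$ once and apply it as a lookup to every cell.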

Since our ultimate purpose is to use the power spectrum of $A^*$ to constrain the values of cosmological parameters, we require a means of predicting this power spectrum for any reasonable parameter set. Repp & Szapudi (2017b) show how to predict $P_A(k)$, the power spectrum of the continuous log density field, to an accuracy of a few per cent. However, Fig. 1 shows that passage from $P_A(k)$ to $P_{A^*}(k)$ introduces three effects. The most salient of these is the bias of $P_{A^*}(k)$ as a whole (Wolk et al. 2015), evident in the figure as a vertical shift. Also apparent is a discreteness plateau at high wavenumbers, analogous to the shot noise term in $P(k)$ (though the analogy is inexact due to the nonlinearity of the log- and $A^*$-transforms). Finally, there is a slight change of shape at intermediate wavenumbers, such that multiplying $P_A(k)$ by the bias and then adding a constant does not completely reproduce the shape of $P_{A^*}(k)$.

The bias is the factor most important for parameter constraint and is thus the focus of this Letter. Hence, ignoring high-$k$ effects, we write

 $$P_{A^*}(k) = b_{A^*}^2\, P_A(k). \tag{10}$$

Note that by convention the bias factor applies to the underlying fields (), so it is $b_{A^*}^2$ which multiplies the power spectrum. Note also that Wolk et al. (2015) use $b_{A^*}^2$ to refer to the reciprocal of the quantity so denoted in Equation 10.

3 Predicting the Bias of A∗

Passing from Fourier to real space, Equation 10 implies that

 $$\xi_{A^*}(r) = b_{A^*}^2\, \xi_A(r), \tag{11}$$

where $\xi(r)$ is the two-point correlation function. Now let $f(A_1, A_2)$ be the joint probability distribution function for two points separated by a distance $r$. Then

 $$\xi_A(r) = \int dA_1\, dA_2\, (A_1 - \bar{A})(A_2 - \bar{A})\, f(A_1, A_2), \tag{12}$$

where $\bar{A}$ denotes the mean value of $A$. To write the analogous expression for $\xi_{A^*}(r)$, we begin with the joint $A^*$-distribution:

 $$f_{A^*}(A^*_1, A^*_2) = \int dA_1\, dA_2\, f(A_1, A_2)\, P(A^*_1|A_1)\, P(A^*_2|A_2). \tag{13}$$

Given a distribution $P(A)$ and a mean number density $\bar{N}$, every natural number $N$ corresponds to a discrete value $A^*(N)$ obtainable from Equation 9. We may thus slightly abuse the notation and, in the integrand of Equation 13, write $N$ instead of the corresponding $A^*$:

 $$f_{A^*}(A^*_1, A^*_2) = \int dA_1\, dA_2\, f(A_1, A_2)\, P(N_1|A_1)\, P(N_2|A_2), \tag{14}$$

where $A^*_1 = A^*(N_1)$ and $A^*_2 = A^*(N_2)$. Therefore the expression analogous to Equation 12 (summing over the discrete variable $N$ rather than integrating) is

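 $$\xi_{A^*}(r) = \sum_{N_1, N_2} (A^*_1 - \overline{A^*})(A^*_2 - \overline{A^*})\, f_{A^*}(A^*_1, A^*_2) \tag{15}$$
 $$\phantom{\xi_{A^*}(r)} = \sum_{N_1, N_2} (A^*_1 - \overline{A^*})(A^*_2 - \overline{A^*}) \int dA_1\, dA_2\, f(A_1, A_2)\, P(N_1|A_1)\, P(N_2|A_2), \tag{16}$$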

where (as throughout this section) $A^*_1$ and $A^*_2$ are the values of $A^*$ corresponding to $N_1$ and $N_2$, respectively.

We now require an expression for $f(A_1, A_2)$. If the correlation of $A_1$ and $A_2$ is weak (as expected on large scales), we can work to first order in $\xi_A(r)$. For a bivariate Gaussian distribution, expansion to this order yields

 $$f(A_1, A_2) = P(A_1)\, P(A_2) \left\{ 1 + \xi_A(r)\, \frac{(A_1 - \bar{A})(A_2 - \bar{A})}{\sigma_A^4} \right\}. \tag{17}$$

Equation 17 is our ansatz for the joint distribution of $A$ in the weak correlation limit. In other words, we assume that the distribution has the (joint) behavior of a Gaussian to first order in $\xi_A$.
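
The first-order expansion behind Equation 17 can be checked numerically for the Gaussian case. The sketch below compares the exact bivariate Gaussian density (equal variances, correlation coefficient $\xi_A/\sigma_A^2$) with the right-hand side of Equation 17; the function names and test values are assumptions chosen for illustration.

```python
import math


def gauss_pdf(a, mean, sigma):
    """One-point Gaussian density P(A)."""
    z = (a - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bivariate_gauss_pdf(a1, a2, mean, sigma, xi_a):
    """Exact joint density for equal variances and covariance xi_a."""
    rho = xi_a / sigma ** 2             # correlation coefficient
    z1 = (a1 - mean) / sigma
    z2 = (a2 - mean) / sigma
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (2.0 * (1.0 - rho * rho))
    norm = 2.0 * math.pi * sigma ** 2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-quad) / norm


def first_order_joint(a1, a2, mean, sigma, xi_a):
    """Right-hand side of Equation 17."""
    correction = xi_a * (a1 - mean) * (a2 - mean) / sigma ** 4
    return gauss_pdf(a1, mean, sigma) * gauss_pdf(a2, mean, sigma) * (1.0 + correction)


if __name__ == "__main__":
    # Assumed test point and weak correlation (xi_A / sigma_A^2 = 0.01).
    mean, sigma, xi_a = -0.125, 0.5, 0.0025
    a1, a2 = mean + 2.0 * sigma, mean + 0.5 * sigma
    exact = bivariate_gauss_pdf(a1, a2, mean, sigma, xi_a)
    approx = first_order_joint(a1, a2, mean, sigma, xi_a)
    print("relative error:", abs(approx - exact) / exact)
```

The neglected terms are of second order in $\xi_A/\sigma_A^2$, so reducing the correlation by a factor of ten should shrink the relative error by roughly a factor of one hundred.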

To simplify notation, we temporarily define a discreteness factor $\langle \Delta A^* \rangle_A$ that measures the fluctuations of $A^*$ given a particular value of $A$:

 $$\langle \Delta A^* \rangle_A = \sum_N (A^* - \overline{A^*})\, P(N|A); \tag{18}$$

here again we write $A^*$ for $A^*(N)$. Then Equation 16 becomes

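 $$\xi_{A^*}(r) = \int dA_1\, dA_2\, P(A_1)\, P(A_2)\, \langle \Delta A^* \rangle_{A_1} \langle \Delta A^* \rangle_{A_2} \left\{ 1 + \xi_A(r)\, \frac{(A_1 - \bar{A})(A_2 - \bar{A})}{\sigma_A^4} \right\} \tag{19}$$
 $$\phantom{\xi_{A^*}(r)} = \left\{ \int dA\, P(A)\, \langle \Delta A^* \rangle_A \right\}^2 + \frac{\xi_A(r)}{\sigma_A^4} \left\{ \int dA\, (A - \bar{A})\, P(A)\, \langle \Delta A^* \rangle_A \right\}^2. \tag{20}$$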

The first integral in Equation 20 must vanish, since it is simply the mean of $A^* - \overline{A^*}$ over all cells: by reference to Equation 18, this integral becomes

 $$\int dA \sum_N (A^* - \overline{A^*})\, P(A)\, P(A^*|A) = 0. \tag{21}$$

Therefore

 $$\xi_{A^*}(r) = \frac{\xi_A(r)}{\sigma_A^4} \left\{ \int dA\, (A - \bar{A})\, P(A)\, \langle \Delta A^* \rangle_A \right\}^2, \tag{22}$$

and expanding $\langle \Delta A^* \rangle_A$, we conclude that

 $$b_{A^*}^2 = \frac{1}{\sigma_A^4} \left\{ \sum_N \int dA\, (A - \bar{A})(A^* - \overline{A^*})\, P(N|A)\, P(A) \right\}^2. \tag{23}$$

Once again, note that we here write $A^*$ as an abbreviation for $A^*(N)$. The calculation of the mean $\overline{A^*}$ for use in Equation 23 is straightforward:

 $$\overline{A^*} = \sum_N A^*(N) \int dA\, P(A)\, P(N|A). \tag{24}$$
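
Equations 23 and 24 reduce to a one-dimensional quadrature over $A$ combined with a sum over $N$. The sketch below evaluates them for the simpler lognormal case, assuming a Gaussian $P(A)$ with mean $-\sigma_A^2/2$ (the choice for which the maximum of $P(A)P(N|A)$ satisfies Equation 2) and Poisson sampling. The Letter itself uses the GEV form with Equation 9; the function names, grid settings and parameter values here are illustrative assumptions.

```python
import math


def a_star_lognormal(n, nbar, sigma2):
    """Bisection solve of Equation 2 for A*(N)."""
    target = (n - 0.5) / nbar
    lo, hi = -1.0, 1.0
    while math.exp(lo) + lo / (nbar * sigma2) > target:
        lo *= 2.0
    while math.exp(hi) + hi / (nbar * sigma2) < target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.exp(mid) + mid / (nbar * sigma2) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def bias_squared(nbar, sigma_a, n_grid=401):
    """Return (b^2 from Equation 23, mean A* from Equation 24, total probability)."""
    sigma2 = sigma_a ** 2
    mean_a = -0.5 * sigma2                       # assumed lognormal mean of A
    a_lo = mean_a - 6.0 * sigma_a                # integrate over +/- 6 sigma
    h = 12.0 * sigma_a / (n_grid - 1)
    rate_max = nbar * math.exp(a_lo + 12.0 * sigma_a)
    n_max = int(rate_max + 10.0 * math.sqrt(rate_max) + 20.0)
    a_star = [a_star_lognormal(n, nbar, sigma2) for n in range(n_max + 1)]

    p_n = [0.0] * (n_max + 1)   # integral of P(A) P(N|A) dA
    c_n = [0.0] * (n_max + 1)   # integral of (A - mean_a) P(A) P(N|A) dA
    for i in range(n_grid):
        a = a_lo + i * h
        weight = h * (0.5 if i in (0, n_grid - 1) else 1.0)   # trapezoid rule
        weight *= math.exp(-0.5 * ((a - mean_a) / sigma_a) ** 2)
        weight /= sigma_a * math.sqrt(2.0 * math.pi)
        rate = nbar * math.exp(a)                # Poisson mean in Equation 1
        log_rate = math.log(rate)
        for n in range(n_max + 1):
            p = weight * math.exp(n * log_rate - rate - math.lgamma(n + 1))
            p_n[n] += p
            c_n[n] += (a - mean_a) * p

    mean_a_star = sum(s * p for s, p in zip(a_star, p_n))           # Equation 24
    cov = sum((s - mean_a_star) * c for s, c in zip(a_star, c_n))
    return (cov / sigma2) ** 2, mean_a_star, sum(p_n)               # Equation 23


if __name__ == "__main__":
    # Hypothetical inputs: sigma_A = 0.5 and two mean counts per cell.
    for nbar in (0.5, 5.0):
        print(nbar, bias_squared(nbar, 0.5))
```

The third return value is the summed joint probability, which should be close to unity and serves as a quick check that the $A$ grid and the cut-off in $N$ are adequate.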

Note also that we have written Equation 23 in a form conducive to calculating $b_{A^*}^2$ given $P(A)$ and $P(N|A)$; however, this form obscures the symmetrical roles of $A$ and $A^*$. The symmetry becomes more apparent if we define the cross-correlation

 $$\gamma_{AA^*} \equiv \frac{1}{\sigma_A \sigma_{A^*}} \left\langle (A - \bar{A})(A^* - \overline{A^*}) \right\rangle, \tag{25}$$

allowing us to write

 $$b_{A^*}^2 = \frac{\xi_{A^*}(r)}{\xi_A(r)} = \frac{\sigma_{A^*}^2}{\sigma_A^2}\, \gamma_{AA^*}^2. \tag{26}$$

If $A$ and $A^*$ were perfectly correlated, then $b_{A^*}^2$ would be the ratio of the variances. However, the cross-correlation must be imperfect because one distribution is continuous and the other discrete; at low number densities, this effect becomes most pronounced, because a large range of negative $A$-values (theoretically unbounded below) map to $N = 0$. Thus for low number densities the bias is much smaller than the ratio of the variances, but in the opposite (continuum) limit, $A^*$ approaches $A$ and the bias approaches unity.
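
Equations 25 and 26 also lend themselves to a direct Monte Carlo estimate from paired samples of $A$ and $A^*$. The sketch below does this for the lognormal case (Gaussian $A$ with mean $-\sigma_A^2/2$, Poisson counts, $A^*$ from Equation 2); it is an assumed toy setup with hypothetical function names and parameter values, not the procedure applied to the Millennium Simulation in Section 4.

```python
import math
import random


def a_star_lognormal(n, nbar, sigma2):
    """Bisection solve of Equation 2 for A*(N)."""
    target = (n - 0.5) / nbar
    lo, hi = -1.0, 1.0
    while math.exp(lo) + lo / (nbar * sigma2) > target:
        lo *= 2.0
    while math.exp(hi) + hi / (nbar * sigma2) < target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.exp(mid) + mid / (nbar * sigma2) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson_sample(rate, rng):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    limit = math.exp(-rate)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def monte_carlo_bias(nbar, sigma_a, n_samples=20000, seed=1):
    """Return (b^2 via Equation 26, gamma via Equation 25, Cov(A, A*)/Var(A))."""
    rng = random.Random(seed)
    sigma2 = sigma_a ** 2
    mean_a = -0.5 * sigma2
    lookup = {}                                   # tabulated A*(N)
    a_vals, s_vals = [], []
    for _ in range(n_samples):
        a = rng.gauss(mean_a, sigma_a)
        n = poisson_sample(nbar * math.exp(a), rng)
        if n not in lookup:
            lookup[n] = a_star_lognormal(n, nbar, sigma2)
        a_vals.append(a)
        s_vals.append(lookup[n])
    m_a = sum(a_vals) / n_samples
    m_s = sum(s_vals) / n_samples
    var_a = sum((a - m_a) ** 2 for a in a_vals) / n_samples
    var_s = sum((s - m_s) ** 2 for s in s_vals) / n_samples
    cov = sum((a - m_a) * (s - m_s) for a, s in zip(a_vals, s_vals)) / n_samples
    gamma = cov / math.sqrt(var_a * var_s)        # Equation 25
    return var_s / var_a * gamma ** 2, gamma, cov / var_a


if __name__ == "__main__":
    # Hypothetical inputs: one object per cell on average, sigma_A = 0.5.
    print(monte_carlo_bias(nbar=1.0, sigma_a=0.5))
```

The third return value is the covariance of $A$ and $A^*$ divided by the variance of $A$; its square coincides with the Equation 26 estimate, which makes the equivalence with the form of Equation 23 explicit.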

4 Accuracy and Limits

To test the accuracy of Equation 23, we obtain log dark matter densities from the Millennium Simulation (Springel et al. 2005). From these dark matter fields we generate multiple Poisson realizations for various mean pixel number densities $\bar{N}$. Given the near-GEV distribution of $A$ (Repp & Szapudi 2017a), we use Equation 9 to calculate the corresponding $A^*$-field, the power spectrum of which we then measure. Fig. 2 compares $b_{A^*}^2 P_A(k)$ (calculated with Equation 23) to the actual $A^*$ power spectrum for three values of $\bar{N}$; clearly Equation 23 provides a reasonable estimate of the actual $A^*$-bias as long as $\bar{N}$ is not overly small.

We thus turn to more extensive quantification of the accuracy of this equation; we also compare its prescription with that of Wolk et al. (2015), who give

 $$b_{A^*}^2 = \left( 1 + \frac{1}{\bar{N}\sigma_A^2} \right)^{-2} \tag{27}$$

for a Poisson-sampled lognormal field. (Strictly speaking, Equation 27 applies to projected two-dimensional data, for which the lognormality assumption is more warranted.) We note that in the limit of high number densities (in particular, when so that approaches ), the bias in both Equations 23 and 27 approaches unity.
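
Equation 27 is simple enough to evaluate directly. The short sketch below (the function name and the input values are assumptions) shows how the Wolk et al. (2015) bias rises toward unity as $\bar{N}\sigma_A^2$ grows.

```python
def wolk_bias_squared(nbar, sigma2_a):
    """Equation 27: squared bias for a Poisson-sampled lognormal field."""
    return (1.0 + 1.0 / (nbar * sigma2_a)) ** -2


if __name__ == "__main__":
    # Hypothetical variance sigma_A^2 = 0.25 and a range of mean counts per cell.
    for nbar in (0.1, 1.0, 10.0, 100.0, 1.0e6):
        print(nbar, wolk_bias_squared(nbar, 0.25))
```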

We begin with the matter densities of the Millennium Simulation at three redshifts (, 1.0, and 2.1). We also specify a variety of (initial) average pixel number densities $\bar{N}$ ranging from 0.01 to 30 objects per pixel; since the Millennium Simulation pixels measure Mpc on each side, these values yield volume number densities from 0.0013 to 4.0 Mpc$^{-3}$, the lowest value corresponding roughly to the estimated Euclid number density (Laureijs et al. 2011). For each combination of redshift and initial number density, we then generate mock galaxy catalogs by Poisson sampling the dark matter field at the specified $\bar{N}$-values. Next, we rebin each of these catalogs to pixels with side length twice, four times, and eight times that of the original. In this manner we generate a range of mock catalogs with pixel side lengths ranging from 1.95 to 15.6 Mpc and pixel number densities from 0.01 to .
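
The bookkeeping between pixel and volume number densities, and the effect of rebinning, can be illustrated with a few lines of arithmetic. The sketch below uses the 1.95 Mpc smallest pixel side quoted above; the helper names and the tiny uniform test grid are assumptions made purely for illustration.

```python
def volume_density(pixel_density, side):
    """Objects per unit volume for a mean count per cubical pixel of given side."""
    return pixel_density / side ** 3


def rebinned_pixel_density(pixel_density, factor):
    """Mean count per pixel after merging factor^3 original pixels into one."""
    return pixel_density * factor ** 3


def rebin(counts, factor):
    """Sum a cubical 3D nested list of counts in blocks of factor^3 cells."""
    n = len(counts)
    m = n // factor
    out = [[[0 for _ in range(m)] for _ in range(m)] for _ in range(m)]
    for i in range(m * factor):
        for j in range(m * factor):
            for k in range(m * factor):
                out[i // factor][j // factor][k // factor] += counts[i][j][k]
    return out


if __name__ == "__main__":
    side = 1.95                                   # Mpc, smallest pixel side in the text
    print(volume_density(0.01, side), volume_density(30.0, side))
    print(rebinned_pixel_density(0.01, 8))        # pixels eight times larger per side
    grid = [[[1 for _ in range(4)] for _ in range(4)] for _ in range(4)]
    print(rebin(grid, 2))
```

With these numbers, the lowest initial density of 0.01 objects per pixel becomes $0.01 \times 8^3 = 5.12$ objects per pixel after rebinning to the largest cells.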

For each catalog we then use Equation 9 to calculate $A^*$ for each pixel and obtain the $A^*$-power spectra to compare with the known $P_A(k)$. To model the discreteness plateau, we use a Monte Carlo approach, populating the mock survey volume with a Poisson distribution having the requisite $\bar{N}$; for this purely Poisson distribution we then calculate $A^*$ and its power spectrum, the (constant) value of which approximates the height of the plateau. The bias in each mock catalog is then the average value of , where we restrict the average to linear wavenumbers ( Mpc). In this way we attempt to avoid the intermediate- to high-wavenumber effects discussed in Section 2 and shown in Fig. 1 (namely, the discreteness plateau and shape change). Following this procedure, we generate enough mock catalogs (for each redshift and number density) to reduce the uncertainty of the measured bias to less than 0.5 per cent. When applying Equation 23 we use the Repp & Szapudi (2017a) prescription for the moments of $A$. The results appear in Figure 3, which compares the measured biases to those predicted by Equations 23 and 27.

The prescriptions’ accuracy increases as we approach the continuum limit; in this limit , and both Equations 23 and 27 approach unity. At number densities particles per pixel, the error in Equation 27 begins to exceed 5 per cent. At the other extreme, the lowest number density we need realistically consider is , at which point shot noise overwhelms any power spectrum signal. Even down to this number density, Equation 23 retains its accuracy (in most cases) at the 5 per cent level, although the error rises to 6 per cent for in cells of side length 1.95 Mpc. The experiment most closely corresponding to Euclid (highlighted by boxes in Fig. 3) is the case of volume number density Mpc, measured in cells of side length 15.6 Mpc; the resultant pixel number density is , with an error in Equation 23 of less than 3 per cent.

5 Conclusion

Cosmological surveys probing small scales contain significant information which the standard power spectrum $P(k)$ does not access. To retrieve this information, one must instead analyze a sufficient statistic (optimal observable), whose power spectrum and mean together capture virtually all cosmological information in the field.

The log overdensity $A$ is essentially optimal for continuous cosmological fields (Carron & Szapudi 2013), but for the discrete fields probed by galaxy surveys, the sufficient statistic is $A^*$ rather than $A$ (Carron & Szapudi 2014). One must therefore characterize the power spectrum $P_{A^*}(k)$ to enable maximal utilization of galaxy surveys’ information. This power spectrum differs from that of $A$ in several respects, the most important of which is the multiplicative bias (Wolk et al. 2015).

In this Letter we have characterized this bias for discrete surveys (Equation 23) in terms of the underlying log distribution and the discretization scheme. By generating multiple discrete realizations of the Millennium Simulation dark matter distribution, we have shown the typical error of this prediction to be less than 5 per cent for pixel number densities per cell. For survey densities comparable to that of Euclid, the error is less than 3 per cent for survey cells of side length Mpc. This prediction is the most significant portion of the task of characterizing the $A^*$-power spectrum.

Pixels of side length Mpc correspond to a maximum wavenumber Mpc, and Figure 1 indicates that begins to depart from around this point. Even analysis restricted to this scale represents an 8-fold information gain compared to the linear regime. A hypothetical -galaxy survey, on the other hand, might achieve number densities of Mpc, permitting analysis at much higher wavenumbers. For such a survey, proper treatment of the intermediate- and high- effects (shape change and discreteness plateau) would be important.

In future work we plan to characterize this high-$k$ discreteness plateau as well as the related shape change; the result will be a complete characterization of the $A^*$-power spectrum. Other work will include a proper accounting for the effects on the log power spectrum of galaxy bias and redshift-space distortion. Preliminary investigation suggests that the log transform renders both of these effects more tractable than they would otherwise be.

In summary, $A^*$ is a sufficient statistic for galaxy surveys and thus captures essentially all cosmological information inherent in such a survey. To access this information, one must be able to predict the power spectrum $P_{A^*}(k)$ for various sets of cosmological parameters. This Letter provides a key component of this prediction, which will facilitate extraction of maximal information from dense galaxy surveys.

Acknowledgements

The Millennium Simulation data bases used in this Letter and the web application providing online access to them were constructed as part of the activities of the German Astrophysical Virtual Observatory (GAVO). IS acknowledges support from National Science Foundation (NSF) award 1616974.


References

1. Carron J., Szapudi I., 2013, MNRAS, 434, 2961
2. Carron J., Szapudi I., 2014, MNRAS, 439, L11
3. Carron J., Wolk M., Szapudi I., 2015, MNRAS, 453, 450
4. Klypin A., Prada F., Betancort-Rijo J., Albareti F. D., 2017, preprint, (arXiv:1706.01909)
5. Laureijs R., et al., 2011, preprint, (arXiv:1110.3193)
6. Neyrinck M. C., Szapudi I., Szalay A. S., 2009, ApJ, 698, L90
7. Repp A., Szapudi I., 2017a, preprint, (arXiv:1705.08015)
8. Repp A., Szapudi I., 2017b, MNRAS, 464, L21
9. Shin J., Kim J., Pichon C., Jeong D., Park C., 2017, preprint, (arXiv:1705.06863)
10. Springel V., White S. D. M., Jenkins A., Frenk C. S., Yoshida N., Gao L., Navarro J., et al., 2005, Nature, 435, 629
11. Uhlemann C., Codis S., Pichon C., Bernardeau F., Reimberg P., 2016, MNRAS, 460, 1529
12. Wolk M., Carron J., Szapudi I., 2015, MNRAS, 454, 560