The Bias of the Log Power Spectrum for Discrete Surveys


A primary goal of galaxy surveys is to tighten constraints on cosmological parameters, and the power spectrum $P(k)$ is the standard means of doing so. However, at translinear scales $P(k)$ is blind to much of these surveys’ information, information which the log density power spectrum recovers. For discrete fields (such as the galaxy density), $A^*$ denotes the statistic analogous to the log density: $A^*$ is a ‘sufficient statistic’ in that its power spectrum (and mean) capture virtually all of a discrete survey’s information. However, the power spectrum of $A^*$ is biased with respect to the corresponding log spectrum for continuous fields, and to use $P_{A^*}(k)$ to constrain the values of cosmological parameters, we require some means of predicting this bias. Here we present a prescription for doing so; for Euclid-like surveys (with cubical cells $16\,h^{-1}$ Mpc across) our bias prescription’s error is less than 3 per cent. This prediction will facilitate optimal utilization of the information in future galaxy surveys.

cosmology: theory – cosmological parameters – cosmology: miscellaneous

1 Introduction

Cosmology seeks to determine the precise values of cosmological parameters, and this endeavor demands full utilization of the information in galaxy surveys. The power spectrum $P(k)$ of the overdensity field $\delta$ is the standard summary statistic for this purpose. However, pushing surveys to increasingly small scales does not proportionately increase the Fisher information in $P(k)$ (Carron et al. 2015), since much of the survey’s information escapes from $P(k)$ at translinear wavenumbers.

To recover this information, Carron & Szapudi (2013) introduce the concept of sufficient statistics, namely, observables which capture all of a field’s information. They show that the log density $A = \ln(1+\delta)$ closely approximates a sufficient statistic, so that its power spectrum $P_A(k)$ (Neyrinck et al. 2009) and mean $\langle A \rangle$ embody essentially all of the survey’s information. Repp & Szapudi (2017b) provide a simple fit for $P_A(k)$, and Repp & Szapudi (2017a) provide a related prescription for $\langle A \rangle$; they also show that a Generalized Extreme Value (GEV) distribution describes the distribution $P(A)$ well. Other characterizations of the continuous density field include those of Uhlemann et al. (2016), Shin et al. (2017), and Klypin et al. (2017).

However, the $A$-statistic does not directly apply to discrete fields like the galaxy density, and thus Carron & Szapudi (2014) investigate $A^*$, the optimal observable for discrete fields. Wolk et al. (2015) show that the power spectrum $P_{A^*}(k)$ of this statistic is biased with respect to the (continuous) log spectrum $P_A(k)$. Since our ultimate purpose is to use $P_{A^*}(k)$ to constrain the values of cosmological parameters, we require a means of predicting this power spectrum for any reasonable parameter set, and thus we require an accurate description of this bias.

This Letter presents an a priori means of predicting this $A^*$-bias for surveys. Using multiple realizations of the Millennium Simulation dark matter distribution (Springel et al. 2005), we show that our prescription is accurate (within 5–6 per cent) for pixel number densities $\overline{N} \gtrsim 1$. For Euclid-like surveys, the accuracy is better than 3 per cent for survey cells with side length $16\,h^{-1}$ Mpc.

The structure of this Letter is as follows: Section 2 characterizes the sufficient statistic $A^*$, and Section 3 provides our bias prescription; we quantify its accuracy in Section 4 and summarize in Section 5.

2 The Discrete Sufficient Statistic

Galaxy counts represent a discrete sampling of the underlying dark matter distribution. Constructing an optimal observable for such fields requires knowledge of two distributions. First, one must characterize the underlying continuous probability distribution $P(\delta)$ or, in log space, $P(A)$. Second, one must specify the conditional probability of observing $N$ objects in a survey cell given an underlying value of $\delta$ (or $A$): that is, one requires an expression for $P(N|A)$. The most widespread of such schemes is local Poisson sampling, where for a given dark matter log density $A$,

$$ P(N|A) = \frac{1}{N!}\left(\overline{N} e^{A}\right)^{N} \exp\left(-\overline{N} e^{A}\right), \tag{1} $$

$\overline{N}$ being the mean number of galaxies per survey cell. Following Carron & Szapudi (2014), we define $A^*$ as the value of $A$ that maximizes $P(A)\,P(N|A)$, so that for an observed number $N$ of objects in a cell, $A^*$ is the Bayesian reconstruction of the dark matter log density in that survey pixel. Carron & Szapudi (2014) show that $A^*$ closely approximates a sufficient statistic for galaxy surveys, and thus the power spectrum $P_{A^*}(k)$ of the $A^*$-field (together with its mean $\langle A^* \rangle$) contains essentially all cosmological information present in the set of galaxy counts.

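Carron & Szapudi (2014) specifically consider the case of a lognormal matter distribution with local Poisson sampling. They show that in this case, $A^*$ is the solution of the equation

$$ A^* + \overline{N}\sigma_A^2\, e^{A^*} = \sigma_A^2\left(N - \frac{1}{2}\right). \tag{2} $$
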
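Although the lognormal distribution is a good description of the projected (two-dimensional) matter distribution, for the three-dimensional field a Generalized Extreme Value (GEV) distribution (Repp & Szapudi 2017a) fits better:

$$ P(A) = \frac{1}{\sigma_G}\, t(A)^{1+\xi_G}\, e^{-t(A)}, \tag{3} $$

where

$$ t(A) = \left(1 + \frac{A-\mu_G}{\sigma_G}\,\xi_G\right)^{-1/\xi_G}. \tag{4} $$

Here, $\mu_G$, $\sigma_G$, and $\xi_G$ are, respectively, location, scale, and shape parameters which depend on the mean $\langle A \rangle$, variance $\sigma_A^2$, and skewness $\gamma_1$ of $A$ as follows. Writing $g_n = \Gamma(1 - n\xi_G)$, the moments of the distribution in Equation 3 are

$$ \langle A \rangle = \mu_G + \sigma_G\,\frac{g_1 - 1}{\xi_G}, \tag{5} $$

$$ \sigma_A^2 = \sigma_G^2\,\frac{g_2 - g_1^2}{\xi_G^2}, \tag{6} $$

$$ \gamma_1 = \mathrm{sgn}(\xi_G)\,\frac{g_3 - 3 g_1 g_2 + 2 g_1^3}{\left(g_2 - g_1^2\right)^{3/2}}, \tag{7} $$

which one inverts to obtain the three GEV parameters.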
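For Poisson sampling of a GEV distribution, we take Equations 1 and 3 as $P(N|A)$ and $P(A)$. To calculate $A^*$ we can thus maximize the expression

$$ \ln\left[P(A)\,P(N|A)\right] = (1+\xi_G)\ln t(A) - t(A) + NA - \overline{N} e^{A} + \mathrm{const.}; \tag{8} $$

requiring the derivative with respect to $A$ to vanish, we obtain the following equation for $A^*$ in the GEV case:

$$ \frac{1}{\sigma_G}\,\frac{t(A^*) - (1+\xi_G)}{1 + \xi_G\,(A^*-\mu_G)/\sigma_G} + N - \overline{N} e^{A^*} = 0. \tag{9} $$

It is Equation 9 which we use, together with the fits in Repp & Szapudi (2017a), to calculate $A^*$ throughout this Letter; nevertheless, the bias formulas we derive in Section 3 are independent of the particular choice of $P(A)$ or $P(N|A)$.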
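As a minimal sketch of how $A^*$ might be computed in practice, the code below solves the lognormal condition $A^* + \overline{N}\sigma_A^2 e^{A^*} = \sigma_A^2(N - 1/2)$ by Newton's method and the GEV condition by bisection on the derivative of $\ln[P(A)P(N|A)]$. It assumes the standard GEV form with $-1 < \xi_G < 0$ (support bounded above). The function names and the parameter values in the usage note are hypothetical illustrations, not the fitted values of Repp & Szapudi (2017a).

```python
import math

def a_star_lognormal(n, nbar, var_a, tol=1e-12):
    """Solve A + nbar*var_a*exp(A) = var_a*(n - 1/2) by Newton's method.

    The left-hand side is increasing and convex in A, so the root is unique.
    """
    a = math.log((n + 0.5) / nbar)  # starting guess near ln(N / nbar)
    for _ in range(200):
        e = nbar * var_a * math.exp(a)
        step = (a + e - var_a * (n - 0.5)) / (1.0 + e)
        a -= step
        if abs(step) < tol:
            break
    return a

def gev_residual(a, n, nbar, mu, sigma, xi):
    """Derivative with respect to A of ln[P_GEV(A) * P_Poisson(N|A)]."""
    s = 1.0 + xi * (a - mu) / sigma          # must be positive inside the support
    t = s ** (-1.0 / xi)
    return (t - 1.0 - xi) / (sigma * s) + n - nbar * math.exp(a)

def a_star_gev(n, nbar, mu, sigma, xi, iters=200):
    """Bisection on gev_residual; assumes -1 < xi < 0 (upper bound mu - sigma/xi)."""
    lo = mu - 50.0 * sigma                       # far in the low-density tail
    hi = mu - (sigma / xi) * (1.0 - 1e-9)        # just inside the upper bound
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gev_residual(mid, n, nbar, mu, sigma, xi) > 0.0:
            lo = mid   # posterior still rising: the maximum lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For instance, `a_star_gev(3, 1.0, -0.3, 0.8, -0.15)` returns the reconstructed log density for a cell containing three objects under these illustrative parameters. Because the map from $N$ to $A^*$ depends only on $\overline{N}$ and the distribution parameters, it can be tabulated once per catalog.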

Figure 1: The power spectra of $A$ (dashed) and of $A^*$ (solid) for a Poisson sampling of the Millennium Simulation. The overall bias (amplitude shift) of $P_{A^*}(k)$ is evident, as is the high-$k$ discreteness flattening. The dotted curve shows that simply applying a bias and an additive constant to $P_A(k)$ does not fully reproduce the shape of $P_{A^*}(k)$ toward higher wavenumbers.

Since our ultimate purpose is to use the power spectrum of $A^*$ to constrain the values of cosmological parameters, we require a means of predicting this power spectrum for any reasonable parameter set. Repp & Szapudi (2017b) show how to predict $P_A(k)$, the power spectrum of the continuous log density field, to an accuracy of a few per cent. However, Fig. 1 shows that passage from $P_A(k)$ to $P_{A^*}(k)$ introduces three effects. The most salient of these is the bias of $P_{A^*}(k)$ as a whole (Wolk et al. 2015), evident in the figure as a vertical shift. Also apparent is a discreteness plateau at high wavenumbers, analogous to the shot noise term in $P(k)$ (though the analogy is inexact due to the nonlinearity of the log- and $A^*$-transforms). Finally, there is a slight change of shape at intermediate wavenumbers, such that multiplying $P_A(k)$ by the bias and then adding a constant does not completely reproduce the shape of $P_{A^*}(k)$.

The bias is the factor most important for parameter constraint and is thus the focus of this Letter. Hence, ignoring high-$k$ effects, we write

$$ P_{A^*}(k) = b_{A^*}^2\, P_A(k). \tag{10} $$

Note that by convention the bias factor $b_{A^*}$ applies to the underlying fields ($A^*$ and $A$), so it is $b_{A^*}^2$ which multiplies the power spectrum. Note also that Wolk et al. (2015) use the same symbol to refer to the reciprocal of the quantity so denoted in Equation 10.

3 Predicting the Bias of $A^*$

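Passing from Fourier to real space, Equation 10 implies that

$$ \xi_{A^*}(r) = b_{A^*}^2\, \xi_A(r), \tag{11} $$

where $\xi(r)$ is the two-point correlation function. Now let $P(A_1, A_2)$ be the joint probability distribution function for two points separated by a distance $r$. Then

$$ \xi_A(r) = \int dA_1\, dA_2\, \left(A_1 - \langle A \rangle\right)\left(A_2 - \langle A \rangle\right) P(A_1, A_2), \tag{12} $$

where $\langle A \rangle$ denotes the mean value of $A$. To write the analogous expression for $A^*$, we begin with the joint $A^*$-distribution:

$$ P(A^*_1, A^*_2) = \int dA_1\, dA_2\, P(A^*_1|A_1)\, P(A^*_2|A_2)\, P(A_1, A_2). \tag{13} $$

Given a distribution $P(A)$ and a mean number density $\overline{N}$, every natural number $N$ corresponds to a discrete value $A^*$ obtainable from Equation 9. We may thus slightly abuse the notation and, in the integrand of Equation 13, write $N$ instead of the corresponding $A^*$:

$$ P(N_1, N_2) = \int dA_1\, dA_2\, \mathcal{P}_1\, \mathcal{P}_2\, P(A_1, A_2), $$

where $\mathcal{P}_1 = P(N_1|A_1)$ and $\mathcal{P}_2 = P(N_2|A_2)$. Therefore the expression analogous to Equation 12 (summing over the discrete variable $N$ rather than integrating) is

$$ \xi_{A^*}(r) = \sum_{N_1, N_2} \left(A^*_1 - \langle A^* \rangle\right)\left(A^*_2 - \langle A^* \rangle\right) \int dA_1\, dA_2\, \mathcal{P}_1\, \mathcal{P}_2\, P(A_1, A_2), \tag{16} $$

where (as throughout this section) $A^*_1$ and $A^*_2$ are the values of $A^*$ corresponding to $N_1$ and $N_2$, respectively.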

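We now require an expression for $P(A_1, A_2)$. If the correlation of $A_1$ and $A_2$ is weak (as expected on large scales), we can work to first order in $\xi_A(r)$. For a bivariate Gaussian distribution, expansion to this order yields

$$ P(A_1, A_2) = P(A_1)\, P(A_2) \left[1 + \left(A_1 - \langle A \rangle\right)\left(A_2 - \langle A \rangle\right) \frac{\xi_A(r)}{\sigma_A^4}\right]. \tag{17} $$

Equation 17 is our ansatz for the joint distribution of $A$ in the weak correlation limit. In other words, we assume that the distribution has the (joint) behavior of a Gaussian to first order in $\xi_A(r)$.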
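The quality of this first-order expansion can be checked numerically. The sketch below compares the exact bivariate Gaussian density (zero mean, equal variances, covariance $\xi$) with the factorized form $P(x_1)P(x_2)\,[1 + x_1 x_2\,\xi/\sigma^4]$; the function names and test values are illustrative assumptions, and the Gaussian case is used only because it is the one for which the expansion is derived.

```python
import math

def bivariate_gaussian(x1, x2, var, xi):
    """Exact zero-mean bivariate Gaussian with equal variances `var` and covariance `xi`."""
    rho = xi / var
    norm = 2.0 * math.pi * var * math.sqrt(1.0 - rho * rho)
    q = (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / (2.0 * var * (1.0 - rho * rho))
    return math.exp(-q) / norm

def first_order_expansion(x1, x2, var, xi):
    """Product of the one-point Gaussians times [1 + x1*x2*xi/var^2]."""
    p1 = math.exp(-x1 * x1 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    p2 = math.exp(-x2 * x2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return p1 * p2 * (1.0 + x1 * x2 * xi / var ** 2)

def relative_error(x1, x2, var, xi):
    """Fractional difference between the expansion and the exact density."""
    exact = bivariate_gaussian(x1, x2, var, xi)
    return abs(first_order_expansion(x1, x2, var, xi) - exact) / exact
```

Here `x1` and `x2` stand for the deviations $A_1 - \langle A \rangle$ and $A_2 - \langle A \rangle$. The residual is second order in $\xi/\sigma^2$, so it shrinks rapidly as the correlation weakens, which is the regime in which the ansatz is applied.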

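To simplify notation, we temporarily define a discreteness factor $D(A)$ that measures the fluctuations of $A^*$ given a particular value of $A$:

$$ D(A) \equiv \sum_N \left(A^* - \langle A^* \rangle\right) P(N|A); \tag{18} $$

here again we write $A^*$ for $A^*(N)$. Then Equation 16 becomes

$$ \xi_{A^*}(r) = \left[\int dA\, P(A)\, D(A)\right]^2 + \frac{\xi_A(r)}{\sigma_A^4} \left[\int dA\, \left(A - \langle A \rangle\right) P(A)\, D(A)\right]^2. \tag{20} $$

The first integral in Equation 20 must vanish: by reference to Equation 18, this integral becomes

$$ \int dA\, P(A)\, D(A) = \sum_N \left(A^* - \langle A^* \rangle\right) \int dA\, P(N|A)\, P(A) = \sum_N \left(A^* - \langle A^* \rangle\right) P(N) = 0. $$

Comparing the remaining term with Equation 11 and expanding $D(A)$, we conclude that

$$ b_{A^*}^2 = \frac{1}{\sigma_A^4} \left[\sum_N \left(A^* - \langle A^* \rangle\right) \int dA\, \left(A - \langle A \rangle\right) P(N|A)\, P(A)\right]^2. \tag{23} $$

Once again note that we here write $A^*$ as an abbreviation for $A^*(N)$. The calculation of the mean $\langle A^* \rangle$ for use in Equation 23 is straightforward:

$$ \langle A^* \rangle = \sum_N A^*(N)\, P(N) = \sum_N A^*(N) \int dA\, P(N|A)\, P(A). $$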
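Note also that we have written Equation 23 in a form conducive to calculating $b_{A^*}^2$ given $P(A)$ and $P(N|A)$; however, this form obscures the symmetrical roles of $A$ and $A^*$. The symmetry becomes more apparent if we define the cross-correlation

$$ \sigma_{AA^*}^2 \equiv \left\langle \left(A - \langle A \rangle\right)\left(A^* - \langle A^* \rangle\right) \right\rangle = \sum_N \int dA\, \left(A - \langle A \rangle\right)\left(A^* - \langle A^* \rangle\right) P(N|A)\, P(A), $$

allowing us to write

$$ b_{A^*}^2 = \frac{\sigma_{AA^*}^4}{\sigma_A^4}. $$

If $A$ and $A^*$ were perfectly correlated (so that $\sigma_{AA^*}^2 = \sigma_A \sigma_{A^*}$), then $b_{A^*}^2$ would be the ratio of the variances, $\sigma_{A^*}^2/\sigma_A^2$. However, the cross-correlation must be imperfect because one distribution is continuous and the other discrete; at low number densities, this effect becomes most pronounced, because a large range of negative $A$-values (theoretically unbounded below) map to $N = 0$. Thus for low number densities the bias is much smaller than the ratio of the variances, but in the opposite (continuum) limit, $A^*$ approaches $A$ and the bias approaches unity.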
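The bias prescription of this section, $b_{A^*}^2 = \sigma_A^{-4}\,[\sum_N (A^* - \langle A^* \rangle) \int dA\,(A - \langle A \rangle) P(N|A) P(A)]^2$, reduces to a one-dimensional quadrature and a sum over $N$. The sketch below evaluates it under a simplifying assumption: a lognormal field (Gaussian $P(A)$ with $\langle A \rangle = -\sigma_A^2/2$) with local Poisson sampling, for which $A^*$ solves $A^* + \overline{N}\sigma_A^2 e^{A^*} = \sigma_A^2(N - 1/2)$. This is not the GEV setup used for the results in this Letter, and the function names, grid size, and truncation rule are illustrative choices.

```python
import math

def a_star(n, nbar, var_a):
    """Lognormal-case A*(N): Newton solve of A + nbar*var_a*e^A = var_a*(N - 1/2)."""
    a = math.log((n + 0.5) / nbar)
    for _ in range(200):
        e = nbar * var_a * math.exp(a)
        step = (a + e - var_a * (n - 0.5)) / (1.0 + e)
        a -= step
        if abs(step) < 1e-13:
            break
    return a

def poisson_pmf(n, lam):
    """Poisson probability of n counts given rate lam, computed in log space."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def bias_squared(nbar, var_a, n_grid=201, n_sigma=6.0):
    """Weak-correlation bias b^2 for a Poisson-sampled lognormal field (assumed model)."""
    sigma = math.sqrt(var_a)
    mean_a = -0.5 * var_a                      # ensures <exp(A)> = 1
    da = 2.0 * n_sigma * sigma / (n_grid - 1)
    grid = [mean_a - n_sigma * sigma + i * da for i in range(n_grid)]
    weights = [math.exp(-(a - mean_a) ** 2 / (2.0 * var_a)) for a in grid]
    total = sum(weights)
    weights = [w / total for w in weights]     # discrete stand-in for P(A) dA
    lam_max = nbar * math.exp(grid[-1])
    n_max = int(lam_max + 10.0 * math.sqrt(lam_max) + 20)   # truncation of the N sum
    a_stars = [a_star(n, nbar, var_a) for n in range(n_max + 1)]
    pmf = [[poisson_pmf(n, nbar * math.exp(a)) for n in range(n_max + 1)] for a in grid]
    # Mean of A*: sum over N of A*(N) P(N), with P(N) the integral of P(N|A) P(A)
    mean_a_star = sum(w * sum(p * s for p, s in zip(row, a_stars))
                      for w, row in zip(weights, pmf))
    # Cross-correlation <(A - <A>)(A* - <A*>)>
    cross = sum(w * (a - mean_a) * sum(p * (s - mean_a_star) for p, s in zip(row, a_stars))
                for w, a, row in zip(weights, grid, pmf))
    return (cross / var_a) ** 2
```

Calling `bias_squared(0.5, 0.25)` and `bias_squared(20.0, 0.25)` illustrates the trend described above: the bias is small for sparse sampling and grows toward unity as the number density increases. Replacing the Gaussian weights and `a_star` with their GEV counterparts would leave the structure of the calculation unchanged.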

4 Accuracy and Limits

Figure 2: Comparison, for three $\overline{N}$-values, of measured $P_{A^*}(k)$ spectra (solid curves) with the spectra (dashed curves) predicted by Equation 23. The colored curves represent the mean of multiple independent Poisson samplings of the Millennium Simulation dark matter field.

To test the accuracy of Equation 23, we obtain log dark matter densities $A$ from the Millennium Simulation (Springel et al. 2005). From these dark matter fields we generate multiple Poisson realizations for various mean pixel number densities $\overline{N}$. Given the near-GEV distribution of $A$ (Repp & Szapudi 2017a), we use Equation 9 to calculate the corresponding $A^*$-field, the power spectrum $P_{A^*}(k)$ of which we then measure. Fig. 2 compares $b_{A^*}^2 P_A(k)$ (calculated with Equation 23) to the actual power spectrum $P_{A^*}(k)$ for three values of $\overline{N}$; clearly Equation 23 provides a reasonable estimate of the actual $A^*$-bias as long as $\overline{N}$ is not overly small.
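The Poisson realizations described here can be sketched as follows, assuming the log densities are available as a flat list and that each cell is sampled independently with rate $\overline{N} e^{A}$. The sampler uses Knuth's multiplication method, which is adequate only for modest rates; the function names and seed handling are illustrative assumptions rather than the procedure actually used.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson variate with rate lam (Knuth's method; suited to small lam)."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poisson_sample_field(log_density, nbar, seed=0):
    """Return one discrete realization: counts N drawn with rate nbar * exp(A) per cell."""
    rng = random.Random(seed)
    return [poisson_draw(nbar * math.exp(a), rng) for a in log_density]
```

Calling `poisson_sample_field(field, 2.0, seed=s)` with different seeds `s` yields independent realizations of the same underlying field, whose measured spectra can then be averaged.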

We thus turn to more extensive quantification of the accuracy of this equation; we also compare its prescription with that of Wolk et al. (2015), who give


for a Poisson-sampled lognormal field. (Strictly speaking, Equation 27 applies to projected two-dimensional data, for which the lognormality assumption is more warranted.) We note that in the limit of high number densities, where $A^*$ approaches $A$, the bias in both Equations 23 and 27 approaches unity.

We begin with the matter densities of the Millennium Simulation at three redshifts (the two highest being $z = 1.0$ and 2.1). We also specify a variety of (initial) average pixel number densities $\overline{N}$ ranging from 0.01 to 30 objects per pixel; since the Millennium Simulation pixels measure $1.95\,h^{-1}$ Mpc on each side, these values yield volume number densities from 0.0013 to $4.0\,h^{3}$ Mpc$^{-3}$, the lowest value corresponding roughly to the estimated Euclid number density (Laureijs et al. 2011). For each combination of redshift and initial number density, we then generate mock galaxy catalogs by Poisson sampling the dark matter field at the specified $\overline{N}$-values. Next, we rebin each of these catalogs to pixels with side length twice, four times, and eight times that of the original. In this manner we generate a range of mock catalogs with pixel side lengths ranging from 1.95 to $15.6\,h^{-1}$ Mpc and pixel number densities from 0.01 to $30 \times 8^3 \approx 1.5 \times 10^4$.
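The number densities quoted in this paragraph follow from simple arithmetic: the volume density is the pixel density divided by the cell volume, and rebinning by a factor $f$ in side length multiplies the pixel density by $f^3$. The short sketch below tabulates these values; the function name is hypothetical and the base cell size of 1.95 is taken from the text.

```python
def rebinned_densities(nbar_initial, base_cell=1.95, factors=(1, 2, 4, 8)):
    """Return (side length, pixel number density, volume number density) per rebinning factor."""
    volume_density = nbar_initial / base_cell ** 3   # objects per unit volume, unchanged by rebinning
    return [(base_cell * f, nbar_initial * f ** 3, volume_density) for f in factors]

for nbar in (0.01, 30.0):
    for side, pixel_density, volume_density in rebinned_densities(nbar):
        print(f"side {side:5.2f}  pixel density {pixel_density:10.2f}  "
              f"volume density {volume_density:.4f}")
```

The sparsest catalog, with 0.01 objects per original pixel, thus reaches $0.01 \times 8^3 = 5.12$ objects per cell at a side length of 15.6, while its volume density stays fixed at roughly 0.0013.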

For each catalog we then use Equation 9 to calculate $A^*$ for each pixel and obtain the $A^*$-power spectra $P_{A^*}(k)$ to compare with the known $P_A(k)$. To model the discreteness plateau, we use a Monte Carlo approach, populating the mock survey volume with a Poisson distribution having the requisite $\overline{N}$; for this purely Poisson distribution we then calculate $A^*$ and its power spectrum, the (constant) value of which approximates the height of the plateau. The bias in each mock catalog is then the average value of the ratio of $P_{A^*}(k)$, corrected for this plateau, to $P_A(k)$, where we restrict the average to linear wavenumbers. In this way we attempt to avoid the intermediate- to high-wavenumber effects discussed in Section 2 and shown in Fig. 1 (namely, the discreteness plateau and shape change). Following this procedure, we generate enough mock catalogs (for each redshift and number density) to reduce the uncertainty of the measured bias to less than 0.5 per cent. When applying Equation 23 we use the Repp & Szapudi (2017a) prescription for the moments of $A$. The results appear in Figure 3, which compares the measured biases to those predicted by Equations 23 and 27.
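One way to turn measured spectra into a bias estimate along these lines is sketched below. It assumes that the plateau height is subtracted from the discrete spectrum before taking the ratio and that the average runs over all bins below a chosen cutoff; both the subtraction and the cutoff value are assumptions made for illustration, as is the function name.

```python
def measured_bias_squared(k, p_astar, p_a, plateau, k_max):
    """Average (P_A* - plateau) / P_A over bins with k < k_max."""
    ratios = [(ps - plateau) / pa
              for kk, ps, pa in zip(k, p_astar, p_a)
              if kk < k_max]
    if not ratios:
        raise ValueError("no wavenumber bins below k_max")
    return sum(ratios) / len(ratios)

# Synthetic check: a spectrum built as 0.5 * P_A plus a constant plateau
k = [0.02, 0.05, 0.1, 0.3, 1.0]
p_a = [1.0 / kk for kk in k]
p_astar = [0.5 * pa + 0.01 for pa in p_a]
estimate = measured_bias_squared(k, p_astar, p_a, plateau=0.01, k_max=0.15)
```

In this synthetic case the estimate recovers the input factor of 0.5 because the spectrum was constructed without any shape change; for real catalogs the restriction to low wavenumbers is what keeps the shape change from contaminating the average.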

Figure 3: Per cent error in two prescriptions for the $A^*$-bias, obtained by measuring the bias from multiple discrete realizations of the Millennium Simulation dark matter distribution (see text). Circles denote the per cent error of Equation 23; diamonds denote the per cent error of Equation 27. The shaded region highlights the range of less than 5 per cent error; Equation 23 typically falls within this range. Black squares highlight the symbols corresponding to a Euclid-like number density in survey cells of side length $15.6\,h^{-1}$ Mpc. In most cases the error bars are smaller than the markers.

The prescriptions’ accuracy increases as we approach the continuum limit; in this limit $A^* \rightarrow A$, and both Equations 23 and 27 approach unity. As the number density falls, the error in Equation 27 begins to exceed 5 per cent. At the other extreme, the lowest number density we need realistically consider is that at which shot noise overwhelms any power spectrum signal. Even down to this number density, Equation 23 retains its accuracy (in most cases) at the 5 per cent level, although the error rises to 6 per cent in cells of side length $1.95\,h^{-1}$ Mpc. The experiment most closely corresponding to Euclid (highlighted by boxes in Fig. 3) is the case of volume number density $0.0013\,h^{3}$ Mpc$^{-3}$, measured in cells of side length $15.6\,h^{-1}$ Mpc; the resultant pixel number density is $\overline{N} \approx 5$ (that is, $0.01 \times 8^3$), with an error in Equation 23 of less than 3 per cent.

5 Conclusion

Cosmological surveys probing small scales contain significant information which the standard power spectrum does not access. To retrieve this information, one must instead analyze a sufficient statistic (optimal observable), whose power spectrum and mean together capture virtually all cosmological information in the field.

The log overdensity $A$ is essentially optimal for continuous cosmological fields (Carron & Szapudi 2013), but for the discrete fields probed by galaxy surveys, the sufficient statistic is $A^*$ rather than $A$ (Carron & Szapudi 2014). One must therefore characterize the power spectrum $P_{A^*}(k)$ to enable maximal utilization of galaxy surveys’ information. This power spectrum differs from that of $A$ in several respects, the most important of which is the multiplicative bias (Wolk et al. 2015).

In this Letter we have characterized this bias for discrete surveys (Equation 23) in terms of the underlying log distribution and the discretization scheme. By generating multiple discrete realizations of the Millennium Simulation dark matter distribution, we have shown the typical error of this prediction to be less than 5 per cent across the realistic range of pixel number densities. For survey densities comparable to that of Euclid, the error is less than 3 per cent for survey cells of side length $16\,h^{-1}$ Mpc. This prediction is the most significant portion of the task of characterizing the $A^*$-power spectrum.

Pixels of side length $16\,h^{-1}$ Mpc set the maximum accessible wavenumber, and Figure 1 indicates that $P_{A^*}(k)$ begins to depart from $b_{A^*}^2 P_A(k)$ around this point. Even analysis restricted to this scale represents an 8-fold information gain compared to the linear regime. A hypothetical denser galaxy survey, on the other hand, might achieve higher number densities, permitting analysis at much higher wavenumbers. For such a survey, proper treatment of the intermediate- and high-$k$ effects (shape change and discreteness plateau) would be important.

In future work we plan to characterize this high-$k$ discreteness plateau as well as the related shape change; the result will be a complete characterization of the $A^*$-power spectrum. Other work will include a proper accounting for the effects on the log power spectrum of galaxy bias and redshift-space distortion. Preliminary investigation suggests that the log transform renders both of these effects more tractable than they would otherwise be.

In summary, $A^*$ is a sufficient statistic for galaxy surveys and thus captures essentially all cosmological information inherent in such a survey. To access this information, one must be able to predict the power spectrum $P_{A^*}(k)$ for various sets of cosmological parameters. This Letter provides a key component of this prediction, which will facilitate extraction of maximal information from dense galaxy surveys.


The Millennium Simulation data bases used in this Letter and the web application providing online access to them were constructed as part of the activities of the German Astrophysical Virtual Observatory (GAVO). IS acknowledges support from National Science Foundation (NSF) award 1616974.


References


  1. Carron J., Szapudi I., 2013, MNRAS, 434, 2961
  2. Carron J., Szapudi I., 2014, MNRAS, 439, L11
  3. Carron J., Wolk M., Szapudi I., 2015, MNRAS, 453, 450
  4. Klypin A., Prada F., Betancort-Rijo J., Albareti F. D., 2017, preprint, (arXiv:1706.01909)
  5. Laureijs R., et al., 2011, preprint, (arXiv:1110.3193)
  6. Neyrinck M. C., Szapudi I., Szalay A. S., 2009, ApJ, 698, L90
  7. Repp A., Szapudi I., 2017a, preprint, (arXiv:1705.08015)
  8. Repp A., Szapudi I., 2017b, MNRAS, 464, L21
  9. Shin J., Kim J., Pichon C., Jeong D., Park C., 2017, preprint, (arXiv:1705.06863)
  10. Springel V., White S. D. M., Jenkins A., Frenk C. S., Yoshida N., Gao L., Navarro J., et al., 2005, Nature, 435, 629
  11. Uhlemann C., Codis S., Pichon C., Bernardeau F., Reimberg P., 2016, MNRAS, 460, 1529
  12. Wolk M., Carron J., Szapudi I., 2015, MNRAS, 454, 560