Bayesian inference of cosmic density fields from non-linear, scale-dependent, and stochastic biased tracers
Abstract
We present a Bayesian reconstruction algorithm to generate unbiased samples of the underlying dark matter field from halo catalogues. Our new contribution consists of implementing a non-Poisson likelihood including a deterministic non-linear and scale-dependent bias. In particular we present the Hamiltonian equations of motions for the negative binomial (NB) probability distribution function. This permits us to efficiently sample the posterior distribution function of density fields given a sample of galaxies using the Hamiltonian Monte Carlo technique implemented in the argo code. We have tested our algorithm with the Bolshoi -body simulation at redshift , inferring the underlying dark matter density field from subsamples of the halo catalogue with biases smaller and larger than one. Our method shows that we can draw closely unbiased samples (compatible within 1-) from the posterior distribution up to scales of about 1 Mpc in terms of power-spectra and cell-to-cell correlations. We find that a Poisson likelihood including a scale-dependent non-linear deterministic bias can yield reconstructions with power spectra deviating more than 10% at Mpc. Our reconstruction algorithm is especially suited for emission line galaxy data for which a complex non-linear stochastic biasing treatment beyond Poissonity becomes indispensable.
1Introduction
The large-scale structure encodes the key information to understand structure formation and the expansion of the Universe. However, the luminous objects, i.e., the galaxies tracing the large-scale structure, represent only a biased fraction of the total underlying matter governing the laws of gravity.
In the era of precision cosmology, the data analysis methods need to account for the non-linear, biased and discrete nature of the distribution of galaxies to accurately extract any valuable cosmological information. This becomes even more important with the advent of the new generation of galaxy surveys. Many of these surveys rely on emission line galaxies, see WiggleZ
We present in this work a Bayesian approach designed to deal with non-linear stochastic biased tracers. We consider non-Poisson probability distribution functions (PDFs) for the likelihood modeling the distribution of galaxies. In this way we account for the over-dispersion (larger dispersion than Poisson) of galaxy counts. We model the expected galaxy number density relating it to the dark matter density through a non-linear scale-dependent expression extracted from -body simulations (see [?]; [?]; [?]; [?]; [?]). In this way we extend the works based on the Poisson and linear bias models ([?]; [?]; [?]; [?]) following the ideas presented in [?]. In particular, we implement these improvements in the argo Hamiltonian-sampling code, which is able to jointly infer density, peculiar velocity fields and power-spectra ([?]). For the prior distribution describing structure formation of the dark matter field we use the lognormal assumption ([?]). We note however, that this prior can be substituted by another one, e.g., based on Lagrangian perturbation theory (see [?]; [?]; [?]; [?]). Alternatively, one can extend the lognormal assumption in an Edgeworth expansion to include higher order correlation functions ([?]; [?]). We show in this work that our likelihood model is able of yielding unbiased dark matter field reconstructions on Mpc scales based on -body simulations.
This paper is structured as follows. In § Section 2, we describe our statistical approach. Then we present our numerical tests (§ Section 3) and finally we present our conclusions (§ Section 4).
2Method
To infer the dark matter density field from biased tracers such as galaxies or halos, one has to define a target distribution, called posterior PDF. We use the Bayesian framework to express the posterior distribution based on a model PDF for the data, the likelihood, and a model PDF for the signal, the prior. We then recap the Hamiltonian sampling technique used in this work to sample from such a PDF, and present the Hamiltonian equations of motions for our model.
2.1Bayesian approach: the posterior distribution function
In this work we will restrict ourselves to the reconstruction of dark matter fields given a set of biased tracers in real-space. We note that redshift-space distortions can be corrected within a Gibbs-sampling scheme and leave this additional complication for a later work [?].
Let us divide the volume under consideration into a grid with cells. Our input data vector is given by the number counts of halos or galaxies per cell N and the desired signal is the dark matter density . In addition, we need to assume some model for the dark matter distribution and for the bias relating the number counts of galaxies to the underlying dark matter distribution N, including the nonlinear deterministic and stochastic parameters (see §Section 2.2 and §Section 2.3). We will assume in this work that there is one population of tracers and that they can be described with a set of bias parameters. Nevertheless, we show in an appendix (see §Appendix A) that we could sub-divide the sample into various tracer types, each one with its own bias parameters, and combine them in a multi-tracer analysis in a straightforward way, within the methodology presented here. The posterior distribution function of dark matter fields given N, and N can be expressed within the Bayesian framework as the product between the prior and the likelihood up to a normalization
2.2The Likelihood: stochastic bias
Let us start defining the likelihood, which models the statistical nature of the data. First we assume that the biasing relation is known and define the expected number count by N, where NN denotes the ensemble average over the halo or galaxy realization. At this point we need to introduce the stochastic biasing parameters , necessary to model the deviation from the Poisson distribution NNN. The positive (negative) correlation of halos at sub-grid resolution introduces over- (under-) dispersed distributions depending on the halo population and density regime [?]. This effect was already predicted by [?]. We focus on modeling over-dispersion, as under-dispersion is a sub-dominant effect, only present for very massive objects [?]. We note, that stochastic bias has been studied in a number of works, more or less explicitly [?].
For a given distribution function with expectation value , observed number count N and the set of stochastic bias parameters , the likelihood can be written as follows
The product is computed over the number of cells , that corresponds to the number of dimensions of the problem.
Let us consider the negative binomial distribution (NB) and the gravitational thermodynamics distribution by [?] (GT), to describe the deviation from Poissonity, which require a single stochastic bias parameter and , respectively. The Poisson the NB and the GT distributions are written as
respectively. The parameters and are connected to the expectation value and the variance by and , respectively. This implies that the over-dispersion term shows a quadratic and a linear dependence on the expected halo number count for the NB and the GT case, respectively. To obtain a different dependence, one could take the NB expression and include a dependence of on . For we find that the NB and the GT PDFs are equivalent in terms of over-dispersion. For this reason we will focus in the following on the NB PDF and leave a study on the dependence for a later work. One could investigate these dependencies for different population of halos, by taking for instance, ensembles of high resolution simulations like those in [?].
2.3Link between prior and likelihood: deterministic bias
Let us now define the link between the likelihood and the desired signal, i.e., the dark matter density field. This is given by the deterministic bias relating the expected number counts to the dark matter over-density field, which is in general non-linear, scale dependent and non-local. We note that non-locality introduces a scatter which can be absorbed in the stochastic bias [?].
One could expand the dark matter overdensity field (with being the mean dark matter density) in a Taylor series [?]. Alternatively one could follow [?] and expand the series based on the logarithm of the density field (avoiding in this way negative densities allowed in the previous expansion). This model to linear order, corresponding to a power-law of the dark matter field, has been proposed for resolution augmentation of -body simulations [?]. The power law can be interpreted as a linear bias factor in Lagrangian space which undergoes gravitational evolution within the lognormal approximation. It has recently been found that the bias is very well fit by a compact relation including an exponential cut-off: [?], which can be approximated by a Heaviside step-function [?]. This model is in concordance with the [?] and the peak background split formalism [?], which permit the formation of haloes only above a certain density threshold. It has recently been shown to be a crucial ingredient in the halo three point statistics [?]. The deterministic bias model including the density power-law and the density cut-off is given by the following expression for the expected halo/galaxy density:
where is the completeness at each cell . In this way our method is able of dealing with incomplete data samples as well [?]. The normalization ensuring a particular mean number count is given by
where denotes the ensemble average over all possible dark matter realizations (cosmic variance).
2.4The Prior: statistics of dark matter fields
Let us define now the model describing the distribution of dark matter density fields . The Gaussian assumption has since long been commonly used in the literature. This assumption is for instance present in the Wiener reconstruction method [?]. Nevertheless, it is well known that gravity induces deviations from Gaussianity. For this reason a non-Gaussian model is essential to exploit the accuracy of our likelihood model. For the sake of simplicity we will restrict this work to the lognormal assumption (see [?], and its implementation within a Bayesian context [?]). We note, however, that any structure formation model can be implemented at this point as has been shown in a series of recent works [?]. Moreover, the lognormal assumption has recently been shown to satisfy sufficient statistics [?], and thereby extract the maximal cosmological information. However, it is known that the lognormal approximation fails at modeling the large-scale structure in the low density regime [?]. However, it has been shown to be a good approximation for the moderate to high density regime [?]. We leave an extension with more complex models like those including higher order statistics for later work [?].
Let us define the signal distribution s as a logarithmic transformation of the matter density field
with the mean field ensuring that the mean of s vanishes, s. The Gaussian prior for the logarithmically transformed density field is then given as
where the covariance matrix ss (or power spectrum in Fourier-space) depends on the set of cosmological parameters , where ss denotes the ensemble average over all possible lognormal fields. In practice, we assume a linear power spectrum and the corresponding covariance matrix.
2.5Sampling from the posterior distribution function
(0,153) (0,87.5) (0,19.5)10 |
(0,153) (0,87.5) (0,19.5)10 |
To sample from the posterior distribution we rely on the Hamiltonian Monte Carlo sampling technique (HMC), which is a hybrid Markov Chain Monte Carlo (MCMC) approach that uses physically motivated equations of motion to explore the phase space avoiding inefficient random walks. Let us recap here the basics of Hamiltonian sampling. For more details we refer to [?], [?] and more recently [?]). For applications to astronomy we refer to, e.g., [?], [?], and [?]. Here we briefly point to the main features and also emphasize the modifications we implemented. The HMC approach treats the problem as a thermodynamical system. The contact with a heat bath moves the system into equilibrium, i.e., to the statistical space in which samples are drawn from the posterior distribution x. The key idea of the HMC approach consists of moving the system by solving the Hamiltonian equations of motions, which involve a stochastic kinetic term. Let us define in this physical analogy the potential (), kinetic () and Hamiltonian () energy through the following relations
where x is the potential function at the coordinate vector x and p is the kinetic term of the momentum p of the form . Combining Eqs. Equation 3 & ?, we see that the target distribution x can be inferred from
This Eq. shows that drawing samples from yields the desired distribution if one is withdrawing the kinetic term, by marginalizing over the momenta. The momenta are drawn from a multivariate Gaussian distribution with covariance matrix . In order to explore the high dimensional phase space we need to evolve the system. Therefore we solve the Hamiltonian equations of motion
Having fully evolved the initial position x_0,p_0 of the system, we apply a criterion to accept or withdraw the new point in phase space x_1,p_1 that writes
2.6Hamiltonian Equations of motion
As shown in Equation 3 ,Equation 5 , ? and Equation 6, we need to calculate the logarithm of the target distribution x and its gradient term. Let us summarize below the potential energy terms and the corresponding analytic gradient expressions required in our work.
Gaussian prior
We calculate the negative logarithm of Equation 2 as
where incorporates all constant terms of the normalization.
The derivative w.r.t. the signal s writes
Here we substituted the dummy variable with the logarithmic signal variable s we want to evaluate. We note, that more complex schemes connecting the initial conditions with the final gravitationally evolved density fields assume this prior for the primordial fluctuations, but include the Lagrangian to Eulerian mapping in the likelihood (see [?]; [?]; [?]; [?]). We will show below the likelihood expressions for the lognormal case, but include more general expressions in the appendix (see §Appendix B).
Poisson Likelihood
The terms for the Poisson likelihood write as follows
Finally the derivative w.r.t. to signal variable s writes as
Non-Poisson Likelihood: Negative Binomial
We calculate the corresponding terms for the NB distribution:
Corresponding to Equation 8 we can write the derivative of these likelihood functions as
2.7Leap frog scheme
To explore the phase space and solve iteratively the Hamiltonian equations of motion we use the leapfrog scheme
with being the number of steps, the step size and the total pseudo time given by .
(-70,133)— (-60,133)body DM (-69.5,126)- - (-60,126)argo NB (-71,119) (-60,119)argo Poisson (-70,112)— (-60,112)body Haloes (-69.5,105)- - (-60,105)argo Poisson-unity (-208,135) (-208,93) (-208,50) (-206,34) (-206,29) (-206,24) (-200,20) (-206,16) (-206,11) (-206,6) (-220,90) (-112,-15) (-225,20) (-216,20) (-126,-5) (-5,-5) |
(-70,133)— (-60,133)body DM (-69.5,126)- - (-60,126)argo NB&threshold (-70,119)— (-60,119)body Haloes (-208,131) (-208,91) (-208,50) (-206,38) (-200,30) (-206,22) (-200,14) (-206,6) (-220,90) (-112,-15) (-225,20) (-216,20) (-126,-5) (-5,-5) |
3Validation of the method
(-75,20)N (-160,70) (-148,142)7 (-148,122)6 (-148,102)5 (-148,81)4 (-148,59)3 (-148,38)2 (-148,18)1 (-148,-1)0 (-75,20) (0,109) (0,73) (0,35)10 (0,-1)1 |
(-160,70) (-85,20)N (-95,-17) N (-117,-7)2 (-88,-7)4 (-60,-7)6 (-31,-7)8 (-6,-7)10 (-148,142)7 (-148,122)6 (-148,102)5 (-148,81)4 (-148,59)3 (-148,38)2 (-148,18)1 (-148,-1)0 (-117,-7)2 (-88,-7)4 (-60,-7)6 (-31,-7)8 (-6,-7)10 (-75,20) (-95,-17) N (-117,-7)2 (-88,-7)4 (-60,-7)6 (-31,-7)8 (-6,-7)10 (0,109) (0,73) (0,35)10 (0,-1)1 |
To validate our algorithm we take the Bolshoi dark matter simulation [?] at redshift , which was created using following cosmological parameters: .
We consider two different subsamples of the halo catalogue, named and , created with the bound-density-maxima BDM halo finder [?] as inputs to our method.
For we randomly select haloes of the masses between to . These mass limits represent the total mass range of the Boshoi simulation (see Appendix D) and yield a bias lower than one (see Fig. ?)
Additionally we consider the subsample with a mass cut of (see Appendix D), resulting in haloes and yielding a bias larger than one (see Fig. ?). Subsample permits us to test threshold bias described in Equation 1, which is negligible for subsample . To technically overcome divergencies caused by the threshold bias we introduce a Gibbs sampling procedure, described in Appendix B.3
We use the nearest-grid-point (NGP) scheme to compute the number density for each cell on a mesh with cells. We use the dark matter distribution from the Bolshoi simulation as a reference to estimate the accuracy of our reconstructions and define it as the true dark matter field.
To test the different models described in §Section 2, we apply our novel argo code running a series of Hamiltonian Markov chains with different likelihoods and bias assumptions:
Poisson likelihood and unity bias using subsample ,
Poisson likelihood and power-law bias using subsample ,
NB likelihood and power-law bias using subsample ,
NB likelihood and power-law bias including thresholding using subsample .
We disregard the first 2000 iterations of the chains until the power spectra have converged and use a total of 10000 iterations for our analysis. The convergence behavior is estimated through the [?] test in the appendix (see §Appendix C).
Fig. ? shows a slice through the volume of the Bolshoi simulation. This figure illustrates the problem, as the haloes (middle panel) represent in our case-study only 4 to 5 orders of magnitude less matter tracers than the dark matter particles used for the simulation (left panel). We have plotted the means of an ensemble of 10000 reconstructions using the NB model including a power-law bias (upper right panel without and lower right panel with thresholding bias) and find that the relevant structures are in general terms very well recovered. The filaments in the low density regions are less accurately reconstructed (see upper and lower right panel). This is expected from the low signal-to-noise ratio in those regions. Moreover, the models we are using for the low density regime are not optimal. We lack a good description of the dark matter field in the low density regime due to the lognormal approximation. We also find that the high density peaks are less pronounced than in the true dark matter field. This is caused by the smoothing introduced from averaging the reconstructed samples, i.e., the mean estimator yields, as expected, a conservative reconstruction.
3.1Two-point statistics: power spectra
Let us further investigate the two-point statistics of the reconstructed fields in Fourier space, i.e., the power spectrum. This is shown in Fig. ? and Fig. ?. In these plots we can clearly see the scale-dependent bias of the halo samples (solid green line) with respect to the dark matter field (solid black line). The power spectra of the halo fields have been corrected for the mass assignment kernel and shot-noise following [?]. Therefore, the halo power spectra can be trusted up to approximatly , which is about 50% of the Nyquist frequency. In Fig. ? the reconstruction using the Poisson likelihood with unity bias (dashed magenta line) is very close to the halo power spectrum, esentially confirming that the shot-noise correction follows a Poisson model, as the deterministic bias is neglected in this run. We find a considerable improvement by adjusting the power-law bias (dotted red line). Nevertheless, the power spectrum lackes power towards high s, since the dispersion has been underestimated. We register 10% deviation at Mpc. Only when modeling also over-dispersion with the NB likelihood we find an excellent agreement with the dark matter power spectrum up to high s of nearly 1Mpc. This is in agreement with the findings of [?], however, by performing an inference analysis. We also find that the power spectra show a larger scatter for the NB than for the Poisson case (compare cyan with yellow shaded regions, respectively). This is expected from the larger dispersion of the NB model.
3.2Cell-to-cell cross correlation
To further assess the accuracy of our reconstructions, we compare them to the true dark matter field within a cell-to-cell correlation (see Fig. ?). To have a fair comparison we smooth each catalogue with a Gaussian kernel with smoothing length of Mpc(see left panels in Fig. ?). This should compensate for the different number density of haloes and dark matter particles. For the argo reconstructions the average over 10000 samples is shown. We find that the haloes and the Poisson unity bias case show very similar cell-to-cell correlations, which are strongly biased towards high densities. This could have interesting applications to reconstruct the expected halo density field (see Appendix E, Fig. ?) In the low density regime we can see that the reconstruction is more biased than the halo sample. This must be caused by the lognormal assumption, as the bias was set to one in this run. The negative binomial reconstruction shows unbiased cell-to-cell correlations towards high densities, and the same bias at low densities. We can see that the scatter is larger for the NB than for the Poisson unity bias case or the halo distribution. This is expected from the stochasticity included in the model. Non-local contributions, which may be deterministic, are absorbed in our stochastic bias model, thereby, possibly enhancing the scatter. Interestingly, we see in our run (iv) including the thresholding bias, that the biased low halo density (see lower left panel in Fig. ?) is corrected in the low density region (see lower right panel in Fig. ?), achieving a similar accuracy to the run (iii) where haloes are included in the low density regions (see upper right panel in Fig. ?). We also find in run (iv) using subsample (moderate massive objects of minimum mass of ), that the deviation from Poissonity is negligible. We expect an under-dispersed distribution for more massive haloes, clusters or quasars. The method presented in this work is therefore optimal for over-dispersed tracers like emission line galaxies.
4Conclusions
We have presented a Bayesian reconstruction algorithm, which is able to produce unbiased samples of the underlying dark matter field from non-linear stochastic biased tracers.
We find in this study that an accurate reconstruction of the dark matter field, requires both modeling of the stochastic and the deterministic bias. Our results using a power-law bias model and the negative binomial distribution function are very encouraging, as they produce unbiased statistics over a wide range of scales and density regimes.
We have focused on the negative binomial distribution function and discussed for which parameter dependencies it can be equivalent to the gravitational thermodynamics PDF. Furthermore, we model the deterministic bias relating the expected galaxy number density to the dark matter density through a non-linear scale-dependent expression.
We have presented the Hamiltonian equations of motions for our model and implemented them in the argo code. We have shown that this permits us to efficiently sample the posterior distribution function of density fields given a set of biased tracers. We have also introduced a Gibbs-sampling scheme to deal with strongly biased objects tracing the high density peaks.
In particular, we have tested our algorithm with the Bolshoi -body simulation, inferring the underlying dark matter density field from a subsample of the corresponding halo catalogue. We found that a Poisson likelihood yields reconstructions with power spectra deviating more than 10% at Mpc. Our method shows that we can draw closely unbiased samples (compatible within 1 ) from the posterior distribution up to scales of about 1 Mpc in terms of power-spectra and cell-to-cell correlations with the NB PDF.
We have furthermore analytically shown that our method can deal with incomplete data and perform a multi-tracer analysis. Further investigation need to be done here to demonstrate the level of accuracy of such approaches.
We will demonstrate in a forthcoming publication that we can also correct for redshift space distortions in an iterative way. Our work represents the first attempt to deal with nonlinear stochastic bias in a reconstruction algorithm and will contribute towards an optimal analysis of galaxy surveys.
Acknowledgments
MA thanks the Friedrich-Ebert-Foundation. The authors thank Benjamin Granett for encouraging discussions on the multi-tracer analysis. The MultiDark Database used in this paper and the web application providing online access to it were constructed as part of the activities of the German Astrophysical Virtual Observatory as result of a collaboration between the Leibniz-Institute for Astrophysics Potsdam (AIP) and the Spanish MultiDark Consolider Project CSD2009-00064.
AMulti-tracer analysis
Let us assume that we have a set of halo or galaxy samples NN. We can write the joint problem of inferring the dark matter field conditioned on the different halo/galaxy samples by the following posterior PDF
with NN being the joint bias.
If the samples have distinct biasing parameters, we can assume that each of the samples is conditioned on the underlying dark matter field only
Hence, we can write the posterior PDF as
For the Hamiltonian sampler (see § Section 2.6) we need to compute the potential energy, which is defined as
This expression permits us to incorporate any additional galaxy sample and combine different galaxy catalogues with the method presented in this work. The above calculations demonstrate that the dark matter field serves as a common denominator for different halo/galaxy samples and allows one to perform a multi-tracer analysis.
BGradient calculations
Here we calculate the derivatives of the prior and the likelihood (see Section 2.5) separately with signal vector .
b.1Prior
The derivative for the Gaussian prior for the linearized Gaussian field s writes as
b.2Likelihood
We calculate the gradient of the likelihood
as follows
We can now just calculate the derivative of the likelihood w.r.t. and get the final result as
Taking this into account, the derivative of the Likelihood writes as
We note that for , which is identical to the solution of the likelihood.
b.3Gibbs sampling of the thresholding
The density cut-off bias component introduces a numerical instability, since the additional gradient terms diverge around the density threshold. Therefore we follow a Gibbs-sampling strategy by considering the step-function to be constant with respect to the signal . This permits us to neglect all the terms where gradients and logarithms of the step-function appear (see previous section). The scheme can be described as follows:
We consider a vanishing uncertainty for the PDF of the step function given and , which is equivalent to a Dirac delta density function. We have tested this scheme and found it very stable. Nevertheless, we consider investigating the introduction of some uncertainty in the PDF of the step function in future work. This may yield to an even better agreement in the power spectra.
CConvergence of the Hamiltonian sampler with nonlinear non-Poisson likelihoods
(-210,175) (-210,133) (-210,89) (-210,45.5) (-210,4) (-220,90) (-158,-3) (-113,-3) (-64,-3) (-19,-3) (-110,-15)Cell |
There is no unique way to estimate the convergence of a Markov Chain nor a stopping criterion in literature. In principle, multiple chains are supposed to converge to the same stationary distribution and thus all samples should be consistent. Also within one running chain the drawn samples should be consistent after the burn-in phase. So comparing the means and variances within one converged chain to the samples of independently run chains will probe the convergence of the chains. This has been introduced by [?]. The test can be applied to any parameter without loss of generality.
We assume chains of length . The output of the chain is denoted as , with and . In our case would the multidimensional ensemble of the overdensity of cell in our reconstructed volume. For simplicity we show the calculations for a one dimensional observable . Starting from an identical proposal distribution we calculate as follows
Calculate each chain’s mean value
Calculate each chain’s variance
Calculate all chains’ mean
Calculate the weighted mean of each chain’s variance
Calculate the average variance within one chain
The potential scale reduction factor (PSRF) then is defined as
If all chains converge to the same target distribution, we expect the variance within one chain to be close to the variance between the chains, so that the PSRF to be close to one.
DMass ranges of the Bolshoi simulation
In Fig. ? we show the mass function of the Bolshoi simulation in a cumulative histogram. Our cut of creates a subsample with 13000 tracers.
(-210,129) (-210,86) (-210,43) (-210,-1) (-220,90) (-202,-6) (-140,-6) (-73,-6) (-8,-6) (-110,-17) |
EReconstruction of the halo density field
We also run argo with a Poisson likelihood and unity bias. We can see in Fig. ? that this model creates smoothed reconstructions of the biased halo field (compare Fig. ?). The averaged cell-to-cell correlation is in excellent agreement with the outcome of the halo field.
(-75,20) (-95,-17) N (-160,70) (-117,-7)2 (-88,-7)4 (-60,-7)6 (-31,-7)8 (-6,-7)10 (0,109) (0,73) (0,35)10 (0,-1)1 |