# Bayesian physical reconstruction of initial conditions from large scale structure surveys

## Abstract

We present a fully probabilistic, physical model of the non-linearly evolved density field, as probed by realistic galaxy surveys. Our model is valid in the linear and mildly non-linear regimes and uses second order Lagrangian perturbation theory to connect the initial conditions with the final density field. Our parameter space consists of the 3D initial density field and our method allows a fully Bayesian exploration of the sets of initial conditions that are consistent with the galaxy distribution sampling the final density field. A natural byproduct of this technique is an optimal non-linear reconstruction of the present density and velocity fields, including a full propagation of the observational uncertainties. A test of these methods on simulated data mimicking the survey mask, selection function and galaxy number of the SDSS DR7 main sample shows that this physical model gives accurate reconstructions of the underlying present-day density and velocity fields on scales larger than Mpc/h. Our method naturally and accurately reconstructs non-linear features corresponding to three-point and higher order correlation functions such as walls and filaments. Simple tests of the reconstructed initial conditions show statistical consistency with the Gaussian simulation inputs. Our test demonstrates that statistical approaches based on physical models of the large scale structure distribution are now becoming feasible for realistic current and future surveys.

## 1Introduction and Motivation

Ongoing and planned Large Scale Structure (LSS) surveys will measure the distribution of galaxies at an unprecendented level of accuracy in the coming decade. These surveys are expected to vastly enhance our constraints on the physics of cosmogenesis, neutrino physics, and dark energy phenomenology.

How do we compare cosmological models to these surveys? We have an observationally well-supported physical model of the initial conditions. According to this model, a homogeneous and isotropic density field with small, very nearly Gaussian, and nearly scale-invariant correlated density perturbations arose from quantum perturbations in the very early Universe. Gravitational evolution in an expanding background processed these initial conditions into an evolved density field, at first through linear transfer and then through non-linear structure formation. LSS surveys catalogue the positions of observed tracers of this evolved density field in redshift space.

It is now standard to model the initial Gaussian density perturbations statistically in terms of the early universe processes that created them, such as the physics of inflation , the change from matter to radiation dominated universe, neutrino free-streaming, and the acoustic oscillations of photon-baryon plasma. Within the standard cosmology, the evolution and growth of the initial perturbations in an expanding Universe is well-understood in principle, and directly linked to its dominant constituents such as dark matter and dark energy. It therefore seems natural to analyse LSS surveys directly in terms of the simultaneous constraints they place on the initial density field and the physical evolution that links the initial density field to the observed tracers of the evolved density field.

For a variety of good reasons the current state of the art of statistical analyses of LSS surveys is far removed from this ideal. There are some areas where significant progress seems very difficult. In particular, a detailed physical model of the way galaxies arise in response to the spatial fluctuations in the dark matter distribution is not computationally tractable (the “bias” problem). Even for the dark matter alone, reversing the non-linear evolution that link the initial and evolved density field is a fundamentally ill-posed problem [60].

As a consequence, the state of the art in the analysis of galaxy surveys addresses these problems in isolation. In the standard approach, the link between theory and observation is made through the power spectrum. This requires solving two separate problems: the data analysis problem of inferring the power spectrum from an observed sample of tracers given a survey mask and selection function [20]; and the much more difficult theoretical problem of modeling the power spectrum *and the form of its likelihood* for the non-linearly evolved and biased galaxy density field [2].

Three-dimensional inference of the matter distribution from observations requires modeling the statistical behavior of the mildly non-linear and non-linear regime of the matter distribution. The exact statistical behavior of the matter distribution in terms of a probability distribution for the fully evolved density field is not known. Previous approaches therefore relied on phenomenological approximations such as multivariate Gaussian or log-normal distributions incorporating a cosmological power-spectrum to accurately account for the correct two-point statistics of the density fields. Both of these distributions can be considered as maximum entropy prior on a linear and logarithmic scale, respectively, and are therefore well-justified for Bayesian analysis. However, these priors only parametrize the two-point statistics of the matter distribution. Since large scale structure formation through gravitational clustering is essentially a deterministic process described by Einstein’s equations and since the only stochasticity in the problem enters in the generation of initial conditions, it seems reasonable to account for the increasing statistical complexity of the evolving matter distribution by a dynamical model.

In this paper we describe progress towards such an approach that uses data to constrain a set of *a priori* possible dynamical, three-dimensional histories. We use second order Lagrangian perturbation theory (2LPT) as a physical model of the gravitational dynamics that link the initial three-dimensional Gaussian density field to the observed, non-Gaussian density field. In Bayesian parlance our prior for the evolved density is the initial Gaussian density field evolved by a 2LPT model. Using the powerful sampling techniques recently developed by [30] we can use this model as prior information and explore the range of initial Gaussian density fields that are statistically consistent with the data, modeled as a Poisson sample from evolved density fields.

Our method will also automatically generate reconstructions of the large scale velocity field since our model incorporates dynamics. Since the approach is implemented in a fully Bayesian framework we do not produce unique reconstructions, but a set of *samples* which can be interpreted as a probabilistic representation of the information the observations contain about the underlying density (initial and evolved) and the velocity field. In particular, the variations between samples represent the uncertainties that remain in the reconstruction owing to the modeled statistical and systematic errors in the data.

### 1.1Comparison to prior work

In the recent past several papers have pointed out the promise of the lognormal model in fitting to observations of the non-linear density field [41]. While the lognormal approach provides a good model of the 1-point and 2-point functions of the field we will show that Gaussian statistics evolved by 2LPT reproduces those successes but, in addition, reproduces features naturally that are associated with the higher order n-point functions such as filaments and walls. This is not surprising since it is well known that 2LPT reproduces the exact one- and three-point statistics of fully non-linear density fields at large scales, and also approximates higher order statistics very well [57].

The field of velocity field reconstructions has a long history [4]. The contribution of our approach is the imbedding of a non-Gaussian model in a probabilistic framework. Zel’dovich and MAK are, respectively, perturbative and non-perturbative attempts to reconstruct the displacement field linking the initial conditions from tracers of large scale structure and as such also generate estimates of the velocity field. Our approach goes beyond these works in several ways: we combine the inference with a detailed non-Gaussian model of realistic survey features (mask, selection function and shot noise); we implement explicitly a Gaussian prior for the initial density field; and the Bayesian exploration gives a quantitative characterization of the uncertainties in our inferences.

Significant effort has also been invested in establishing accurate representations of the observed Universe in numerical simulations, by constraining simulations by observations [44]. Many of these approaches rely on integrating the observed present day density field backwards in time to the initial state. Such an approach is generally hindered due to incomplete observations of the final state and by spurious erronous enhancement of decaying mode power in the initial conditions during backward integration [60]. The fully probabalistic approach, proposed in this work, naturally accounts for uncertainties of only partially observed final states, by exploring physical reasonable solutions, filtered by the 2LPT model, for the initial conditions which can all lead to the same or similar final observations. Furthermore, our method solely depends on forward evaluations of the model, which therefore accurately handels the issue of decaying mode power. Also note that unique recovery of initial conditions is generally not possible on all scales due to the chaotic nature of the dynamical large scale structure formation process on small scales [60]. These uncertainties will also be accurately accounted for by our method while exploiting information on the initial conditions on all scales accessible to the 2LPT model.

The paper is structured as follows. In Section 2 we discuss the design of posterior distributions for large scale structure inference and show that the complex problem of modeling accurate prior distributions for the evolved non-Gaussian matter distribution can be recast as an initial conditions problem once a physical model for large scale structure formation is specified. Furthermore, we will present the resultant 2LPT-Poissonian posterior distribution for the inference of the three dimensional matter distribution from galaxy surveys. Section 3 provides a brief overview over the Hamiltonian sampling approach employed in the inference framework described in this work, and in Section 4 we present the relevant derivations of the Hamiltonian forces required for an efficient numerical implementation of the Hybrid Monte Carlo sampler. In the following Section 6, we describe the generation of an artificial galaxy survey, inspired by the Sloan Digital Sky Survey data release 7 main sample [1]. In Section 7 we describe the application of our method to this simulated data in order to provide a proof of concept and to estimate the behavior of the algorithm in a realistic setting. We will conclude the paper with a summary and a discussion of the results in Section 8.

## 2The 2LPT-Poissonian posterior

#### The non-Gaussian density prior

As already pointed out in the introduction, inferring the three dimensional large scale structure from observations requires the design of suitable prior distributions for the fully gravitationally evolved density field. Standard approaches such as Wiener filtering employ Gaussian priors, which are assumed to be suitable for the inference of the largest scales[45]. For the inference of the density field in the non-linear regime log-normal priors have been proposed and successfully applied to large scale structure inference problems [41]. More recently, [37] proposed to use Edgeworth expansions to construct prior distributions incorporating also third order moments of the distribution. All of these approaches are based on heuristic approximations to the full problem. Currently, a closed form description of the present day density field in terms of a multivariate probability distribution does not exist.

While there exist considerable difficulties in constructing a suitable probability distribution for the present day density field, the initial seed fluctuations at redshifts obey Gaussian statistics to great accuracy, in agreement with theories of inflation and observations [50]. Therefore, the complicated nature of the present matter distribution solely originates from deterministic physical processes during structure formation. Generally, gravitational interactions introduce mode coupling and phase correlations, such that the statistical behavior of the present day density strongly deviate from a Gaussian distribution [61].

Since initial and final conditions are linked via deterministic structure formation processes, it seems reasonable to formulate the inference problem in terms of simpler statistics at the initial conditions, rather than approximating the complex statistical behavior of the non-linear matter distribution. More specifically, given a suitable model of large scale structure formation we can obtain a prior distribution for the final density contrast for a given cosmic scale factor by marginalizing over the initial conditions:

where, for a deterministic structure formation model, the conditional probability is given by Dirac delta distributions:

Given a model for structure formation, a prior distribution for the present day density field can be obtained by a two step sampling process, by first generating an initial conditions realization from the prior distribution and then propagating the initial state forward in time with a suitable model of large scale structure formation. This process amounts to generating samples from the joint prior distribution of the final and initial conditions:

By discarding the initial density realization, one obtains a realization from the prior distribution for the non-linear final state. Assuming, a multivariate Gaussian process with zero mean and covariance matrix to generate the initial density field the joint prior distribution is given as:

In this work, we will employ a second order Lagrangian perturbation theory (2LPT) model to approximately describe gravitational large scale structure formation (also see Appendix B for an overview over the 2LPT model). 2LPT is able to recover the exact one-, two- and three-point statistics at large scales, and also approximates higher order statistics very well [57]. The 2LPT model therefore naturally reproduces physically reasonable higher order statistics in the matter inference problem without requiring the introduction of additional parameters for the description of higher order statistics. Our approach therefore provides a physically meaningful way of matching the higher order statistics of the evolved density field while requiring no further knowledge other than the initial two-point statistics.

#### The large scale structure likelihood

Above we demonstrated that the task of modeling accurate prior distributions for the statistical behavior of the present day matter distribution can be recast into an initial conditions inference problem once a model for large scale structure formation is specified.

The methods described in this work are general and can be adapted for the inference from any particular probe of the three dimensional large scale structure. We will illustrate our method for the case of optical galaxy redshift surveys.

Galaxies tend to follow the gravitational potential of the cosmic matter distribution and thus can be considered as matter tracers. The statistical uncertainty due to the discrete nature of the galaxy distribution can be modeled as an inhomogeneous Poisson processe [48]. Also note that Poissonian likelihoods have already been successfully employed for non-linear density field inference [41]. Following this approach, the corresponding Poissonian likelihood distribution can be expressed as:

where is the observed galaxy number at position in the sky and is the expected number of galaxies at this position. The mean galaxy number is related to the final density field via:

where is a linear response operator, incorporating survey geometries and selection effects, is the mean number of galaxies in the volume and is a nonlinear, non local, bias operator at position [30].

The joint posterior distribution for the initial conditions and the final density field given the galaxy observations is then obtained by the multiplying equation (Equation 1) and (Equation 2):

We see that given a model of structure formation , the final density field is a free byproduct of the inference process. Consequently, marginalizing equation (Equation 4) over yields the desired target posterior distribution for large scale structure inference:

This result requires several remarks. First, A nearly trivial, but nevertheless important, conclusion to draw from (Equation 5) is that large scale structure inference depends solely on the initial conditions . Consequently, the complex task of designing suitable prior distributions for the inference of the evolved density field can be recast into an initial value problem by modeling a suitable physical model to account for the complexity of the final state.

Second, note that inferring the initial density field does not involve backward in time integration of the physical model. The inference process exclusively requires model evaluations in the forward time direction, counter to the widely held notion that inference of initial conditions requires backward integration of the equations of motion. Nevertheless, traditional approaches of initial conditions inference, also known as ’time machines’, rely on the inversion of the flow of time in the model equations [60]. As pointed out by [60], the disadvantage of backward integration is that it may lead to artificial increase of decaying modes amplitudes introducing erroneous artificial density and velocity fluctuations in the initial conditions. Also note that large scale structure surveys only provide limited information on the full final state, due to survey geometries and statistical uncertainties. These problems generally hinder a unique backward integration of the partially observed final state to the initial conditions.

To alleviate this problem, and to ensure physical meaningful backward integration of non-linear models, one has to augment unobserved regions in the data with statistically meaningful information mimicking the unobserved part of the evolved density field. This in turn requires accurate knowledge on the multivariate probability distribution for the evolved final state , which is not known at present.

Such problems are naturally prevented by casting the reconstruction of initial conditions as the statistical inference problem expressed in equation (Equation 5). Since the proposed method solely depends on forward model evaluations, unobserved regions and statistical uncertainties can be easily dealt with in the initial conditions, which amounts to modeling simple, uncorrelated Gaussian processes. Further, the posterior distribution proposed in equation (Equation 5) accounts for systematics, such as survey geometry, selection effects and biases but also for statistical uncertainties such as the noise of the galaxy distribution and cosmic variance.

We therefore see that statistical uncertainties do not allow a unique inference of the initial conditions from partially observed final states. Consequently, our approach aims at exploring the highly non-Gaussian and non-linear posterior distribution of the initial density field conditional on galaxy observations in order to quantify the uncertainty and significance of the inferred initial conditions. The overall inference process is numerically non-trivial. It is made possible by sophisticated non-linear analysis methods, which will be described in the following.

## 3Hamiltonian sampling

As described in the previous section, exploration of the initial conditions posterior distribution requires numerically efficient Markov Chain Monte Carlo methods to accurately account for all non-linearities and non-Gaussianities involved in the inference process. Unfortunately, unlike as in the Gibbs sampling approach for large scale structure proposed in [33], direct sampling from this posterior is not possible. One therefore has to rely on a sampling procedures with an accept-reject step for the exploration of the high dimensional parameter space encountered in this problem. A major draw back of traditional algorithms of this type, e.g. Metropolis-Hastings, is their dominant random walk behavior and a possible high rejection rate if no suitable proposal distribution can be designed. In this work, we usually deal with about , or more, free parameters which correspond to the initial density contrast amplitudes at the volume elements of the analyzed volume. Due to this high dimensionality of the problem, a low acceptance rate of the Metropolis-Hastings algorithm would result in a prohibitive computational cost for the method. Given this situation, we propose to use a Hybrid Monte Carlo (HMC) method, which in the absence of numerical errors, would yield an acceptance rate of unity. The HMC method exploits techniques developed to follow classical dynamical particle motion in potentials [14]. In this fashion the Markov sampler follows a persistent motion through the parameter space, suppressing the random walk behavior. This enables us to sample with reasonable efficiency in high dimensional spaces [26]. Furthermore, the HMC has already been successfully applied to non-linear large scale structure inference problems [30].

In the following, we will just briefly outline the basic idea of the hybrid Hamiltonian sampling algorithm. More detailed description of the algorithm and its application in large scale structure inference and in general can be found in [14].

### 3.1The HMC

Suppose, we wish to generate samples from a probability distribution , where is a set consisting of the elements . If we interpret the negative logarithm of this posterior distribution as a potential:

and by introducing a ’momentum’ variable and a ’mass matrix’ , as nuisance parameters, we can formulate a Hamiltonian describing the dynamics in the multi dimensional phase space. Such a Hamiltonian is then given as:

As can be seen in equation (Equation 7), the form of the Hamiltonian is such that the joint distribution is separable into a Gaussian distribution in the momenta and the target distribution as:

For this reason, marginalization over all momenta will again yield the original target distribution .

In order to generate a random variate from this joint distribution, being proportional to , one first draws a set of momenta from the distribution defined by the kinetic energy term that is an dimensional Gaussian with a covariance matrix . Then one follows the deterministic dynamical evolution of the system, given a starting point in phase space for some fixed pseudo time according to Hamilton’s equations:

The integration of this equations of motion yields the new position in phase space. This new point is accepted according to the usual acceptance rule:

Since the equations of motion provide a solution to a Hamiltonian system, energy or the Hamiltonian given in equation (Equation 7) is conserved, and therefore the solution to this system provides an acceptance rate of unity. In practice, numerical errors can lead to a somewhat lower acceptance rate. Samples of the desired target distribution are then obtained by simply discarding the auxiliary momenta , which amounts to marginalization over these nuisance parameters. The particular implementation of the hybrid Hamiltonian Monte Carlo sampler for the problem described in this work will be discussed below.

## 4Equations of motion for LSS inference

As described above, the HMC approach permits to explore the non-linear large scale structure posterior by following Hamiltonian dynamics in the high dimensional parameter space. The corresponding forces, required to evaluate these Hamiltonian trajectories can be derived from the large scale structure posterior given in equation (Equation 5). Consequently, the Hamiltonian potential can be written as:

with the potential is given as:

and is given as:

Given the above definition of the Hamiltonian potential one can obtain the required Hamiltonian forces by differentiating with respect to :

Here the prior term is given by:

In contrast the likelihood term cannot be derived as trivially. A detailed derivation for the likelihood term can be found in Appendix D. The likelihood term can be expressed as:

where the vector and the tensor are defined in Appendix D.

Finally, the equations of motion for the Hamiltonian system given in equations (Equation 9) and (Equation 10) can be written as:

and

New samples from the large scale structure posterior can then be obtained by following the dynamical evolution of the Hamiltonian system in phase space.

## 5Numerical Implementation

We named our numerical implementation of the initial conditions sampler BORG (Bayesian Origin Reconstruction from Galaxies). It utilizes the FFTW3 library for Fast Fourier Transforms and the GNU scientific library (gsl) for random number generation [21]. In particular, we use the Mersenne Twister MT19937, with 32-bit word length, as provided by the gsl_rng_mt19937 routine, which was particularly designed for Markov Chain Monte Carlo simulations [54].

### 5.1The leapfrog scheme

The numerical implementation of the HMC sampler employed in this work largely follows the implementation described in [30]. Generally the numerical integration scheme is required to meet some conditions in order to achieve optimal efficiency of the sampler. High acceptance rates require the numerical integration scheme to be highly accurate in order to conserve the total Hamiltonian. Low accuracy in the integration scheme will generally lower the acceptance rate. Additionally, the integrator must be symplectic, meaning exactly reversible, in order to ensure the Markov Chain satisfies detailed balance [14]. For this reason, we implemented the leapfrog scheme to integrate the Hamiltonian system. It is also numerically robust, and allows for simple propagation of errors. In particular, we implement the Kick-Drift-Kick scheme. The equations of motions are integrated by making steps with a finite step size , such that :

We iterate these equations until . Also note that it is important to vary the pseudo time interval , to avoid resonant trajectories. We do so by drawing and randomly from a uniform distribution.

### 5.2Hamiltonian mass

The HMC possesses a large number of tunable parameters contained in the ’mass’ matrix which have an important effect on the performance of the sampler. The Hamiltonian mass defines the inertia of individual parameters when moving through the parameter space. Consequently, too large masses will result in slow exploration efficiency, while too light masses will result in large numerical errors of the integration scheme leading to high rejection rates.

Generally, if the individual would be Gaussian distributed, a good choice for HMC masses is to set them inversely proportional to the variance of that specific [73]. For non-Gaussian distributions it is reasonable to use some measure of the width of the distribution [73]. For example, [59] proposes to use the curvature at the peak. A suitable approach to define Hamiltonian masses is to perform an approximate stability analysis of the numerical leapfrog scheme [73]. Following this approach, we will expand the Hamiltonian forces, given in equation (Equation 13), around a mean signal , which is assumed to be the mean initial density contrast once the sampler moved beyond the burn-in phase. As described in Appendix F approximating the Hamiltonian forces to linear order amounts to approximating the target distribution by a Gaussian distribution. According to the discussion in Appendix F, the Hamiltonian masses should be set as:

where is defined in Appendix D. Calculation of the leapfrog scheme requires inversions of . Considering the high dimensionality of the problem, inverting and storing is computationally impractical. For this reason, we construct a diagonal ’mass matrix’ from equation (Equation 21). We found that choosing the diagonal of , as given in equation (Equation 21), in its Fourier basis yields faster convergence for the sampler than a real space representation since it accounts for the correlation structure of the underlying density field.

## 6Generating Mock observations

In the previous sections we presented the derivation and the implementation of our method. Here we will describe the generation of mock data sets that will be used to test our method. Following closely the description in [30], we will first generate a realization for the density contrast from a normal distribution with zero mean and a covariance matrix corresponding to a cosmological power-spectrum in a a three dimensional Cartesian box with , corresponding to volume elements, and a co-moving box length of . The density field will then be scaled back to an initial time corresponding to a cosmological scale factor by multiplication with a cosmological growth factor , described in Appendix A. In particular, we use a cosmological power-spectrum for the underlying matter distribution, with baryonic wiggles, calculated according to the prescription described in [15] and [16]. We assume a standard CDM cosmology with a set of cosmological parameters (, , , , , ). The Gaussian initial conditions are then propagated forward in time using second order Lagrangian perturbation theory as described in Appendix B. From the resultant particle distribution the final density contrast field is constructed via the cloud in cell (CIC) method [29].

An artificial galaxy catalog is then generated by simulating the inhomogeneous Poisson process given by equation (Equation 2) on top of the final density field . In order to set up a realistic testing environment, we seek to emulate the main features of the Sloan Digital Sky survey as closely as possible. We employ the survey geometry of the Sloan Digital Sky Survey data release 7 depicted in the right panel of figure ?. The mask has been calculated using the MANGLE code provided by [71] and has been stored on a HEALPIX map with [24]. Further, we assume that the radial selection function follows from a standard Schechter luminosity function with standard r-band parameters ( , ), and we only include galaxies within an apparent Petrosian r-band magnitude range and within the absolute magnitude ranges to . The corresponding radial selection function is then obtained by integrating the Schechter luminosity function over the range in absolute magnitude:

where is given in Appendix C. The resulting selection function for the simulated galaxy sample is depicted in the left panel of figure ?. The survey response operator , required to simulate the Poisson process, can be calculated as the product of the survey geometry and the selection function at each point in the three dimensional volume:

Finally, we choose , and perform the Poisson sampling on the grid resulting in a total number of galaxies .

## 7Testing

In this section, we describe the application of our method to the artificial data set described in Section 6. The primary intention of the following tests is to evaluate the performance of our method in realistic situations, taking into account observational systematics and uncertainties.

### 7.1Testing convergence and correlations

The Metropolis Hastings Sampler in general and the HMC in particular are designed to have the target distribution, in our case the large scale structure posterior distribution, as its stationary distribution [55]. For this reason, the sampling process will provide us with samples from the specified large scale structure posterior distribution after an initial burn-in phase. The length of this initial burn-in phase has to be assessed using numerical experiments.

Generally, burn-in manifests itself as a systematic drift of the sampled parameters towards the true parameters from which the artificial data set was generated. This behavior can be monitored by following the evolution of parameters in subsequent samples [19]. To test this initial burn-in behavior we will arbitrarily reduce the power of the random initial density field by a factor of , and observe the drift towards the true underlying values by following successive power-spectra measured from the samples. In figure ? successive power-spectra of the first 800 samples are presented. The plot nicely demonstrates the systematic drift towards the true underlying solution by successive restoration of the true power in the initial density field.

More specifically, we can quantify the initial burn-in behavior by comparing the th power-spectrum measurement in the chain to its true underlying value via:

The results for this exercise are presented in the lower panel of figure ?. It can be nicely seen that the algorithm restores the correct power an all scales and drifts towards preferred regions in parameter space to commence exploration of the large scale structure posterior. It is also important to remark that in this test, we do not observe any particular hysteresis for the poorly constrained large scale modes, meaning they do not remain at their initially set values but efficiently explore the parameter space. This already indicates the ability of our algorithm to account for artificial mode coupling as introduced by the survey geometry.

Burn-in also manifests itself in the initial acceptance rate as shown in the left panel of figure ?. The dip in the initial acceptance rate function corresponds to the point when the sampling algorithm restored the full power-of the initial density field. This has a simple explanation. Initially, since the power was reduced by a factor of ten, the system behaved more or less linear since the displacement of 2LPT particles is small. Once the correct power is restored the displacement of particles increases, leading to a higher non-locality of the system. When the dip in the acceptance rate occurs, the sampler starts exploring physically more reasonable states in the initial conditions which can explain the observations. This process is accompanied by an initially lower acceptance rate. Once the sampler moves to a reasonable region in parameter space the acceptance rate approaches asymptotically a rate of about 84 percent. This high acceptance rate, combined with the fast de-correlation properties, we will demonstrated next, shows that our sampler makes sampling from this multi-million dimensional, non-linear posterior numerically feasible.

In particular, these tests show that the algorithm requires an initial burn-in phase of on the order of samples before providing samples from the target distribution. Also note that the initial burn-in period can be shortened by initializing the algorithm with an initial density field which is closer to the true solution.

Another important issue to study when dealing with Markov Chain Monte Carlo methods is the mixing efficiency of the algorithm. Generally, successive samples in the chain will not be independent but correlated with previous samples. Consequently, the correlation between successive samples determines the amount of independent samples which can be drawn from the chain. We study this effect by following a similar approach as described in [19] or [33].

Assuming all parameters in the the Markov chain to be independent of one another one can estimate the correlation between subsequent density samples by calculating the autocorrelation function:

where is the distance in the chain measured in iterations [19]. The results for this analysis are presented in figure ?, where we plot the correlation coefficients for individual density amplitudes selected by their signal to noise ratio. It can be generally seen that the mixing efficiency is a little lower in regions with low signal-to-noise but the mixing efficiency of the algorithm is very good overall.

We can further define a correlation length of the Markov sampler as the distance in the chain beyond which the correlation coefficient has dropped below a threshold of . As can be seen in figure ? the correlation length is about 200 samples, demonstrating the high mixing efficiency of the algorithm. Despite the high dimensionality of the problem considered here, these tests demonstrate that exploring large scale structure posterior for the initial conditions from observations exhibiting systematic uncertainties and uncertainties is numerically feasible with our method.

### 7.2Large Scale Structure inference

In this section we will discuss the results obtained from the application of the algorithm to the artificial data set, as described in Section 6. The initial and final density fields have been inferred on a cubic Cartesian box with side length of , resulting in a grid resolution of about . While sampling the algorithm will provide matter field realizations for the initial and final 2LPT density fields, generated conditionally on the observed data.

In figure ? we show slices from three different sides through samples of the initial and corresponding final densities as well as through the data. It is immediately visible that the statistics of the initial and final matter fields differ greatly. While the initial density field appears to be very Gaussian, the final density field is clearly non-Gaussian. This demonstrates how the physical 2LPT model for structure formation is able to account for the growing statistical complexity of the density distribution during the evolution from the initial to the final state. Furthermore, comparison of the final density field (middle panels of figure ?) to the data (lower panels of figure ?) demonstrates the recovered structures from the data. One can nicely see that the algorithm tries to extrapolate unobserved filaments between clusters based on the physically reasonable assumptions provided by the 2LPT model. In general, the algorithm nicely recovers the filamentary structure of the matter distribution.

Figure ? illustrates that the algorithm accurately accounts for survey geometry and selection effects by augmenting unobserved or poorly observed regions with statistically correct information. The inferred initial and final density fields possess equal power throughout their entire domains and are not affected by selection or mask artifacts. Individual density samples can be understood as physical, three-dimensional matter field realizations, at least to the degree permitted by the 2LPT model. It is particularly interesting that unobserved and observed regions in the inferred final density fields do not appear visually distinct, a consequence of the excellent approximation of the 2LPT not just to the first but also higher order moments [57]. This is a great advantage over previous methods based on Gaussian or log-normal models where the assumption of a cosmological power-spectrum only specifies the two-point statistics correctly. In particular, the reader may want to compare with figure 2 in [32], where unobserved regions are augmented with a log-normal model unable to represent filamentary structures.

In figure ? we compare the one-point distribution of the inferred initial and final density field measured from the corresponding samples. It can be seen that while the initial density contrast follows Gaussian statistics, the final distribution is highly skewed and represents the expected log-normal like behavior. These results therefore supports our initial claim that the complex problem of modeling a prior distribution for the present fully non-linear density field can be exchanged for an initial conditions inference problem when assuming a physical model which accounts for the increasing statistical complexity of the matter distribution during structure formation.

Further, we estimate the accuracy of the recovered density field by estimating the correlation coefficient between density samples and the true underlying solution as a function of some parameter . The correlation coefficient is given as:

where we will choose to be the signal to noise ratio for the final density field and a specific smoothing scale for the initial density field. The results for these tests are demonstrated in figure ?. The left panel of figure ? depicts the correlation between the true underlying final density field and the final density samples as a function of the signal to noise ratio. It can be seen that the correlation with the truth is generally higher for higher signal to noise ratios. Even in zones that contain just a single galaxy we still get a correlation of about 55 percent. It is also remarkable that the algorithm still provides a 10 percent correlation with the true underlying density field in regions which have not been sampled by galaxies such as centers of voids or masked regions. The right panel of figure ? demonstrates the cross correlation between the true underlying initial conditions and the inferred ensemble mean initial field as a function of filter scale when smoothed with a spherical top hat filter in Fourier space. These results clearly demonstrate that the large scales of the initial conditions can be much easier recovered than the small scale features. This is in agreement with the expectation, since the largest scales behave more linearly than the smaller scales and hence are easier to recover. Particular the shot noise contribution at the smallest scales in the final galaxy observation will smear out features in the initial conditions, since the 2LPT displacement vector for the particles will fluctuate on these scales.

#### Dynamics

Importantly, the algorithm provides dynamical information on the large scale structure given the 2LPT model. In figure ?, we show the comparison between the true underlying velocity field, and the velocity field inferred by an randomly selected sample. It can be seen that the algorithm is able to recover the true underlying velocity field in detail. Our inference is clearly able to identify the true underlying velocity field in noisy or even completely masked regions. This clearly demonstrates the strength of this approach in extrapolating physically reasonable states of the matter distribution even into poorly observed regions.

## 8Discussion and Conclusion

We describe a new method to perform dynamical large scale structure inference from galaxy redshift surveys employing a second order Lagrangian perturbation model. In Section 2 we demonstrated that the problem of constructing suitable prior distributions for the non-linear density field is directly linked to the problem of inferring initial conditions, once a dynamical model for large scale structure formation is given. In this approach the evolved non-linear density field acts as a mere nuisance parameter in the inference process, which shifts the problem of designing prior distributions to physical modeling of the matter evolution dynamics.

Since the method we propose provides an approximation to the non-linear dynamics the algorithm automatically provides information on the dynamical evolution of the large scale matter distribution. By exploring the space of dynamical *histories* compatible with both data and model our approach therefore marks the passage from Bayesian three-dimensional density inference to full four-dimensional state inference.

Particularly, in this work we have employed a 2LPT model as an approximate dynamical description of the large scale structure evolution on the large scales. As described in the literature, the 2LPT model describes the one, two and three-point statistics correctly and represents higher order statistics very well [57]. Hence, the algorithm proposed in this work can exploit higher order statistics, modeled through the 2LPT model, to provide physically reasonable matter field realizations conditional on the observed galaxy distribution.

It is also important to remark that the inference process described in Section 2 requires at no point the inversion of the flow of time in the dynamical model. The inference process therefore solely depends on forward propagation of the model, which consequently alleviates many of the problems encountered in previous approaches to the reconstruction of initial conditions, such as spurious decaying mode amplification. Rather than inferring the initial conditions by backward integration in time our approach builds a non-linear filter using the dynamical forward model as a prior. This prior singles out physically reasonable large scale structure states from the space of all possible solutions.

The resultant inference procedure is numerically highly non-trivial, since the large scale structure posterior distribution has to be evaluated in very high dimensional space. Typically we are dealing with to parameters, corresponding to the voxels used to discretize the domain. In Section 3, we described an efficient Hybrid Monte Carlo implementation for the large scale structure inference problem when employing a dynamical model for large scale structure formation. Further, we discussed some details of the numerical implementation in Section 5.

To provide a proof of concept we test the algorithm in an artificial scenario, based on the characteristics of the Sloan Digital Sky Survey Data Release 7. In particular, as described in Section 6, we use the SDSS DR7 completeness map and realistic selection functions based on the Schechter luminosity function to generate a realistic testing environment essentially emulating the SDSS DR7 main sample.

The major aim of testing the algorithm, described in Section 7, was to estimate the method’s performance in a realistic scenario. An important issue to test when dealing with Markov Chain Monte Carlo methods is the question of how many independent samples can be drawn from the chain. The high efficiency of our Hybrid Monte Carlo scheme permits to explore the posterior distribution with a typical acceptance rate of about per cent while maintaining the correlation length of the chain at or below steps. We estimate the length of the burn-in phase to be about 600 steps. In summary, our tests reveal that the proposed analysis approach is not only within numerical reach but is efficient enough to work well with present day computational resources.

The properties of the inferred large scale structure fields were studied in Section 7.2. It is clear upon visual inspection that our approach returns far more physical reconstructions than previous methods based solely on two-point information [45]. This is particularly obvious for unobserved regions which are augmented with statistically correct information, in order to account for survey geometry and cosmic variance. In the present approach augmented regions are visually indistinguishable from regions containing data. Therefore, the individual matter field samples can be regarded as full physical matter field realizations conditional on the observations, at least to the degree represented by the 2LPT model.

By studying the one-point distributions of the inferred initial and final density fields we demonstrated that the algorithm correctly recovers the Gaussian initial conditions from a galaxy observation which does not exhibit Gaussian but highly skewed log normal like statistics. This demonstrates that the algorithm correctly, accounts for the mode coupling and phase correlations originally introduced to the matter distribution by gravitational structure formation. In addition, it supports our initial claim that the approach of searching for phenomenological approximations to the full probability distribution for the non-linear matter field can be efficiently reformulated as an initial condition problem once a physical model for large scale structure is employed.

To estimate the accuracy of recovered density fields, we studied the correlation between the true underlying and samples of the final density field as a function of the signal to noise ratios. As expected, the correlation can reach up to 90 per cent in the high signal to noise regime, where . In addition, the algorithm still provides a correlation of about 50 per cent between the true underlying density field and the samples in regions where only a single galaxy has been observed. Also note that in regions where the signal to noise is zero, which are either centers of voids or unobserved regions, the algorithm still provides a 10 per cent correlation. This is a clear manifestation of improved inference due to the incorporation of a physical model of large scale structure formation, which exploits additionally three point and higher moment statistics of the density distribution. These tests further demonstrate that the algorithm correctly accounts for systematics such as the survey geometry and selection effects.

Along with the inferred density fields the algorithm also provides dynamical information on the large scale flows. By comparing the true underlying velocity field to the inferred velocity field of an arbitrary sample we demonstrated that the algorithm accurately recovers large scale flows, even in noisy or even unobserved regions. This clearly demonstrates the strength of the method in extrapolating physically reasonable states into poorly observed regions.

The method we describe forms the basis for a sophisticated and extensible dynamical large scale structure inference framework. In future work we will demonstrate the application of the algorithm to a real galaxy survey accounting for additional systematics such as luminosity or color dependent bias. Note that the algorithm as described in this work can be easily extended to account for any kind of non-linear and non-local bias. In particular, the 2LPT model, as employed in this work, can already be interpreted as a non-local, non-linear bias model between the initial conditions and the galaxy observations. It would also be possible to incorporate a halo model based galaxy bias model in the fashion as described by [69]. The combination of the algorithm described in this work and the photometric redshift sampling method proposed in [34], will lead to immediate improvements for the inferred photometric redshifts, since the combination of both algorithms will exploit higher order statistics, whereas the algorithm described in [34] is solely based on two-point statistics. In a similar fashion, dynamical velocity information provided by the 2LPT model can be used to correct for redshift uncertainties in spectroscopic surveys.

Since the algorithm is fully Bayesian, it provides inferred initial and final density fields and also the means of estimating their significance and uncertainties by a sampled representation of the initial conditions posterior distribution. The algorithm will therefore provide accurate information on the initial conditions from which the observed large scale structure originates. These initial density fields may be useful for a variety of scientific projects such as constrained simulations [44]. Since the 2LPT model reconstructs the initial tidal field it may also open up a new way to study the angular momentum build-up of galaxies through tidal torque theory [?].

In conclusion, we presented a new Bayesian dynamical large scale structure inference algorithm which will provide the community with accurate measurements of the three dimensional initial density field as well as estimates of the dynamical behavior of the large scale structure.

## Acknowledgments

JJ is partially supported by a Feodor Lynen Fellowship by the Alexander von Humboldt foundation. BDW acknowledges support from NSF grants AST 07-08849 and AST 09-08693 ARRA, and a Chaire d’Excellence from the Agence Nationale de Recherche and computational resources provided through XSEDE grant AST100029.

## ALinear structure formation

In the linear regime structure formation is governed by a homogeneous growth function acting on the density contrast . For a general cosmology the growth factor can be obtained by numerical solution of the linear growth equation [75]:

## BLagrangian Perturbation theory

In the following we will give a brief summary of second order Lagrangian perturbation theory to the degree required for the present work. More detailed discussion of Lagrangian perturbation theory in general and its application can be found in the literature [57]. Also see [3] for a general overview of Eulerian and Lagrangian cosmological perturbation theory.

In an expanding Robertson Friedman space time the equations of motion for particles solely interacting through gravity are given as [3]:

where is the gravitational potential and is the gradient with respect to the Eulerian coordinates , and the conformal time defined by . In order to solve this set of equations, Lagrangian perturbation theory introduces the following Ansatz for a solution:

where defines the mapping from the particles initial position into its final Eulerian position [3]. Equation (Equation 26) together with equation (Equation 25) yields a non-linear equation for the displacement field which can be solved perturbatively by expanding around its linear solution [3]. To linear order, this perturbative approach yields the famous Zel’dovich approximation given as [?]:

Adding second order terms to the perturbative expansion remarkably improves the results of the first order Zel’dovich approximation. In particular, second order terms account for the fact that gravitational instability is non-local by introducing corrections due to gravitational tidal effects [3]. The second order displacement field is then defined by [3]:

with and is the second order growth factor given as:

which holds for a flat model with non-zero cosmological constant and for to better than per cent accuracy [5].

As has been previously shown, second order Lagrangian perturbation theory recovers correctly the two- and three-point statistics at large scales and further approximates higher-order statistics very well [57]. Also note, that second order corrections to the Zel’dovich approximation are essential to accurately describe the departure of the large scale density field from Gaussian initial conditions [69].

Lagrangian solutions up to second order are curl free, as they follow potential flows [8]. Therefore, it is convenient to introduce the Lagrangian potentials and , such that the approximate solution to equation (Equation 25) can be expressed as [8]:

where the time-independent potentials and are solutions to the following Poisson equations [8]:

and

For an excellent guide to the numerical implementation of the 2LPT model the reader is referred to appendix D of [?].

## CThe Schechter Luminosity function

The Schechter luminosity function is given as [?]:

Note that for the purpose of calculating selection functions the normalization is not required.

## DHamiltonian Forces for the likelihood term

In this section we will discuss the derivation of the Hamiltonian forces for the 2lpt-Poissonian process. To prevent confusion between the variables describing the physical 2LPT model and the variables describing the Hamiltonian inference framework we will re express the 2LPT model in the following form for the purpose of the derivations in this section:

where and are the first and second order displacements fields, respectively.

As described in Section 4 the likelihood term of the Hamiltonian potential is given as:

with given via the kernel estimate as:

and is described by Equation 33 and is a CIC kernel [29]. Furthermore, the Lagrangian displacement vectors are given as:

where is the gradient of the kernel . With these definitions we can write the Hamiltonian forces corresponding to the likelihood term as:

The notation can be simplified by introduce the quantity as:

We can then write:

where we made use of equations (Equation 33) and (Equation 35). It can be seen that the Hamiltonian force is the sum of two vectors. In the following we will therefore discuss each term independently. The first term is exactly the Hamiltonian force expected from a pure Zeldovich approximation without higher order correction terms. We will start by evaluating the Hamiltonian force for the Zeldovich approximation.

The notation can be further simplified by introducing:

We can then write

The Zeldovich Approximation potential was calculated using the Fast Fourier Transform approach, which can be written as:

Using this expression in equation (Equation 39) we yield:

This result looks remarkably similar to equation (Equation 49) and at first sight one might be inclined to straightforwardly solve this equation with FFT techniques. However, it is important to note that the signs have changed in the exponents, and hence equation (Equation 39) can not directly be solved with FFTs. In Appendix E, we show what procedures must be followed in order to apply FFTs to this problem. To further simplify the notation in the following steps we will introduce the quantity , defined as:

With this definition the ZA term of the Hamiltonian force can be written as:

Next, we will discuss the second order Lagrangian term in equation (Equation 38):

The second order Lagrangian potential can be calculated as

with given as:

where the individual potentials are related to the signal via

With these definitions, we can write