Cosmic Inference: Constraining Parameters With Observations and Highly Limited Number of Simulations
Abstract
Cosmological probes pose an inverse problem where the measurement result is obtained through observations, and the objective is to infer values of model parameters which characterize the underlying physical system – our Universe. Modern cosmological probes increasingly rely on measurements of the smallscale structure, and the only way to accurately model physical behavior on those scales, Mpc, is via expensive numerical simulations. In this paper, we provide a detailed description of a novel statistical framework for obtaining accurate parameter constraints by combining observations with a very limited number of cosmological simulations. The proposed framework utilizes multioutput Gaussian process emulators that are adaptively constructed using Bayesian optimization methods. We compare several approaches for constructing multioutput emulators that enable us to take possible interoutput correlations into account while maintaining the efficiency needed for inference. Using Lyman forest flux power spectrum, we demonstrate that our adaptive approach requires considerably fewer — by a factor of a few in Lyman case considered here — simulations compared to the emulation based on Latin hypercube sampling, and that the method is more robust in reconstructing parameters and their Bayesian credible intervals.
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
Lawrence Berkeley National Laboratory
Berkeley, CA 94720, USA
1 Introduction
The field of cosmology has rapidly progressed in the last few decades, going from a largely qualitative picture of the “Hot BigBang” to the now welltested “Standard Model of Cosmology”. This relatively simple model describes current observations at a few percent level using only six parameters (Planck2018). While this has been a great success — driven by deluge of observations — outstanding questions still remain about the nature of dark matter and dark energy, primordial fluctuations relating to the inflation in the early universe, and the mass of neutrino particles. To make further progress towards answering these questions, new ground and spacebased observational missions will be carried out probing into highly nonlinear scales of cosmic structure. Incoming widefield sky surveys such as the Dark Energy Spectroscopic Instrument (DESI), the Large Synoptic Survey Telescope (LSST), the Wide Field Infrared Survey Telescope (WFIRST, WFIRST), and Euclid (Euclid) will provide precision measurements of cosmological statistics such as weak lensing shear correlations, cluster abundance, and the distribution of galaxies, quasars and Lyman absorption lines. Inferring values of the physical model’s parameters using observations of the mentioned sky surveys is a problem which belongs to the class of statistical inverse problems.
The application of Markov chain Monte Carlo (MCMC, Metropolis1953; Gelman2004) or similar Bayesian methods requires hundreds of thousands to even millions of forward model evaluations in order to determine the posterior probabilities of the considered parameters. When modeling the highly nonlinear regime of the structure formation in the universe, each such evaluation is a highperformance computing simulation costing more than CPU hours, as even the most elaborate perturbation theory methods break down around scales of (Carlson2009), and thus cannot be reliably used. Cosmological simulations which from first principles numerically evolve the density field will therefore be essential for the analysis and scientific inference of the future observational data sets.
While at first it may seem that this cost makes the inference computationally unfeasible, it is in fact possible to efficiently sample the parameter space with a dramatically reduced number of simulations, provided that certain smoothness conditions are satisfied. This is achieved through the development of cosmological emulators, that is, computationally cheap surrogate models of expensive cosmological simulations. The pioneering work on these techniques in cosmology was the “cosmic calibration” effort (Heitmann2006; SHabib_KHeitmann_DHigdon_CNakhleh_BWilliams_2007a), resulting in 1% accurate matter power spectrum emulator (Heitmann2010). This was followed by emulating weak lensing observables (Liu2015; Petri2015), galaxy halo occupation model (Kwan2015), halo mass function (McClintock2018), galaxy correlation function (Zhai2018), 1D Lyman flux power spectrum (Walther2018), and 21cm power spectrum (Jennings2019).
In this work we are not concerned with building an emulator for the cosmological simulation models that is accurate over the entire prior range of the input parameter values. Rather we focus on constructing an emulator in an adaptive fashion by preferentially selecting the inputs for the simulation that are more likely to result in the values of the output that are consistent with the observational data. By building up our emulator in this sequential way we strive to avoid performing unnecessary simulations that would be needed to have a globally accurate surrogate. Similar Bayesian optimization for the construction of an emulator of the 1D Lyman flux power spectrum has been recently considered in KKRogers_HBPeiris_APontzen_SBird_LVerde_AFontRibera_2019a. The authors use a different acquisition function than the one used in this work. Furthermore, they do not consider multioutput emulators as we do but build multiple singleoutput ones.
The execution of such iterative workflow can be efficiently executed on highperformance computing platforms using the system described in Lohrmann2017. Briefly, as the workflow requires exploration of the parameter space via simulation trials, each of such simulations becomes a job managed by a parallel scheduler. This approach relies on Henson (Morozov2016), a cooperative multitasking system for in situ execution of loosely coupled codes.
Our treatment of multioutput emulation is different from the previous approach of SHabib_KHeitmann_DHigdon_CNakhleh_BWilliams_2007a which relied on dimensionreducing techniques. Instead of approximating the power spectrum in the basis obtained from a principal component decomposition of the simulator’s covariance structure, we assume a simple separable form for the covariance of the power spectrum as a vectorvalued function (similarly to the approach of SConti_AOHagan_2010a). This allows us to start with a small number of training inputs for the initial emulator construction and iteratively refine the initial design. Additionally, the separable structure of the covariance function allows us to perform training and prediction with the emulator using Kronecker products of small matrices making it efficient.
In this work, we are using 1D Lyman (Ly) flux power spectrum as an output quantity of interest. Following reionization happening around redshift , the diffuse gas in the intergalatic medium (IGM) is predominantly photoionized, but the small residual fraction of the neutral hydrogen gives rise to Ly absorption observed in spectra of distant quasar sightlines (for a recent review, see McQuinn2016). This socalled Ly forest is the premier probe of the IGM and cosmic structure formation at redshifts . As Ly absorption at is sensitive to gas at around the cosmic mean, and at redshifts even to the underdense gas in void regions of the universe (Lukic2015), complex and poorly understood physical processes related to galaxy formation are expected to play only a minor role in determining its structure (Kollmeier2006; Desjacques2006). Forward modeling the structure of the IGM for a given cosmological and reionization scenario is thus a theoretically wellposed problem, albeit requiring expensive cosmological hydrodynamical simulations. The onedimensional power spectrum is a summary statistic of the Ly flux field which measures Fourierspace analogue of 2point correlations in flux absorption along lines of sight to quasars. This statistic can be sensibly used to measure cosmological parameters (Seljak2006), constrain neutrino sector (NPD2015; Rossi2015), to probe exotic dark matter models (Irsic2017), or to measure thermal properties of IGM (Walther2018). Here, we focus on parameters describing the thermal state of the IGM, similarly as in Walther2018. However, there is nothing specific to the Ly forest probe, particular data set or simulations in our inference formalism, thus the method we present here can straightforwardly be applied to other cosmological probes as well.
The outline of the paper is as follows. In Section 2, we describe the details of the forward model for the Lyman power spectrum. In addition to hydrodynamical simulations, we also use an approximate model for postprocessing the thermal state of the IGM which is described in Section 2.3. In Section 3, we provide a highlevel overview of our main approach to inferring cosmological parameters from measurement data. First, we state the general Bayesian inference problem, and, following IBilionis_NZabaras_2014a, show how it can be reformulated using a Gaussian process (GP) emulator as a Bayesian surrogate of the forward model. Next, we provide an outline of the adaptive algorithm developed in TTakhtaganov_JMueller_2018a that we use to construct a GP emulator iteratively. The details of the GP emulator construction and a comparison of the approaches to modeling interactions between emulator outputs are provided in Section 4, as well as in the Appendix A. Results of applying this method on MViel_et_al_2013a data and inferring thermal parameters of the IGM are given in Section 5. Finally, we present our conclusions in Section 6.
2 Forward Model
In this paper, we analyze different ways of inferring the model parameters using flux power spectrum observations. To this end, it is necessary to model the growth of cosmological structure and the thermal evolution of the IGM on scales far smaller (down to kpc)) than those described by the linear perturbation theory. Cosmological hydrodynamical simulations with atomic cooling and UV heating are the only method capable of modeling this process at the percent level accuracy (for approximate methods, see Sorini2016; Lochhaas2016, and references therein). Unfortunately, such simulations are computationally very expensive, CPU hours or more. It is therefore desirable to also have a “reduced” model, which we can evaluate for a large number of points in the chosen parameter space, even if not as accurate as the full simulation model. In the following we will first review our “direct” simulation model, and in Section 2.3 we will present the approximate model based on postprocessing the simulation’s instantaneous temperaturedensity relation and the mean flux.
2.1 Simulations
The hydrodynamical simulations we use in this paper are part of the THERMAL suite of Nyx simulations (Walther2018) consisting of 75 models, each in Mpc box with Eulerian cells and dark matter particles. The Nyx code (Almgren2013) follows the evolution of dark matter modeled as selfgravitating Lagrangian particles, and baryons modeled as an ideal gas on a set of rectangular Cartesian grids. The Eulerian gas dynamics equations are solved using a secondorder accurate piecewise parabolic method (PPM) to accurately capture shocks. Besides solving for gravity and the Euler equations, we also include the main physical processes relevant for the Ly forest. We consider the chemistry of the gas as having a primordial composition of hydrogen and helium, include inverse Compton cooling off the microwave background and keep track of the net loss of thermal energy resulting from atomic collisional processes (Lukic2015). All cells are assumed to be optically thin to ionizing radiation, and radiative feedback is accounted for via a spatially uniform, timevarying UV background radiation given to the code as a list of photoionization and photoheating rates (Haardt2012).
Thermal histories are generated in a similar way as in Becker2011 through rescaling the photoheating by a density dependent factor: . Here, is the baryon overdensity, are the heating rates tabulated in Haardt2012 while and are free parameters adjusted to obtain different thermal histories. Note that while this approach makes it straightforward to change instantaneous densitytemperature relation in the simulation, changing the pressure smoothing scale is more difficult as it represents an integral of (an unknown) function of temperature across cosmic time. We will return to this point later in Section 5.
We choose mock sightlines, or “skewers”, crossing the domain parallel to one of the axes of the simulation grid and piercing the cell centers. Computationally, this is the most efficient approach. This choice of rays avoids explicit raycasting and any interpolation of the cellcentered data, which introduce other numerical and periodicity issues. As a result, from an cell simulation, we obtain up to mock spectra, each spectrum having pixels. We calculate the optical depth, , by convolving neutral hydrogen in each pixel along the skewer with the profile for a resonant line scattering and assuming Doppler shift for velocity (for details, see Lukic2015). We compute this optical depth at a fixed redshift, meaning we do not account for the speed of light when we cast rays in the simulation; we use the gas thermodynamical state at a single cosmic time. The simulated skewers are therefore not meant to globally mock observed Ly forest spectra, but they do recover the statistics of the flux in a narrow redshift window, which is what we need for this work. We have neglected instrumental noise and metal contamination in simulated skewers, but this will not be relevant for the conclusions of this paper.
2.2 Model parameters
Lyman forest simulations include both cosmological parameters as well as astrophysical parameters needed to model the thermal state of the gas, which is significantly affected by hydrogen and helium reionizations. Our main goal is to test and improve the parameter sampling scheme and the emulation method used for constraining the parameters; in order to reduce the computational expense, in this work we will focus our attention on the set of “standard” parameters, , describing the thermal state of the IGM. We keep the cosmological parameters fixed and based on the Planck2014 flat CDM model with , , , , .
The values for thermal parameters and are obtained from the simulation by approximating the temperaturedensity relation as the power law:
(1) 
and finding the best fit using a linear least squares method in – (Lukic2015). Therefore, parametrizes the temperature at mean density in the universe, while is the slope of temperaturedensity relation, expected to asymptote long after reionization ends. To determine the pressure smoothing scale, , we fit the cutoff in the power spectrum of the realspace Ly flux, as described in Kulkarni2015. Realspace Ly flux is calculated using actual density and temperature at each cell in the simulation, but omitting all redshiftspace effects such as peculiar velocities and thermal broadening.
In addition to these three parameters describing the thermal state of the gas, we also need to parameterize the uncertainty in the observed mean flux. In photoionization equilibrium, the mean flux of the Ly forest is proportional to the fraction of neutral hydrogen, and is thus degenerate with the amplitude of the assumed UV background. Therefore, mean flux can be rescaled by finding the constant multiplier of the optical depth of all spectral pixels in the simulation so that the mean flux matches the desired value: . For accuracy considerations of rescaling the mean flux, we refer the reader to Lukic2015.
2.3 Postprocessing instantaneous temperature of the IGM
Modification of the mean flux in postprocessing is physically motivated procedure, as explained in the previous section; however, modifying any other thermal parameter commonly requires running a new simulation. While modifying (3D pressure smoothing) is inherently difficult due to its dependence on the whole thermal history, we can hope to be also able to modify the instantaneous temperature – the one that determines the recombination rate and the thermal broadening (1D smoothing). That way, we can generate different values of {, , }, without the need to rerun expensive simulations. To test this, we use three simulations which have the same cosmological and numerical parameters, and yield the same pressure smoothing parameter at redshift . Temperaturedensity diagrams for these simulations are shown in Figure 2.1. The fiducial simulation has K and ; the “low” simulation differs from fiducial only in that , while the “low” simulation has and all other parameters the same as the fiducial model.
In Figure 2.2 we show power spectra ratios of low and low simulations with respect to the fiducial model. Mean flux is matched in all cases shown. Solid lines (with circles) are ratios of unscaled simulations, and we can see they are significantly different over the range covered by data (). Two dashed lines with square and triangle points are models where temperaturedensity relation has been rescaled to the fiducial one without and with accounting for the scatter in the . We notice the significant improvement in power spectrum, and we can also see that scatter in , as expected from optically thin models, does not play a significant role, although it help in a case of radical change in parameter as seen in the right panel of Figure 2.2. Finally, dashed line represent the case where we both rescale temperaturedensity relation, and use line of sight velocity from the fiducial simulation. This is not a practical solution, as we wouldn’t know these velocities when rescaling a simulation to a target relation, but we want to show that differences in velocity account for most of the remaining error in this rescaling procedure.
While our approximate, postprocessing model does not recover power spectrum at a percent accurate level over the whole range of available data, it is sufficiently accurate for experiments conducted in Section 4. The essential requirements there are to know the “true” answer for a given model, and to be able to evaluate the model a large number of times. Note also that this rescaling procedure is loosing accuracy at high end which is important for thermal constraints and interpreting from high resolution spectra, but over the range relevant to BOSS/eBOSS/DESI observations (), the achieved accuracy is 1%, which should suffice for many studies.
3 Adaptive construction of Gaussian process emulators for Bayesian inference
In this section we outline our main approach to the adaptive construction of the GP emulators. Our approach is designed to solve the specific problem at hand, which is inferring the parameters of interest that serve as input into the forward model of the power spectrum from the observational data. The main ingredient of our approach is a socalled “acquisition” function that guides the selection of the training inputs for the emulator. This acquisition function arises from the form of the likelihood for the measurement data. Thus, before explaining the acquisition process we start by providing a general framework.
We denote the parameters of interest by . We denote the vector of observations by , where each represents a measured value of the Ly forest flux power spectrum at a certain value of the wavenumber . The outputs of the forward model of the power spectrum for a given will be thought of as a dimensional vector (more on this in Section 4). We will work under the assumption of the Gaussian measurement noise with zero mean and known covariance . With this specification of the measurement noise we formulate the likelihood function for the observational data which depends on the value of the forward model at :
Assuming the Bayesian framework, we model the prior information about the parameters as a known distribution . Given the prior and the observed measurements , the solution of the inverse problem is the posterior density obtained by applying Bayes’ rule:
This posterior density can, in principle, be explored with MCMC methods. However, since evaluating the likelihood function requires evaluating the forward model , the direct application of MCMC methods for the current application (or any expensivetoevaluate forward model) is infeasible. This difficulty can be circumvented by using a surrogate model, such as a Gaussian process emulator, in place of the forward model. The Bayesian nature of the Gaussian processes makes them particularly suitable for our framework. Next, we provide a formal review of GP emulators leaving the details out until Section 4 and outline our adaptive approach to sequentially adding training inputs for the emulator.
Suppose that we have collected a set of evaluations of the forward model at input points:
The information in can be used to obtain a surrogate model specified by a random variable with a predictive distribution conditioned on the input and data :
Here denotes the hyperparameters of the predictive model, is the predictive distribution of the assumed model given the hyperparameters, and is the posterior distribution of given data . For the details on obtaining , see Appendix A.
The solution of the inverse problem with a limited number of forward model evaluations can now be formulated using the likelihood of the observational data evaluated using the surrogate model. This restricted likelihood (similarly to IBilionis_NZabaras_2014a) is defined as follows:
where . Once we have an approximation of the distribution , we can integrate the product of the two Gaussians and obtain an approximate formula for the likelihood , see TTakhtaganov_JMueller_2018a and Appendix A for details.
Evaluations of the approximate likelihood involve computing the following misfit function between the observational data and the predictions of the GP emulator
(2) 
for the hyperparameter samples from distribution . In (2), we denote by the mean vector of the GP emulator evaluated at , and by its predictive covariance. This misfit function captures the discrepancy between the observed values of the power spectrum and the predicted values in the norm weighted by the measurement error and the uncertainty of the emulator. Note that for the inputs in the training set , the mean of the GP emulator coincides with the values of the power spectrum , and the covariance vanishes, and thus the exact value of the misfit (and hence the likelihood) is known.
We use the misfit function (2) to inform our choice of the candidate inputs to add to the dataset in order to improve the GP surrogate. The “improvement” we are looking for is to make the approximate likelihood more accurately resemble the “true” likelihood . The overall accuracy of the GP emulator over the support of the prior is unimportant.
Our adaptive approach to extending the training set is based on an acquisition function commonly used in Bayesian optimization—the socalled “expected improvement” (EI) criterion (DRJones_MSchonlau_WJWelch_1998a). In our version of the EI criterion, we look for an input that provides the largest expected improvement in fit with expectation taken with respect to the posterior distribution of the hyperparameters :
(3) 
Here denotes the smallest misfit to the measurement data for the points in the current training set , and takes the positive part of the difference: . This formulation allows us to balance the exploration and the exploitation of the GP emulator in an iterative search for a new training input to maximize the expected improvement in fit function , i.e., we search for the input that provides the largest improvement in fit to the measurement data under the current GP model, conditioned on the misfit being smaller than the current best value for the points in the training set. The outline of the algorithm is given in Algorithm 3. For more details see TTakhtaganov_JMueller_2018a.
For our numerical experiments, we take the search space to be the support of the prior , and we set the threshold value to be . We solve the auxiliary optimization problem in Step 5 by using multistart gradientbased optimization, see TTakhtaganov_JMueller_2018a for details. We set the allowed number of simulations to a large number so that the effective termination condition is the one on line 6. In practice, is dictated by the simulation budget. In the next section, we discuss the details of the construction of the GP emulators for modeling the Ly forest power spectrum.
4 Gaussian Process Emulators for the Ly flux power spectrum
We model the power spectrum as a multioutput Gaussian process with outputs corresponding to the fixed values of the wavenumber . Furthermore, we assume a separable structure of the kernel function, meaning that it can be formulated as a product of a kernel function for the input space alone, and a kernel function that encodes the interactions between the outputs (MAlvarez_LRosasco_NDLawrence_2012a, Section 4). In the following subsections we discuss the details of the construction of GP emulators and contrast our approach to those used in related works.
4.1 Gaussian process emulators
We will treat as a function from to , where , , and is the number of values of the wavenumber for which we have observations. For a given vector of input parameters , we treat the output of the simulation code as a vector of the values of the power spectrum at fixed values of .
Similarly to SConti_AOHagan_2010a, we model as a dimensional separable Gaussian process:
conditional on hyperparameters of the correlation function , and symmetric positivedefinite matrix . This means that for any two inputs and , we have , , and . As indicated by several studies, e.g., HChen_JLLoeppky_JSacks_WJWelch_2016a, the introduction of the regression term does not generally affect the performance of the predictive model, and, in some cases, might have an adverse effect. In our case, adequate results were obtained by simply setting . Furthermore, we set the covariance function to be squaredexponential with hyperparameters :
Note that the choice of the covariance function here is purely empirical and does not affect the forthcoming methodology.
Our treatment of the interoutput covariance matrix differs from that in SConti_AOHagan_2010a and IBilionis_NZabars_BAKonomi_GLin_2013a. There, the authors assume a weak noninformative prior on the matrix and integrate it out of the predictive posterior distribution. Instead, we study four different approaches for treating interactions between outputs, summarized in the following Section 4.2.
4.2 Approaches for dealing with multioutput models

First, we test a naive approach that emulates each output separately with a singleoutput GP. We refer to this approach as (MS) as it corresponds to the MS (many singleoutput) emulator in SConti_AOHagan_2010a. This approach has an increased computational cost (which could be alleviated by training in parallel) compared to training only one GP as the following two approaches do. Also, this approach ignores any dependencies between the outputs.

In our second approach (IND), we treat the outputs as independent given the hyperparameters of the covariance function . This approach has been considered in IBilionis_NZabaras_2014a and TTakhtaganov_JMueller_2018a. It leads to a simple and efficient implementation of a multioutput emulator with a diagonal , see Appendix A for details.

Our third approach (COR) assumes that correlations between different outputs are nonzero but are still independent of the parameter . In this case, we fix the correlation matrix a priori and use it to obtain by rescaling by the training variances. This approach requires specifying the interoutput correlation matrix. In terms of computational efficiency, this approach still allows us to use the Kronecker product structure of the training covariances (see Appendix A) and is as efficient as using the diagonal covariance in approach IND.

Our final approach (INP) is related to Approach COR, but it is computationally more demanding. Here, we treat as another input dimension into the GP model with associated covariance kernel being again a squaredexponential. The main computational cost is associated with inverting the training covariance matrices, which become times larger due to the addition of another input dimension. Conceptually, however, this approach is similar to Approach COR with having entries specified by the squaredexponential kernel for a fixed (learned) value of the lengthscale hyperparameter associated with the input dimension. The difference to Approach COR is that the interoutput correlation matrix now depends on the training data. As our experiments demonstrate, this additional flexibility provides no discernible advantage. This approach is referred to as TI (timeinput) emulator in SConti_AOHagan_2010a where an extra dimension is time rather than .
All of our approaches utilize matrixvalued kernels that fall into the category of separable kernels, see (MAlvarez_LRosasco_NDLawrence_2012a, Section 4) for an overview. Specifically, our Approaches IND, COR, and INP are examples of the socalled intrinsic coregionalization model or ICM (MAlvarez_LRosasco_NDLawrence_2012a, Section 4.2). The ICM approach allows for an efficient implementation of GPbased regression and inference that exploits the properties of the Kronecker product of the covariance matrix, see Appendix A for details. For the adaptive algorithm, efficiency is important for solving the auxiliary optimization problem for the expected improvement in fit function . Solving this optimization problem requires multiple restarts as the function is highly multimodal, leading to a large number of evaluations of the GP emulator predictions and their gradients. For the GPbased inference, having an efficient emulator allows us to carry out MCMC sampling of the posterior with minimum computational effort. As the following numerical study of the considered approaches suggests, in our application it is reasonable to expect that the outputs corresponding to different values of the wavenumber have similar properties with respect to the parameters , therefore, our choice of the separable form of the kernel is justified.
4.3 Numerical study of multioutput approaches
In this section we compare the predictive performance of the different multioutput Gaussian process emulators introduced in Section 4.2 using an approximate model of the power spectrum described in Section 2.3. To obtain a better picture of the dependence of the results on the choice of design inputs, we build emulators using to inputs arranged in a Latin Hypercube Design (LHD) where the minimum distance between the points has been maximized—the socalled maximin LHD (see, e.g., MEJohnson_MLeslie_DYlvisaker_1990a). For each design we perform multiple experiments. For each experiment we generate a large test set consisting of inputoutput pairs for computing various measures of predictive accuracy. In order to avoid an unnecessary cost associated even with the postprocessing procedure of Section 2.3, we precompute the power spectrum on a dense grid in using our automatized Henson system (Morozov2016; Lohrmann2017) and interpolate the outputs using trilinear interpolation to obtain the continuous approximation of the power spectrum in . We have confirmed that the error associated with this interpolation is negligible. Specifically, we take a grid of input parameters covering the box
We restrict our attention to the redshift of and outputs corresponding to the following values , which cover the range of values as the MViel_et_al_2013a measurements that we use later. In the following we simply number these outputs from 1 to 8.
We construct the four multioutput GP emulators (MS, IND, COR, and INP) using fixed LHDs with 10, 20, and 30 points in . In each case, the training output values are normalized (see Appendix A). We fit the hyperparameters of the covariance function using the maximum likelihood approach (MLE in Appendix A). Recall that the COR emulator requires a fixed output correlation matrix . We obtain an estimate of this matrix by first using the INP emulator built with LHD 20 design. Upon training of the INP emulator we obtain an estimate of the length scale for its (output) variable. By pluggingin this estimate into the onedimensional squaredexponential kernel and evaluating the kernel for the values of that we consider, we get a desired estimate of . This estimate remains fixed for all experiments with the COR emulator, see Figure 4.3.
To test the predictive performance of the emulators we use a test set consisting of points in a randomized LHD. We use the following measures of predictive accuracy.

Standardized mean squared error (SMSE): this is the meansquared prediction error scaled by the variance of the test data for each output :
where is the th component of the vector of predictive means of the GP model, and is the mean of the test values , .

Credible interval percentage (CIP), also known as coverage probability: the percentage of the credible intervals that contain the true test value. For an emulator that provides adequate estimates of the uncertainty about its predictions, this value should be close to . We plot the CIP against and look for deviations from the straight line. This statistic can only be plotted for each output separately.

Squared Mahalanobis distance (SMD) between the predicted and the test outputs at a test point :
where and are the predictive mean and the predictive covariance of the multioutput GP emulator, respectively. According to the multivariate normal theory, this distance should be distributed as for all test points. A discrepancy between the distribution of distances for the emulator and a reference distribution indicates a misspecification of the covariance structure between the outputs.
Figure 4.3 reports the SMSE estimates for the three LHDs. In each case we repeat the experiment times, meaning that we generate training and test sets for each of the three designs and use them to train and test each of the four emulators. We observe a noticeable spread of the estimates between the experiments with the same number of design points regardless of the chosen emulator or output index. Increasing the size of the training design generally improves accuracy but does not necessarily combat the variation in results for different experiments. All four emulators demonstrate similar outputmarginal accuracy, and the difference in errors for different outputs is low, with the exception of output 8, which has consistently higher relative errors for all four emulators. The COR and the INP emulators appear to have a similar spread of the error values indicating that fixing the output covariance a priori rather than allowing it to depend on training data has little effect on accuracy. The MS and the IND emulators achieve smaller errors, but also have relatively larger spreads between experiments, which suggests higher dependence on the design than in the cases of COR and INP emulators.
While there is little that separates different multioutput approaches in terms of peroutput SMSE, the results for CIP and SMD provide a more complete picture.
Figure 4.3 reports the CIP estimates for the three LHDs and for selected output indices (1, 4, and 7, in the top, middle, and bottom rows, respectively). Here we report the CIP values averaged over the experiments. In general, we observe that all four emulators underestimate the uncertainty in their predictions, i.e., they are overconfident in their predictions. This trend becomes more pronounced with growing training design size. We view this as an artifact of the MLE approach to GP training, see discussion in (TTakhtaganov_JMueller_2018a, Section 3). Among the emulators, the MS emulator exhibits the worst performance and is most overconfident in its predictions in all cases. The other emulators compensate for the shortcomings of the MLE approach by accounting for correlations between the outputs. The IND emulator does it in a least effective way, and hence performs worse than the COR and the INP emulators. The COR emulator on average comes closest to providing accurate uncertainty estimates.
Figure 4.3 shows the box plots of the distributions of the squared Mahalanobis distances for the test points for the three designs. We only show the results of a single experiment for each design. The reference distribution has mean of and variance of . We observe that although all of the emulators fail to correctly model the posterior covariance (a direct consequence of underestimating the predictive uncertainty), the IND, COR, and INP emulators all do better than MS with the last two coming closest to the reference. If we look at the means of the distributions, then for LHD 10, the average values over the 20 experiments are 54, 38, 27, and 29 for the MS, IND, COR, and INP emulators, respectively. The results shown represent the general trend that, on average, COR and INP emulators tend to represent the predictive covariance better.
In terms of the wallclock time required for training and evaluating the four emulators, the most timeconsuming emulator to train is INP for which the required training time grows exponentially with the size of the training set. The next most timeconsuming emulator to train is MS, however, it can be easily trained and evaluated in parallel since there is no overlap between the outputs. The time for training IND and COR emulators is approximately the same and is considerably less than for the other two since only a single GP needs to be trained.
Based on the performed experiments, in the following we choose to work with IND and COR emulators due to their good performance on the three test statistics as well as good computational efficiency. The question of how to obtain the correlation matrix for the COR emulator still remains open. In our numerical experiments we did not observe significant differences in the emulator performance with the small changes to the correlation matrix . In particular, the estimates of obtained with the INP emulator were similar for all experiments. It appears that the variability in the results due to the training designs outweighs the variability from using approximate correlations.
As mentioned previously, an alternative way of treating the separable form of the covariance is to assume a “noninformative” prior on as in SConti_AOHagan_2010a which allows analytical integration of this matrix out of the predictive distribution. In addition, if the mean is taken to be a generalized linear model with a flat prior, it can also be treated analytically. We deviated from the approach of SConti_AOHagan_2010a for the sake of a simpler and more interpretable model.
5 Numerical study of inference with adaptive GP emulators
In this section we evaluate the performance of the adaptive construction of the GP emulators introduced in Section 3. First, we solve a synthetic inference problem for the same set of three parameters . For this task, we generate synthetic measurement data with the postprocessing model of Section 2.3 and corrupt it with noise. Similarly to Section 4.3, we use a trilinear interpolation of the outputs of the postprocessing model as the forward model for the GP construction and inference. We run Algorithm 3 and use the constructed GP emulator to obtain the posterior of the parameters given the measurements. We compare the results obtained with the adaptive approach to those obtained using the GP emulators built using fixed design (see Section 5.1). Having a relatively inexpensive forward model, we can obtain a reference posterior using the “true” likelihood , which allows us to obtain quantitative measures of the quality of the GPbased posteriors.
Following this detailed study, we use the adaptive algorithm to obtain parameter posteriors using observational data from MViel_et_al_2013a and using the postprocessing model 2.3 as the forward model. These results are reported in Section 5.3.
Finally, we use a simplistic version of our adaptive approach to construct posteriors for an extended set of parameters, namely , using the same Viel data and the THERMAL suite of Nyx simulations^{3}^{3}3http://thermal.joseonorbe.com/. Here, we restrict the search space to the 75 points for the parameters in this dataset and values of giving us a total of possible values of . We use our adaptive algorithm to select a small subset of the points for constructing a GP emulator. We compare the results obtained with this restricted version of our algorithm to those in Walther2018 where all 75 points for and several values of were used for the GP construction. Note that we do not expect to obtain identical results, as, besides differences in implementation (Walther2018 build a GP emulator using the PCAbased approach of SHabib_KHeitmann_DHigdon_CNakhleh_BWilliams_2007a, see below), we also use only MViel_et_al_2013a subset of measurement data for the given redshift. The reason for this approach is our focus on the inference method. However, even using a “restricted” version of our adaptive algorithm, we are able to effectively constrain the parameters using only a fraction of the available inputs and simulation results.
5.1 Stateoftheart dataagnostic approach
Currently, spacefilling LHDs are widely used by the cosmological community for selecting the training points for the construction of emulators. Typically, the number of training points is selected a priori and remains fixed. For example, in the first such work in cosmology (SHabib_KHeitmann_DHigdon_CNakhleh_BWilliams_2007a), the spacefilling LHD with points is selected for the problem of inferring cosmological parameters from matter power spectrum measurements. The authors employ a PCAbased approach to constructing multioutput emulators. Specifically, they use a singular value decomposition (SVD) of the simulations at the training points specified by the LHD. The weights of the SVD are then modeled as independent GP emulators.
A similar approach is taken in Walther2018 for recovering thermal parameters of the IGM using the Ly flux power spectrum. However, the training points do not exactly form an LHD, as one of the parameters, , is difficult to independently vary in a way that is not correlated with the other thermal state parameters and . This is because probes the integrated thermal history which is smooth for each individual physical model of heating and cooling of the IGM during and after reionization process. Of course, in principle one could generate models with abruptly changing instantaneous temperature such that the pressure smoothing does not have enough time to adjust, but we lack physical motivation for such models. In addition, one of parameters—the mean flux of the Ly forest—is not part of the LHD as it can be easily rescaled in the postprocessing, thus its sampling does not require running additional expensive simulations.
5.2 Comparison of the adaptive method and the dataagnostic approach
We continue to use the postprocessing model described in Section 2.3 as the true “forward model” (with the same outputs). We first generate the mock measurement by evaluating the model at a fixed , and we corrupt the resulting measurement vector with noise from a multivariate Gaussian distribution with . This level of measurement noise is consistent with the observational data (MViel_et_al_2013a) that we use later.
In order to analyze how the size of the initial design for the adaptive algorithm influences the obtained solution, we perform experiments with 4, 6, or 8 design points in the threedimensional parameter space ). In each iteration of Algorithm 3, we construct a GP surrogate using the current selection of design points, we solve the auxiliary optimization problem to maximize the expected improvement in fit function, and augment the design set with a single point that provides the largest predicted improvement. We iterate until the relative expected improvement drops below a threshold.
For each selection of the size of the initial training design, we run our algorithm 10 times, each time selecting new initial design points using a maximin LHD. For each of the initial designs, we run the algorithm with two emulator choices: IND and COR (see Section 4.2). For the COR emulator the interoutput correlation matrix is taken to be the same as the one used in Section 4.3 (see Figure 4.3). On average the final designs contain – inputs regardless of the number of points in the initial design or the emulator choice.
We perform MCMC sampling of the posterior using the integrated likelihood based on the constructed GP surrogates (see Appendix A for details). To obtain a quantitative measure of the quality of the obtained posteriors, we compute the Hellinger distance between the GPbased posteriors and the reference posterior obtained by a direct MCMC sampling with the “true” likelihood function, i.e., the likelihood of the measurement data that uses evaluations of our (postprocessing) forward model. The Hellinger distance is a metric for evaluating differences between two probability distributions. It can be related to other commonly used distance measures, such as total variation distance and KullbackLeibler divergence, see, e.g., MDashti_AMStuart_2016a. It has also been recently studied in the context of posteriors obtained with Gaussian process emulators (AMStuart_ALTeckentrup_2018a). The Hellinger distance between and is defined as follows:
(4) 
To compute we approximate the densities and by fitting kerneldensity estimates (KDEs) with Gaussian kernels to the generated samples from the respective posteriors and discretize the integral in equation (4) using 3dimensional Sobol’ sequence with points. The results for the posteriors obtained with the adaptive GPs (10 runs for each initial design) are presented in Figure 5 on the left.
We compare the posteriors obtained with the adaptive approach to the posteriors obtained by training GP models wtih fixed maximin LHDs. Here, we fix the design sizes to be , , and . As in the adaptive case, we train the GP emulators using both IND and COR approaches, and we repeat each experiment times. The results for the posteriors obtained with the fixed designs are shown in Figure 5 on the right.
The comparison of the results in Figure 5 demonstrates the superiority of the adaptive approach. The adaptive approach is able to achieve results that are closer to the reference posterior in the Hellinger distance, and often with fewer design points. Furthermore, the results for the adaptive approach are less spread out, and thus making it more robust and consistent.
In Figure 5, we replot the data from Figure 5 for the adaptive cases, and we show the final design sizes on the axis. As we can see, in most cases, the final design size is either 10, 11, or 12. There is no significant difference in the results for different initial design sizes. However, there appears to be more variability in the final design sizes when the initial design contains only 4 points (LHD 4)—the final designs have between 7 to 13 points. We also observe a small trend of decreasing distance values as the final design size increases from 7 to 10. However, beyond 10 there is no significant difference in the results. The COR emulator appears less likely to terminate with a design consisting of less than 10 points, but in terms of the distance values, COR does not outperform the IND approach.
Results for the Hellinger distances confirm that there is little difference between the IND and the COR approaches for our current application. As far as the choice of the initial design size, starting with smaller designs (4 to 6 points for the current threedimensional problem) leads to fewer forward model evaluations without compromising the quality of the result.
Additionally, we compare the highest posterior density (HPD) intervals of the posteriors obtained with the adaptive algorithm and with fixed LHDs. We estimate the HPD intervals from the posterior samples. The twodimensional projections of the HPD intervals for all parameter pairs are shown in Figures 5.2 (adaptive cases) and 5.2 (fixed cases). We overlay the HPD intervals from all experiments (2 approaches 3 initial designs 10 repetitions) for each case. In red we show the HPD intervals of the reference posterior. We observe a good correspondence between the HPD intervals for the adaptive designs. For the fixed designs the posteriors are more diffused. This comparison reinforces the conclusion that the results obtained with fixed designs are inconsistent, and, therefore, less reliable.
5.3 Results for the adaptive GP with postprocessing model and Viel data
In this section we apply our adaptive GP approach to the problem of inferring the same three parameters using the postprocessing model of Section 2.3 as the forward model of the power spectrum, and data from MViel_et_al_2013a for the redshift of .
The measurement data consists of seven values of the power spectrum for with errorbars that we treat as . The measurement noise covariance, thus, has a diagonal form . We take a uniform (flat) prior on all three parameters defined over the same box as in Section 4.3:
We use the COR emulator with the same as in Section 4.3 and initialize Algorithm 3 with a maximin LHD with points. The stopping threshold for the algorithm is again set to .
Figure 5.3 shows the design points at different iterations of the adaptive algorithm. The first figure shows the initial 4 points arranged in maximin LHD, the figure in the middle has three additional points after three iterations of the adaptive algorithm, and the last figure shows the final design upon termination.
Figure 5.3 shows iteration history of the algorithm with iteration corresponding to the initial design with four points. The blue line in this figure shows the value of the best misfit for the points in the training set in each iteration , , scaled by the best misfit value obtained upon convergence, . A point with a better misfit value is not obtained in every iteration, e.g., in iterations 2 and 3 we have the same as initially. The points added to the design in these iterations serve the purpose of decreasing the overall uncertainty of the GP. The new point added to the training set after iteration provides a reduction in for the first time (see red dot in the middle figure in Figure 5.3). The red dotted line in Figure 5.3 shows one minus the relative expected improvement in each iteration, i.e., . As the algorithm progresses, we expect both lines to approach 1.
5.4 Results for the adaptive GP with Nyx simulations and Viel data
In this section we work with the THERMAL suite of Nyx simulations consisting of 75 models with given parameters (see Figure 5.4). Mean flux we treat in postprocessing as in virtually all Ly power spectrum inference works. To be specific, in this work we produce equidistant values for the parameter in the interval for each of the 75 thermal models, thereby sampling the 4dimensional parameter space .
We run a restricted version of Algorithm 3 with the possible selection of limited to the existing thermal models. We start by selecting the first six points randomly, build a GP emulator using the IND approach, and evaluate the expected improvement in fit function for the remaining points (using Viel data). We select the point that corresponds to the largest improvement in fit, and iterate until no further improvement can be made. In this regime we do not perform a direct optimization over the parameter space but select the inputs out of the available THERMAL data. Our restricted algorithm terminates after iteration using a total of design points. Figure 5.4 shows the plot demonstrating the fit of the corresponding to the best found misfit value.
The other important ingredient is the construction of the prior . We use a flat prior for , , , and in a box constrained by the smallest and the largest values for each parameter. We then truncate this prior to the convex hull of the THERMAL grid points, as is done in Walther2018. The resulting truncated prior is shown in Figure 5.4. This truncation is done to avoid GP extrapolation into a region of parameter space where this IGM model cannot produce an answer, for example, in case of very low but very high pressure smoothing scale (see also discussion in Section 5.1 about the parameter).
The posterior obtained with this prior using the likelihood based on the adaptively constructed GP is shown in Figure 5.4. We observe that the resulting posterior is considerably more constrained than the prior, although we want to draw the reader’s attention to the poor constraints on parameter , which plays very little role at high redshifts (at most!) when the density of Ly absorbing gas is close to the cosmic mean. Overall, the marginal ranges and central values for the parameters are in good agreement with the ones reported in Walther2018. Note, however, that we do not use BOSS (NPD2013) measurements here, but only MViel_et_al_2013a data, as we want to avoid modeling of correlated Silicon III absorption which is present in the BOSS dataset. We prefer maintaining the forward model as simple as possible as the goal of this work is testing and improving inference schemes for the Ly power spectrum. In any case, the fact that our results presented here are completely consistent with Walther2018 indicates that the BOSS dataset contributes negligible information about the thermal state of the IGM at high redshifts.
6 Conclusions
In this work we described the use of an adaptive design of GP emulators of the Lyman flux power spectrum for solving inference problems for the thermal parameters of the IGM. To the best of our knowledge, while GP emulators are extensively used in the cosmological community, the adaptive selection of training inputs has been considered only very recently in KKRogers_HBPeiris_APontzen_SBird_LVerde_AFontRibera_2019a and in our work. Our motivation for this work is primarily the reduction of the number of computationally intensive simulations required to build a GP emulator. By prioritizing the regions of the parameter space that are consistent with the measurement data under the predictive model of the emulator, we obtain the desired reduction without sacrificing the quality of the parameter posteriors. A numerical study that we performed on a problem with an approximate model of the Lyman forest power spectrum and with synthetic measurement data demonstrated that our adaptive approach obtains consistently good approximations of the parameter posterior and outperforms a similarsize fixed design approach based on maximin Latin hypercube designs.
We provided a complete framework for building multioutput GP emulators that predict the power spectrum at the preselected modes . Our numerical study demonstrates that the resulting multioutput emulators that either treat outputs as conditionally independent given the hyperparameters (IND) or explicitly model linear correlations between the outputs (COR) are effective and computationally efficient. Furthermore, our approaches allow us to train emulators using only highly limited number of training inputs, which in turn enables the adaptive selection of additional inputs.
The initial results obtained with our adaptive approach are encouraging. Specifically, for the problem of inferring three thermal parameters of the IGM and mean flux using measurements of the power spectrum at seven values of our approach (constrained to the 75 available Nyx THERMAL simulations) required simulation outputs for only input values to constrain the parameters to the same level of accuracy as in Walther2018 that used substantially larger number of simulations.
Finally, we want to emphasize that we do not consider the “classical” parameterization of to be the best for modeling the state of the IGM, but we nevertheless perform this type of analysis as it is straightforward to make comparisons of our results with previous works. While these parameters have intuitive physical meaning in describing the thermodynamical state of the IGM, there are several practical problems with them. First, they are output rather than input parameters which brings significant difficulties with implementations of sampling and iterative emulation procedure. Second, these 4 parameters are parameterizing each time snapshot instead of the physical model itself. For that reason, we consider models which parameterize the time and duration of the reionization as well as associated heat input (Onorbe2018) as better and we will be using those in future works.
Authors are grateful to Jose Oñorbe for making his Nyx simulations available to us, as well as for providing helpful comments and insights. We thank Joe Hennawi and all members of the Enigma group^{4}^{4}4http://enigma.physics.ucsb.edu/ at UC Santa Barbara for insightful suggestions and discussions. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract no. DEAC0205CH11231. This work made extensive use of the NASA Astrophysics Data System and of the astroph preprint archive at arXiv.org. \onecolumngrid APPENDIX
A Training GP emulators
Let , , represent training inputs with being a vector of outputs corresponding to a particular input. Denote by the vector of the normalized values of the th output, , defined as follows
where
(A1) 
with
The normalized training outputs together form an output matrix . Finally, the vectorized form of is obtained by stacking the normalized training outputs into a vector .
The set of all training inputs we denote by , and the combined set of training inputs and outputs, or training data, by .
Training a GP emulator requires specifying the hyperparameters of its kernel. These are characterised by the posterior distribution (note that conditioning on is implicit but not necessary since it’s fixed)
(A2) 
where
is the likelihood of the training data under the matrixnormal distribution defining the Gaussian process (see Section 4) and is referred to as evidence. Note that is a normalized version of the mean function obtained by applying the linear transformation (A1), and is the interoutput correlation matrix. In order to somewhat simplify this notation let us denote the covariance matrix for the training inputs by . Also, recall that we take . Thus, we have
In a vectorized form we can express the likelihood above as a regular multivariate normal density
(A3) 
How do we obtain the hyperposterior (A2)? Since no analytical form for this posterior exist, we describe it via a particle approximation (IBilionis_NZabaras_2016a, Section 2.6). That is we approximate the hyperposterior with a weighted sum of Dirac delta functions centered at samples :
(A4) 
with weights and .
One way to obtain such a particle approximation is by maximizing the likelihood of the data given by (A3). This leads to a singleparticle approximation
where
is the maximum likelihood estimator (MLE) of the hyperparameter vector. In the case of a flat prior on the hyperparameters this estimator coincides with a maximum a posteriori (MAP) estimator. MLE approach is convenient to work with, since the covariance matrix needs to be only formed once, however, it might lead to somewhat overconfident estimates of predictive uncertainties of the GP emulator . In the case of a sharply peaked likelihood the MLE estimator can be sufficient. Another way of obtaining the particle approximation of the hyperposterior is by sampling it using MCMC techniques. This way provides a more complete picture of the hyperposterior, albeit at an additional computational cost.
Whether using MLE or MCMC approach to obtaining hyperposterior , we need to be able to evaluate the logarithm of the likelihood function (A3) (note that by applying logarithm we preserve the order relation and obtain a betterbehaved function). In the following we derive the expression for the loglikelihood of the data and explain how it can be efficiently computed.
Let and let denote the entries of , then
In our implementation we first compute
where is the Cholesky decomposition of the input covariance, then compute
and set
where is the Cholesky decomposition of the output correlation matrix. Then
The expression above can be further simplified in certain cases. For example, for the IND emulator that treats outputs as independent given the matrix, the output correlation matrix , and the computation of does not require crossterms for .
B Obtaining predictions
In order to obtain a prediction for an untried input , we apply the standard GP formulas obtained by conditioning on the data . Furthermore, by exploiting the Kronecker product structure of the covariance, we can apply standard GP formulas to each output separately. Indeed, as shown in EVBonilla_MKChai_CWilliams_2008a,
where superscript norm indicates that this is the predictive mean of the GP fitted to the normalized outputs, and . For the predictive covariance we get
Upon rescaling we obtain:
with
and