Gaussbock: Fast parallel-iterative cosmological parameter
estimation with Bayesian nonparametrics
We present and apply Gaussbock, a new embarrassingly parallel iterative algorithm for cosmological parameter estimation designed for an era of cheap parallel computing resources. Gaussbock uses Bayesian nonparametrics and truncated importance sampling to accurately draw samples from posterior distributions with an orders-of-magnitude speed-up in wall time over alternative methods. Contemporary problems in this area often suffer from both increased computational costs due to high-dimensional parameter spaces and consequent excessive time requirements, as well as the need for fine tuning of proposal distributions or sampling parameters. Gaussbock is designed specifically with these issues in mind. We explore and validate the performance and convergence of the algorithm on a fast approximation to the Dark Energy Survey Year 1 (DES Y1) posterior, finding reasonable scaling behavior with the number of parameters. We then test on the full DES Y1 posterior using large-scale supercomputing facilities, and recover reasonable agreement with previous chains, although the algorithm can underestimate the tails of poorly-constrained parameters. In addition, we provide the community with a user-friendly software tool for accelerated cosmological parameter estimation based on the methodology described in this paper.
0000-0003-0897-040X]Ben Moews \move@AU\move@AF\@affiliationInstitute for Astronomy, University of Edinburgh, Royal Observatory, Edinburgh EH9 3HJ, UK
0000-0001-9789-9646]Joe Zuntz \move@AU\move@AF\@affiliationInstitute for Astronomy, University of Edinburgh, Royal Observatory, Edinburgh EH9 3HJ, UK
Bayesian methods are now a standard approach to data analysis and inference in astrophysics. In this approach, probabilities are regarded as a means of quantifying information, and in particular the information contained in an experimental dataset about a specific model. This is encoded in the posterior, which combines prior, or external, information with the likelihood from the current data. In most realistic cases, the analytic or direct numerical evaluation of posterior probability distributions is impossible or infeasible, especially in cases that feature many parameters, due to the large volume of high-dimensional spaces. The wide-spread use of Bayesian methods has largely been driven by the availability of sampling algorithms, which can generate samples from a posterior distribution without exploring the full space. These samples can then be used to generate summary statistics like means and limits on individual parameters, or correlations between them. For an overview of the application of Bayesian inference in cosmology, see Trotta2008.
Christensen2001 proposed initial arguments for the use of Bayesian methods for this purpose. They argued for Markov chain Monte Carlo (MCMC) approaches due to their superiority in terms of sampling from, and converging to, the true posterior distribution in the limit for the sample size. The application of MCMC approaches in these early efforts were centered on the Metropolis-Hastings algorithm, which was named after work done by Metropolis1953 and, for the more general case, Hastings1970. The distinguishing feature of this method is the acceptance of new points in the Markov chain if the likelihood ratio of the proposed point and the last point is larger than one, and the probabilistic acceptance of points with a lower ratio if the latter is larger than a random number . This acceptance of less likely points dependent on the likelihoods leads to the sampling from the posterior distribution and, notably, does not require marginalization via the evidence. Knox2001 then followed the proposal of Christensen2001 to constrain the age of the universe to Gyr. Earlier work includes Saha1994, who made use of the Metropolis-Hastings algorithm for for galaxy kinematics, and Christensen1998, who employed the related Gibbs sampler for gravitational wave analysis (Geman1984).
Up to, and into, the new millennium, the Metropolis-Hastings algorithm remains the standard approach to cosmological parameter estimation, which was further supported by the development of a dedicated implementation in CosmoMC (Lewis2002). A variety of algorithms and codes are, however, available for different types of problems. The optimal choice depends on multiple factors, including the dimensionality of the problem, meaning the number of parameters to estimate, the evaluation speed, the need for Bayesian evidences, the availability of analytic derivatives, the ability to sample from marginal distributions, and the possibility and degree of using parallelization.
In more recent years, new MCMC sampling techniques were proposed and subsequently applied to cosmological parameter estimation. Examples include Population Monte Carlo (PMC) techniques introduced by Cappe2004 and Wraith2009, and used by Kilbinger2010 to develop CosmoPMC; affine-invariant MCMC ensembles by Goodman2010, which led to the publication of emcee by Foreman-Mackey2013 and CosmoHammer by Akeret2013; and renewed interest in Approximate Bayesian Computation (ABC) for likelihood-free inference based on simulations to introduce CosmoABC and abcpmc (Ishida2015; Akeret2015). Density estimation likelihood-free inference (DELFI) is a recently developed technique that trains a flexible density estimator to approximate the target posterior, circumventing the large number of simulations that traditional ABC approaches can require (Bonassi2011; Fan2013; Papamakarios2016). Using the JLA sample of 740 type SN Ia supernovae as described in Betoule2014, Alsing2018 subsequently deploy this method to estimate cosmological parameters. Their approach, however, makes a few simplifying assumptions, for example normally distributed priors and likelihoods. Other advanced methods, like the Hamiltonian Monte Carlo approach developed by Duane1987, have also been applied, for example by Hajian2007. These developments are driven by the computationally costly likelihood calculations involved in most MCMC algorithms, trying to alleviate this issue with a certain degree of parallelization due to the increased availability of cheap computing resources, faster convergence or, in the case of ABC, the circumvention of direct likelihood computations altogether. As such methods either fail to reduce the runtime enough for modern problems or have their own pitfalls, for example through an increased risk of introducing biases, the quest for highly parallelized and fast alternatives for cosmological parameter estimation continues. This need is further exacerbated by upcoming missions like LSST and Euclid requiring high-dimensional posterior approximations with a large number of required nuisance parameters predicted to vastly exceed previous missions (Amendola2018).
Nested sampling is a Bayesian take on numerical Lebesque integration for model selection introduced by Skilling2006, which estimates the relationships between the prior mass and the likelihood function. While targeting the calculation of Bayesian evidence, posterior samples are generated as a by-product, and the algorithm was quickly shown to require considerably fewer posterior evaluations (Mukherjee2006). Due to denser and sparser sampling from high-posterior and low-posterior regions, respectively, nested sampling provides increased efficiency when compared to previous MCMC methods. This has lead to extensions and implementations for applications in cosmology, notably CosmoNest by Liddle2006, MultiNest as described in Hobson2008 and Feroz2009, and PolyChord (Handley2015). In cosmology, such implementations have been used in areas as diverse as cosmic ray propagation models, cosmoparticle physics, and gravitational wave astronomy (Trotta2011; DelPozzo2012; Verde2013; DelPozzo2017; Wang2018). A comparison between nested sampling and state-of-the-art MCMC methods can be found in Allison2014, while an investigation of statistical uncertainties in nested sampling is provided by Keeton2011. Nested sampling has also been adopted by other fields of research, including GPU-accelerated implementations, for example in systems biology (Aitken2013; Stumpf2014).
The statistical literature, however, points out various issues of nested sampling methods that have prevented wide-spread adoption in statistics. Among these are the assumption that perfect and independent samples from a constrained version of the prior are drawn in each iteration, the underestimation of sampling errors due to the simulated-weights method it employs, and an asymptotic approximation variance that scales linearly with the dimensionality of a given parameter space (Chopin2010; Higson2018). In this paper, we use example likelihoods from the Dark Energy Survey (DES) collaboration’s analysis of lensing and clustering data, as presented in Abbott2018a. These calculations make use of the CosmoSIS and CosmoLike pipelines, which contain implementations of both Multinest and emcee (Zuntz2015; Krause2017).
In this paper, we propose a parallel-iterative algorithm to address these issues, making use of recent advances in the fields of statistics and machine learning. Our method starts with a preliminary approximation of the target distribution, either through a built-in affine-invariant MCMC ensemble or a user-provided initial sample guess. We then fit a non-parametric model to the sample and employ a variation of sampling-importance-resampling to iteratively move the samples toward the true distribution, repeating these steps until the process converges. In doing so, we also offer a user-friendly Python package to both the cosmology and the wider astronomy community, as well as a general parameter estimation tool for other disciplines dealing with the same issues. We test our implementation on the DES Year 1 (Y1) posterior, and on a fast approximation to it for extended tests.
The remainder of this paper is structured as follows: We cover the relevant methodology, which includes an overview of variational inference for Bayesian mixture models and truncated importance sampling, as well as the mathematical architecture of the proposed approach, in Section 2. In Section 4, we introduce an open-source implementation based on our method, explain computational considerations and parallelization, and provide a quickstart tutorial. Experiments for both toy examples and an approximation of the DES Y1 likelihood are covered in Section 5, together with cosmological parameter estimations runs on supercomputing facilities for the full DES Y1 likelihood. We present and discuss the results of these experiments in Section 6, and summarize our findings in the conclusion in Section 7.
2 Mathematical background
While Bayesian inference has earned its place as a powerful instrument for cosmological research, complex problems often suffer from the need to approximate probability densities that are difficult or outright infeasible to compute. Since Bayesian methods rely on the posterior density, approximations are a necessary evil. In the algorithm presented in this paper, we fit a mixture model to sample from the posterior using variational inference methods, while avoiding fixing the number of mixture components by using a Dirichlet process. We iteratively improve these samples using truncated importance sampling until a convergence criterion is fulfilled.
In this section, we introduce variational inference in Section 2.1, followed by Dirichlet processes and the stick-breaking procedure in Section 2.2. After an overview of sampling-importance-resampling and truncated importance sampling in Section 2.4, we introduce a novel method for parallel-iterative parameter estimation in Section 3.
2.1 Variational inference for model selection
Variational Bayesian methods were originally developed and explored in the context of artificial neural networks, and gained initial interest from research on inference in graphical models (Peterson1987; Peterson1989; Jordan1999). The use of variational Bayesian methods for inference is commonly known as variational inference (VI) and provides a faster and more scalable alternative to Markov chain Monte Carlo (MCMC) methods in many contexts; the main difference between them is that VI treats parameter estimation not as a sampling problem, but instead as an optimization problem. From a research point of view, these methods also garnered the interest of the statistics community because they are currently not as well understood as MCMC methods (Blei2017).
The Kullback-Leibler divergence is a central concept in VI and defines by how much a distribution diverges from another, or how similar it is. For a reference distribution and a proposal distribution , the can be expressed as
The fact that the is an asymmetric difference measure means that , which is due to its calculation as a directional loss of information.
In VI, the is used to find a best-fitting distribution to a set of samples. Let be a selected family of distributions, x and z observations and parameters, respectively, and a prior density that can be related to observations via the likelihood to calculate the posterior . The family member that best matches the posterior can be found in the framework of an optimization problem, finding with some specified tolerance the value of
Calculating this quantity directly is often infeasible, since it is equivalent to measuring the Bayesian evidence. Instead, VI methods (equivalently) maximize an alternative quantity, the evidence lower bound (ELBO),
which is numerically easier to calculate than the . The ELBO also delivers a lower bound for the evidence, which is the reason for the utility of VI for model selection, as covered in Blei2017. A more extensive introduction to VI for the interested reader can be found in Murphy2012.
2.2 Dirichlet processes and stick-breaking
Instead of the more traditional approach of fixing the number of components in the mixture model that we use to model the posterior, we determine the component number from the sample itself at each iteration. This approach employs a Dirichlet Process (DP) as a prior on the number of parameters, which enables the use of a suitable number of components during each step, meaning that changes between iterations are not forced to use the same components.
Developed by Ferguson1973, DPs are distributions of distributions, featuring a base distribution and a scaling parameter , and with realizations denoted as . This area has important applications as the prior in infinite mixture models, and gained new traction in both statistics and machine learning in recent years (Gershman2012). The DP mixture model presented originally by Antoniak1974 takes as the distribution parameter of observation and uses the discrete nature of the base distribution to view the DP mixture as an infinite mixture model. For samples s from such a DP mixture, with sample size , the predictive density with is
As the computation of that density is, again, infeasible, Blei2006 introduce the use of VI for DP mixtures. Bayesian takes on mixture models employ a prior over the mixing distribution as well as over the cluster parameters, with the former commonly being a Dirichlet and the latter being a Gaussian distribution in our case. Given the discrete nature of random measures drawn from a DP, a mixutre of the latter can be viewed as a mixture model with an unbounded number of components (Blei2006).
The Bayesian nonparametrics approach employs the stick-breaking process by Sethuraman1994, which exploits the discrete nature of DPs to calculate the probability mass function, and can be used for Bayesian Gaussian mixtures with an undetermined number of Gaussians. The name is based on the analogy of breaking a stick of unit length into infinite segments by consecutively breaking off , , etc. from the stick until the remainder is truncated to recover a finite-dimensional representation. The truncated variational distribution is then used to approximate the posterior of an infinite DP mixture. As a mathematical description of the subsequent application of VI to DPs with stick-breaking would go beyond the scope of this overview, we refer the reader to Blei2006. A less concise introduction to DPs and Bayesian nonparametrics in general, as well as its applications, is provided in Hjort2010.
As the posterior distribution, given a DP mixture prior, cannot be directly calculated, VI offers a deterministic approach to approximate them. In this paper, we employ the mean-field family within VI to optimize the , using this approach to approximate the joint posterior for parameters of an infinite Gaussian mixture, made finite to a maximum number of components through stick-breaking.
2.3 Importance Sampling
Importance sampling was described early by Kahn1953 in the context of sample size reduction in Monte Carlo methods and continues to inspire a wide array of extensions. This includes physics-specific techniques like umbrella sampling for difficult energy landscapes by Torrie1977 and, more recently, methods to alleviate issues with poorly approximated proposal distributions (Ionides2008). It is also a staple in cosmological parameter estimation, for example in Wraith2009 and Kilbinger2010. Generally, the basic method is a way to estimate distribution properties if only samples from a different, often approximated, distribution are given. Let be the target distribution, an approximate (or proposed) distribution, and some function. The expectation of can then be computed as
with as the number of drawn samples. The ratios in this equation, given as
are called the importance weights or importance ratios and are central to the method.
Sampling-importance-resampling (SIR) is a two-step approach in which the importance weights for a set of samples are calculated, after which an equally-sized subset of these samples is generated by drawing from them with probabilities per sample indicated by the normalized importance weights. For a more in-depth introduction to importance sampling and other related sampling methods, see Bishop2006.
2.4 Counteracting high-weight samples
One issue with this approach is the possibility of overly dominant samples, meaning points with disproportionately high posterior values in comparison to the rest of a set of model samples. During the importance resampling step, this dominance leads to copies of these samples being over-represented, resulting in sets that are too narrow in their densities. We address this issue with truncated importance sampling, an extension of importance sampling that truncates weights of high-value samples based on the total number of drawn samples, with guarantees for finite variance and mean-square consistency under weak conditions (Ionides2008). For a set of samples, proposal distribution posteriors , actual posteriors and a set truncation value with justifications to be set at , the weight of a single sample is updated according to
where is the mean of all importance weights for the sample. With this extension applied to SIR, the weighted drawing of samples is limited by the truncation value. This change improves the behaviour of importance sampling during the early part of the algorithm described below, when the estimated distribution is a poor approximation to the desired posterior , and alleviates the issue of working with relatively small sample sizes for high-dimensional parameter spaces.
3 The Gaussbock Algorithm
Based on Bayesian nonparametrics and machine learning as described in Sections 2.1 and 2.2, we introduce an algorithm that uses variational inference on an infinite Dirichlet process approximated via stick-breaking to fit variational Bayesian Gaussian mixture models (GMMs) in an iterative manner. This algorithm offers a highly adaptive and embarrassingly parallel way to approximate high-dimensional posteriors with computationally expensive likelihoods.
The idea behind our approach is to start from an initial sample guess, either from existing data or a short run of another sampler such as emcee. Based on the work on nonparametric VI by Gershman2012, our algorithm uses a variational Bayesian GMM due to its ability to automatically determine the number of Gaussians required to produce a good fit by stick-breaking an infinite Dirichlet process mixture. For this reason, only the provision of a maximal number of Gaussians is required. The algorithm then determines means and variances for the optimal number of Gaussians given a sample and fitting tolerance. This is followed by drawing a new sample from the fitted mixture model, and a truncated SIR step to move the sample distribution further toward the true the posterior density. These steps are then repeated in an iterative manner until convergence, which is assessed from the change in the variance of importance weights at the end of each iteration:
Fit a variational Bayesian GMM to the sample,
draw a new sample from the newly fitted model,
perform an SIR step for a weighted sample, and
check inter-iteration variances for convergence.
We use a dynamically shrinking tolerance for the model-fitting step. Let be the tuple denoting the initial and final model-fitting tolerances, with , and let be the maximum number of iterations, then the tolerance for a given iteration is
This approach is related to the previously mentioned PMC algorithms initially introduced by Cappe2004, and applied to cosmological inference in Kilbinger2010. It differs, though, by the nonparametric nature of the model, which eliminates the bias present in the predetermined number of distributions in classical GMMs. It also adds the weight truncation to reduce the influence of overly dominant samples with high posterior values in relatively small samples. Our method bears motivational similarity, although considerable methodological differences, to CosmoABC, while not being subject to the potential pitfalls of forward-simulation inference in ABC (Ishida2015).
In Algorithm 3, we provide a more complete pseudo-code representation of the most relevant parts of the approach described in this paper, which we name Gaussbock. For this algorithm, we let be the array of tuples representing the allowed ranges (min, max) per dimension, that is, per parameter. Furthermore, let be the maximum number of iterations, the number of samples to be drawn from each iteration’s model, the number of samples returned after termination, the maximum number of Gaussians available for approximating the posterior distribution, and a safety margin parameter greater than one to draw additional GMM samples in case some fall outside the parameter bounds. Finally, let be the empirical data used for calculating the true likelihood. The specifics of the variational Bayesian GMM (VBGMM) with reasonable default settings, like the prior of the covariance distribution and the parameter initialization for the VBGMM, are omitted in order to keep the pseudo-code concise.
As (mostly adaptive) defaults are used for the settings of Gaussbock, only the initial approximative sample set , the number of iterations , and the handle of a function to compute have to be provided with regard to the above pseudo-code. In addition, if no is provided, the implementation described in Section 4 will automatically run an affine-invariant MCMC ensemble to procure that initial set of posterior-space samples. Since the determination of convergence is a common issue in MCMC methods, Gaussbock uses a convergence threshold that terminates the iterative fitting-resampling procedure if reached before the maximum number of iterations. For this purpose, we measure the difference in inter-iteration weight variances , which takes the form
Here, the average logarithmic importance weight variance is denoted as , providing the arithmetic mean over the dimensionality , meaning the number of parameters.
4 Software implementation
Table 0. \Hy@raisedlink\hyper@@anchor\@currentHrefGaussbock inputs. The table lists all 19 possible inputs that can be set by the user, as well as a short explanation for each input, with the first three being required. The remaining 16 optional inputs are marked with an asterisk before their name and rely on either best-practice values or adaptive defaults that are computed based on the required inputs to generally achieve desirable performance for a wide array of problems.
|1. parameter_ranges||The lower and upper limit for each parameter|
|2. posterior_evaluation||Evaluation function handle for the posterior|
|3. output_samples||Number of posterior samples that are required|
|4. *initial_samples||Choice of emcee or a provided start sample||[‘automatic’, , ]|
|5. *gaussbock_iterations||Maximum number of Gaussbock iterations|
|6. *convergence_threshold||Threshold for inter-iteration convergence checks||None|
|7. *mixture_samples||Number of samples drawn for importance sampling|
|8. *em_iterations||Maximum number of EM iterations for the mixture|
|9. *tolerance_range||The range for the shrinking convergence threshold||[, ]|
|10. *model_components||Maximum number of Gaussians fitted to samples||ceil|
|11. *model_covariance||Type of covariance for the GMM fitting process||‘full’|
|12. *parameter_init||How to intialize model weights, means and covariances||‘random’|
|13. *model_verbosity||The amount of information printed during runtime||1|
|14. *mpi_parallelization||Whether to parallelize Gaussbock using an MPI pool||False|
|15. *processes||Number of processes Gaussbock should parallelize over||1|
|16. *weights_and_model||Whether to return importance weights and the model||False|
|17. *truncation_alpha||Truncation value for importance ratio reweighting||2.0|
|18. *model_selection||Type of model used for the fitting process||‘gmm’ if , else ‘kde’|
|19. *kde_bandwidth||Kernel bandwidth used when fitting via KDE||0.5|
In order to make this algorithm readily available, we have released a Python 3 package incorporating the complete Gaussbock algorithm. The package is installable via pip from the Python Package Index333https://pypi.org, while documentation and source code are available in a public repository444https://github.com/moews/gaussbock.
Figure 4 shows the schematic workflow of Gaussbock, with a choice between an automated initial posterior approximation and a user-provided sample guess, as well as the option to return importance weights and the final fitted model. The automated initial approximation makes use of an affine-invariant MCMC ensemble, as introduced by Goodman2010, through the package emcee by Foreman-Mackey2013 and with parameters like the number of walkers being automatically determined based on the required function inputs. The only required inputs to the tool’s main function are the lower and upper limits for each parameter (‘parameter_ranges’), the handle of a function that accepts a point in the problem’s parameter space and returns its log-posterior value (‘posterior_evaluation’), and the desired number of posterior samples to be returned (‘output_samples’).
An overview of settable inputs is shown in Table 4. We strongly encourage users to provide parameter ranges that are scaled to the interval when setting a threshold for the optional convergence determination (‘convergence_threshold’) due to its mean variance-based functionality. When setting a convergence threshold, we recommend a value of as a best-practice choice that takes increased dimensionalities into account when using the built-in convergence criterion. The implementation uses schwimmbad, a library for parallel processing tools, to provide MPI parallelization on parallel computing architectures (Price-Whelan2017). The use of MPI can be activated with the optional boolean input (‘mpi_parallelization’) being set to ‘True’. Alternatively, for running the algorithms across multiple cores locally, the optional input ‘processes’ can be set to the number of desired cores to be used. The initial sample to start from can be provided by the user, for example through sampling a best-guess approximation or using the posterior from previous research (‘initial_samples’).
An input of special importance is the ability to set the variable parameter for truncated importance sampling (‘truncation_alpha’), the ideal value of which can change based on the difficulty of the posterior approximation problem. By default, the recommended value of is used (Ionides2008). When dealing with, for example, high-dimensional truncated Gaussians or similarly hard-to-approximate shapes, a value of up to can enforce a stronger truncation to combat high-weight samples. Similarly, the truncation value can be set down to a minimum of for weaker importance weight truncation. Interlinked with this input are the dimensionality of the problem and number of samples drawn from a fitted model in each iteration (‘mixture_samples’), as a lower number of samples in a higher-dimensional parameter space increases the odds of importance weights with comparatively high values due to sparse samples. Time requirements and the number of available cores are the limiting factors for such considerations, which is discussed in the experiments in Section 5.
The algorithm’s runtime can be further influenced by limiting the maximum number of Gaussians to be used for fitting a VBGMM during each iteration (‘model_components’). By default, this input is determined based on the number of parameters to be estimated, but user knowledge about the complexity of the target distribution can inform the requirement for lower or higher maximums. Low-dimensional problems with trigger the use of kernel density estimation (KDE) instead of a VBGMM by default, as this density estimation approach is quite powerful in such scenarios, but faces issues in higher-dimensional problems (OBrien2016). The use of KDE or a VBGMM can, however, be forced by the user by setting the respective optional input (‘kde_bandwidth’) to either ‘kde’ or ‘gmm’. The bandwidth used for the KDE functionality can be customized with an optional input (‘kde_bandwidth’). We advise the use of KDE for low-dimensional problems due to the ability to catch hard-to-approximate posteriors in combination with our iterative method, which we demonstrate in Section 5.4.
DES is an imaging survey that covers 5000 square degrees of the southern celestial hemisphere, operating a wide-field camera on the 4-meter VÃctor M. Blanco Telescope located at the Cerro Tololo Inter-American Observatory (Abbott2016a). The survey generates cosmological parameter constraints using multiple different probes, including galaxy clustering and lensing, cluster counts, and supernova measurements. Preliminary constraints from DES Science Verification (SV) data are presented in Abbott2016b and Kacprzak2016 while, more recently, results and data for DES Y1 observations are described by Abbott2018a and have been made public555https://des.ncsa.illinois.edu/releases/y1a1.
In this work, we use the Y1 weak lensing and galaxy clustering measurements as a test of Gaussbock. These measurements consist of a set of 2D two-point correlation functions of galaxy shape and position (“3x2pt”) in tomographic bins by redshift. These functions can be predicted from the cosmological matter power spectrum and redshift-distance relation, both of which are sensitive to the underlying cosmological parameters, and especially to the matter density fraction and the variance of cosmic structure . DES presents constraints on these parameters comparable to those obtained from the CMB with Planck (Aghanim2018). For our experiments, we use the baseline CDM model with varied neutrino density as our test likelihood.
The sampling methods used in the main DES analysis are discussed in Krause2018; they use both the emcee affine-invariant sampler and the MultiNest nested sampling method, and found close agreement between the two methods.
In Section 5.1, we describe a fast-likelihood approximation of the DES Y1 posterior, followed by a performance test for Gaussbock. We explore scaling behavior of our implementation on the same approximation with experiments in Section 5.2. In Section 5.3, we run Gaussbock on the full DES Y1 posterior to test both the performance in real scenarios and the the ability to run fully parallelized via MPI on supercomputing facilities. Lastly, the built-in KDE functionality for low-dimensional problems is stress-tested in Section 5.4.
5.1 Approximating the Dark Energy Survey posterior
The real DES Y1 likelihood is slow to evaluate, taking more than 10 seconds per sample. In order to enable experiments that target controlled assessment and scaling behavior, we use an approximation to the DES Y1 posterior with a multivariate truncated Gaussian distribution, for which we employ the mean and covariance values for 26 cosmological and nuisance parameters, as well as their limits from the respective DES data release. This approach results in an extremely fast parameter set evaluation based on a DES Y1 approximation suitable for our purposes. A perfectly Gaussian approximation to the posterior would be an artificially easy test of a model that fits Gaussians; our posterior is truncated within a few sigma of the peak in many of its parameters, and thus provides a reasonable challenge.
As discussed in Section 4, we use an increased truncation value for the SIR step of Gaussbock, which we set to , and a convergence threshold of that follows the previously outlined best-practice guidelines and triggers the use of the built-in convergence determination. The number of samples per iteration is set to , with the reasoning behind this choice further outlined in Section 5.2. As we want to weight the returned posterior samples with their importance weight, we activate the additional return of the final model and importance weights. Apart from these settings, we use the default behavior of Gaussbock by not providing other optional inputs. Table 5.1 shows the lower and upper limits for cosmological and nuisance parameters employed in our approximation.
Table 0. \Hy@raisedlink\hyper@@anchor\@currentHrefCosmological and nuisance parameter limits for a fast approximation of the DES Y1 posterior. The lower and upper limits shown as open intervals closely follow prior distribution features previously used by DES for data from the first year of observations (Abbott2018a).
|Lens galaxy bias|
The results of this experiment are shown in Figure 5.1 and demonstrate the abilily of Gaussbock to recover correct constraints. Starting from a short and unconverged emcee chain, for which distributions are shown in yellow, the importance-weighted posterior samples marked in blue closely match the long-run emcee samples highlighted in red. The achieved level of agreement is good enough to make posterior contours and distributions for the target distribution and the importance-weighted samples hard to separate by eye. While the distributions for unweighted posterior samples in green show a good agreement with the long-run samples, weighting the output samples with the optionally returned importance weights pushes the sample distributions further toward to target posterior, thus validating the additionally provided functionality related to KDE for low-dimensional parameter estimation. While this experiment is based on an approximation of the full DES Y1 posterior, it offers a suitable testbed to prepare for the full-scale run described in Section 5.3.
Another factor of interest is the iterative behavior of our algorithm, as Gaussbock is supposed to continuously improve the agreement of its internally generated samples with the true posterior distribution. In Figure 5.1, we illustrate this behavior, showing the gradual improvement of the constraints. The plots depict the morphing and shifting behavior of Gaussbock samples for the number of iterations as even integers in the interval [0, 10]. The cosmological parameters chosen for this experiment are the same as in the left-hand panel of Figure 5.1.
The evolution across the different panels showcase the algorithm’s ability to start from a very rough sample guess and gradually move toward the target distribution. The latter is closely approximated by an extremely long emcee chain as an ideal sample. As demonstrated through this visualization, the algorithm first shifts generated samples toward the true mean with a lower-variance distribution, followed by incrementally spreading out to create a close fit to the target distribution.
5.2 Exploration of scaling behavior
In algorithms designed for the use with highly parallelized architectures, as well as in approaches for high-dimensional estimation problems, the question of how the algorithm in questions scales for different factors is important. For this reason, we now explore the scaling behavior of our algorithm. We quantify the time to convergence using the criterion introduced in Section 3, measured on the fast DES Y1 approximation covered in Section 5.1.
Higher-dimensional problems can, in general, be assumed to lead to a greater complexity of the estimation procedure, forcing Gaussbock to morph and shift the distribution in each iteration across more dimensions. We test our implementation for dimensionalities , up to the full set of cosmological and nuisance parameters in our DES Y1 approximation. We perform this parameter estimation 50 times for each number of dimensions to create confidence intervals, with the respective subset of parameters being randomly selected. In each case, we use the convergence threshold . The left panel of Figure 5.2 plots the number of iterations required to reach convergence versus the number of estimated parameters, showing the rise with problems of increased dimensionality.
The 95% confidence intervals around the average number of iterations to convergence highlight the larger variance with increasing numbers of parameters. The average number of 26.6 iterations for estimating the full set of 26 parameters provides an indicator for the full DES Y1 posterior computation in Section 5.3.
The second question in terms of scaling behavior targets the embarrassingly parallel part of our algorithm, as we can vary the number of samples drawn at each iteration. Although the ability to parallelize across large numbers of cores is one of the strengths of Gaussbock, and while access to parallel computing architectures is wide-spread in modern cosmology, the number of available cores for a given task still faces upper limits. As described in Section 4, a higher number of samples drawn from a given iteration’s fitted model is generally preferable, which translates to a preference for a higher number of cores due to the subsequent parallelization of the truncated SIR step. This poses the question of the scaling behavior of this benefit, as the required number of iterations to convergence is expected to decrease with a higher number of samples per iteration.
The right panel of Figure 5.2 shows the scaling behavior of the required number of iterations to convergence versus the number of samples drawn from the fitted model during each iteration. We perform 50 Gaussbock runs per number of samples to create confidence intervals, in the interval [5000, 40000] and in steps of 5000.
Let be the number of required iterations to convergence, the number of available cores, and the number of used samples per iteration. Then the total number of posterior value calculations per core over the course of a Gaussbock run is . Increasing numbers of samples constrain the variance of required iterations, and the dashed black line in the right panel of Figure 5.2 indicates an optimal trade-off (in terms of total core time) between the two variables as at for the number of samples, which informs our input choices in Section 5.1. This visualization also bears resemblance to the ‘elbow criterion’ in cluster analysis, which determines the optimal number of clusters by plotting that number against the explained variance (Thorndike1953).
Lastly, we investigate the convergence behavior of Gaussbock as a follow-up to Figure 5.1, to ensure that both the convergence check itself and the recommended calculation of a convergence threshold behave as intended. The algorithm is run on the same parameter estimation problem as in Section 5.1, for a total of 27 iterations to cover the previously computed mean number of iterations to convergence of 26.6. As for previous tests, we run this experiment 50 separate times to generate 95% confidence intervals. The results are shown in the left panel of Figure 5.2, starting after the first 10 iterations to cover fine-tuning behavior after the initial morphing and shifting explored in Figure 5.1, and with the dashed black line indicating the convergence threshold set to . The figure demonstrates both convergence behavior for the threshold calculation and narrow confidence intervals for multiple experiments.
Similarly, the right panel of Figure 5.2 shows the same experimental setup with 50 runs in order to recover confidence intervals, but for the introduced in Section 2.1, between a given iteration’s posterior approximation as the proposal distribution and the target distribution as the reference distribution. The plot shows a remarkably consistent and well-constrained behavior, providing complementary evidence for the evolution over increasing numbers of iterations shown in Figure 5.1 and the scaling behavior shown in the left panel of Figure 5.2, and demonstrates that the built-in convergence criterion does not level out before this alternative measure based on differences to a known target distribution.
5.3 The full Dark Energy Survey posterior
In order to expose our method to a fully realistic experiment without approximations, we apply Gaussbock to the full DES posterior from the DES Y1 experiments and data release (Abbott2018a; Abbott2018b; Krause2018). We use the public CosmoSIS implementation of the public Y1 likelihood, which includes CAMB as descibed by Lewis2002 and Howlett2012, and Halofit as introduced in Smith2003 and Takahashi2012 to compute distances and matter power spectra, CosmoSIS-specific modules for the Limber integral and other intermediate steps, and Nicaea666http://www.cosmostat.org/software/nicaea for the computation of real-space correlations from Fourier space values (Kilbinger2009). Since the public implementation of the Y1 likelihood differs very slightly from the released chains, we rerun the model referred to as d_l3 in the public DES Y1 chains using MultiNest for an identical comparison. The experiment starts with the same initial sample guess via a short-chained emcee run that we use for our fast approximation of the DES Y1 posterior in Section 5.1, demonstrating that our approach is able to start from approximative guesses that only partially fall within the vicinity of the target posterior and are not necessarily based on calculations using the actual target in question.
Making use of Gaussbock’s innate embarrassing parallelism, we run this experiment on supercomputing facilities of the National Energy and Scientific Research Computing Center (NERSC)777https://www.nersc.gov (He2018). We run on 32 nodes of the Cori computer, for a total of 1024 cores and 2048 threads. The results below were generated in approximately two hours in this configuration, showcasing the total runtime advantage of our approach. With the runtime scaling being inversely linear with the number of cores, up to the number of samples used per iteration due to the model-fitting process not requiring a lot of time, up to 15000 cores can be used in an idealized scenario for our experimental setup to gain a further order-of-magnitude reduction. In order to make use of existing posterior implementations, we employ CosmoSIS to use Gaussbock with the DES Y1 posterior (Zuntz2015).
Table 0. \Hy@raisedlink\hyper@@anchor\@currentHrefCosmological parameters for DES Y1 data. The table shows figures of merit for common cosmological parameters used in the original DES Y1 experiments, with the latter’s implementation of MultiNest and, for comparison, the results for a highly parallel Gaussbock run.
Table 5.3 lists the cosmological parameters as estimated by both Gaussbock and its comparison baseline, meaning the fiducial MultiNest run, demonstrating a satisfactory level of agreement for both means and credible intervals. In addition to the cosmological parameters shown in the experiments for Figures 5.1 and 5.3, the table also includes the scalar spectral index and the massive neutrino density , covering the full set of cosmological parameters previously listed in Table 5.1.
Figure 5.3 shows the posterior contours for both the d_l3 rerun with MultiNest and the Gaussbock result in red and blue, respectively. Both matter and baryon density parameters, and , are shown to match the baseline computation well, whereas the Hubble parameter and scalar amplitude of density fluctuations are in reasonable agreement, but do not correctly recover the tails of the posterior distribution. An exploration of the 26-dimensional approximation shows that Gaussbock accurately models the parameters which are well-constrained, but fails to recover the tails on unconstrained parameters like and that have very broad intervals, as listed in Table 5.1. Where possible, it might help to provide narrower constraints for such parameters. In addition, Figure 5.3 shows the joint posterior of the two intrinsic alignment parameters, and in the right panel. The results demonstrate the ability of Gaussbock to recover non-Gaussian shapes of correlated parameters to a high degree of accuracy, as can be seen in the 2D posterior shapes for the fiducial MultiNest and Gaussbock runs, as well as in the agreement between histograms in the figure.
While the results are not in near-perfect agreement, as is the case for the fast truncated Gaussian approximation in Section 5.1, a trade-off between considerably reduced runtime and accuracy is to be expected analogous to the No Free Lunch Theorem in optimization (Wolpert1997). The described experiment on the full DES Y1 posterior makes use of Gaussbock’s adaptive default behavior and, for the number of samples per iteration, is based on our fast approximation, so fine-tuning to a specific application case can be expected to further improve the performance of the algorithm. Other reasons for the results not showing the same goodness of fit for all parameters, as observed in Section 5.1, are a diminished smoothness of posteriors and less Gaussian tails, which we discuss in Section 6.
5.4 Stress test for low-dimensional cases
As outlined in Section 4, KDE is a powerful density estimation technique, but faces issues in higher-dimensional problems (OBrien2016). In this experiment, we exemplify the built-in default to use KDE for problems in which , allowing Gaussbock to make use of the method most suitable to a given problem. For this purpose, we construct a posterior of three approximately equilateral triangles with a flat posterior surface, meaning that posterior values are uniform across the triangle shapes. Due to the convergence criterion of Gaussbock, which we discuss in Section 3, being geared toward the use of a VBGMM as its primary application in high-dimensional setting, we set the number of iterations to 20. We let the initial sample guess be generated automatically with the same number as for previous experiments in Section 5.1, and let Gaussbock use its default behavior for optional inputs.
The results of this low-dimensional parameter estimation experiment is shown in Figure 5.4, with 95% credible intervals for the flat-surface posterior demonstrating the ability of Gaussbock to approximate complex shapes with pronounced edges and corners. The three separate triangles are clearly reconstructed through the importance-weighted samples generated by the algorithm, validating its integrated KDE functionality for low-dimensional estimation problems.
The primary advantage of our approach is the considerable reduction in the required runtime, given a large-enough number of cores available for parallelization. This strength offers a way to tackle increasing complexities in cosmological parameter estimation for current and upcoming surveys such as LSST and Euclid (Amendola2018). Since cosmological parameter estimation efforts rely on computationally costly posterior evaluations, the embarrassing parallelization of their calculation allows for an immense speed-up in comparison to standard MCMC approaches. This reduction in total runtime comes, however, at the cost of an increase in the required core time, meaning the number of computing hours necessary to achieve suitable results. For this reason, and assuming a sufficiently costly posterior evaluation, making use of Gaussbock’s parallelization capabilities is a requirement rather than an optional feature, as demonstrated in Section 5.3.
A direct comparison to MCMC methods is a double-edged sword in that such methods, run for a very large number of steps, provide a close fit to the true posterior. The downside of MCMC approaches is that they tend to not scale well with the number of dimensions, and that they are only parallelizable over the number of walkers. This means that computationally expensive likelihoods provide an obstable to implementations such as emcee (Foreman-Mackey2013). While nested sampling methods, for example MultiNest by Skilling2006 and CosmoNest by Liddle2006, circumvent this restriction by requiring fewer posterior evaluations, it relies on assumptions about perfect and independent samples and underestimates an asymptotically Gaussian sampling error (Chopin2010).
As mentioned in Section 5.3, posteriors based on real-world survey data may have a less smooth posterior surface, which can hamper the effectiveness of the truncated SIR step used in our approach. Adjusting the ‘truncation_alpha’ input can alleviate this issue for isolated samples with higher posterior values, although a more effective solution is to increase the number of samples drawn from the posterior approximation of a given iteration of the algorithm. This approach does, in turn, require either a correspondingly larger number of cores or additional runtime. Alternatively, the initial sample guess to which the first-iteration model is fitted can be based on a longer-chain emcee run. As a result, this approach offers a better approximation of the posterior to start from, as it more closely resembles the target distribution and leads to broader coverage of relevant areas. We hope that the presented work will lead to further investigations of this and related parallelized iterative approaches to parameter estimation, alleviating the issues arising from increased compuational demands in inference based on modern surveys.
Apart from cases with sufficiently smooth posteriors and well-constrained parameters, Gaussbock also offers a way to quickly approximate a posterior to reasonable degrees. For this purpose, we recommend using either uniform-random samples from an -sphere scaled to the admissible ranges or, if feasible, samples from a better-suited distribution like an -dimensional Gaussian to provide an initial sample guess covering the posterior area. The reason for such approaches is the elimination of the need for computationally more expensive sample guess generators such as short-chained emcee runs, which require costly evaluations of the posterior. While short chains are fast in comparison to exhaustive runs of MCMC methods, runtimes should be kept to a minimum for fast approximations in order to provide an edge in speed over alternative approaches.
An additional use case pertains to lower-dimensional problems, or scenarios with posterior evaluations that are sufficiently cheap to compute, and offers a way to achieve very tight fits to posteriors that are hard to approximate and feature clean cuts, with an example given in Section 5.4 and one commonly-encountered example of such posterior shapes being truncated Gaussians. The suitability for the latter type also extends to higher dimensions, as we demonstrate with the truncated Gaussian approximation of the 26-dimensional DES Y1 posterior in Section 5.1.
Unlike in most MCMC methods, the final mixture model is an optional output of our implementation, which can be saved and used again at a later point. It can act as an approximate but analytic description of the posterior, allowing, for example, the subsequent drawing of an arbitrary number of samples for which importance weights can be calculated and which can be easily disseminated. In this context, our approach offers a way to easily exchange and compare posterior approximations based on different datasets, with mixture models whose components can be combined.
For common problems faced by contemporary research in cosmology, Gaussbock offers a considerable speed-up. This is especially relevant for upcoming missions with larger numbers of parameters, for which our approach provides a way to quickly compute high-fidelity posterior approximations and the underlying mixture model. While, in this work, we use a wrapper to run Gaussbock through CosmoSIS on NERSC facilities, a complete integration into CosmoSIS will further enhance the ease of access to our methodology.
In terms of its internal functionality, our approach inherently lends itself to combating issues with defaulting cores, as the failure or a subset of processes to return importance values can be safely ignored. The respective parameter sets can simply be omitted from the set of samples used to approximate the posterior in a given iteration, using the large-enough amount of remaining parameter sets to fit the model in a given iteration. While the capability to do so is not part of our implementation and is primarily of interest for large-scale cloud computing, our code easily lends itself to being extended toward this safety redundancy.
In this paper, we introduce and apply Gaussbock, a novel approach to cosmological parameter estimation that makes use of recent advances in machine learning and statistics. By coupling variational Bayesian GMMs with a truncation-based extension of importance sampling in an iterative approach with a convergence criterion, our method offers an embarrassingly parallel way to achieve high-speed parameter estimation for problems with computational expensive posterior calculations.
We initially test Gaussbock on a fast approximation of the DES Y1 posterior to demonstrate its capabilities on a high-dimensional realistic example, and to investigate scaling relations and the effectiveness of the convergence criterion, both of which prove to be well-behaved. We then apply our method to the full DES Y1 posterior, making use of Gaussbock’s built-in MPI capabilties to run it on NERSC supercomputing facilities. The results showcase the immense speed-up that constitutes the primary strength of our method, achieving a good fit to the original DES approach of using MultiNest. While achieving excellent fits in most cases across our experiments, we observe that less Gaussian posteriors of unconstrained parameters result in a slightly worse fit to the tails of the distribution and discuss the potential issues arising from less smooth posterior surfaces. In addition, we stress-test the optional low-dimensional KDE-based functionality of Gaussbock in order to demonstrate the ability to achieve tight fits to hard-to-approximate posteriors.
We implement Gaussbock as a pure-Python package to conduct our experiments described in this paper. In doing so, we also provide the astronomy community with a user-friendly and readily installable implementation of Gaussbock, bearing the same name. While our method is developed specifically with contemporary parameter estimation problems in cosmology in mind, it represents a general-purpose inference tool applicable to many problems dealing with high-dimensional parameter estimation with computationally costly posteriors. As a result, our work contributes to the wider field of estimation theory in addition to current and upcoming astronomical surveys.
We thank members of the Dark Energy Survey collaboration for making DES data publicly available. BM acknowledges the support of a Principal’s Career Development Scholarship from the University of Edinburgh. JZ acknowledges a Chancellor’s Fellowship, also from the University of Edinburgh.