Recent advances in probabilistic modelling have led to a large number of simulation-based inference algorithms which do not require numerical evaluation of likelihoods. However, a public benchmark with appropriate performance metrics for such ‘likelihood-free’ algorithms has been lacking. This has made it difficult to compare algorithms and identify their strengths and weaknesses. We set out to fill this gap: We provide a benchmark with inference tasks and suitable performance metrics, with an initial selection of algorithms including recent approaches employing neural networks and classical Approximate Bayesian Computation methods. We found that the choice of performance metric is critical, that even state-of-the-art algorithms have substantial room for improvement, and that sequential estimation improves sample efficiency. Neural network-based approaches generally exhibit better performance, but there is no uniformly best algorithm. We provide practical advice and highlight the potential of the benchmark to diagnose problems and improve algorithms. The results can be explored interactively on a companion website. All code is open source, making it possible to contribute further benchmark tasks and inference algorithms.
Jan-Matthis Lueckmann, Jan Boelts, David S. Greenberg, Pedro J. Gonçalves, Jakob H. Macke
Many domains of science, engineering, and economics make extensive use of models implemented as stochastic numerical simulators (gourieroux1993indirect; ratmann2007using; alsing2018; brehmer2018constraining; karabatsos2018approximate; gonccalves2019training). A key challenge when studying and validating such simulation-based models is the statistical identification of parameters which are consistent with observed data. In many cases, calculation of the likelihood is intractable or impractical, rendering conventional statistical approaches inapplicable. The goal of simulation-based inference (SBI), also known as ‘likelihood-free inference’, is to perform Bayesian inference without requiring numerical evaluation of the likelihood function (sisson2018_chapter1; cranmer2019). In SBI, it is generally not required that the simulator is differentiable, nor that one has access to its internal random variables.
In recent years, several new SBI algorithms have been developed (e.g., gutmann2015; papamakarios2016; lueckmann2017; chan2018; greenberg2019; papamakarios2019a; prangle2019distilling; brehmer2020a; hermans2019; jarvenpaa2020; picchini2020; rodrigues2020; thomas2020), energized, in part, by advances in probabilistic machine learning (rezende2016; papamakarios2017; papamakarios2019c). Despite, or possibly because of, these rapid and exciting developments, it is currently difficult to assess how different approaches relate to each other theoretically and empirically: First, different studies often use different tasks and metrics for comparison, and comprehensive comparisons on multiple tasks and simulation budgets are rare. Second, some commonly employed metrics might not be appropriate or might be biased through the choice of hyperparameters. Third, the absence of a benchmark has made it necessary to reimplement tasks and algorithms for each new study. This practice is wasteful, and makes it hard to rapidly evaluate the potential of new algorithms. Overall, it is difficult to discern the most promising approaches and to decide which algorithm to use when. These problems are exacerbated by the interdisciplinary nature of research on SBI, which has led to the independent development and co-existence of closely related algorithms in different disciplines.
There are many exciting challenges and opportunities ahead, such as the scaling of these algorithms to high-dimensional data, active selection of simulations, and gray-box settings, as outlined in cranmer2019. To tackle such challenges, researchers will need an extensible framework to compare existing algorithms and test novel ideas. Carefully curated, a benchmark framework will make it easier for researchers to enter SBI research, and will fuel the development of new algorithms through community involvement, exchange of expertise and collaboration. Furthermore, benchmarking results could help practitioners to decide which algorithm to use on a given problem of interest, and thereby contribute to the dissemination of SBI.
The catalyzing effect of benchmarks has been evident, e.g., in computer vision (russakovsky2015imagenet), speech recognition (hirsch2000aurora; wang2018glue), reinforcement learning (bellemare2013arcade; duan2016benchmarking), Bayesian deep learning (filos2019systematic; wenzel2020), and many other fields drawing on machine learning. Open benchmarks can be an important component of transparent and reproducible computational research. Surprisingly, a benchmark framework for SBI has been lacking, possibly due to the challenging endeavor of designing benchmarking tasks and defining suitable performance metrics.
Here, we begin to address this challenge, and provide a benchmark framework for SBI to allow rapid and transparent comparisons of current and future SBI algorithms: First, we selected a set of initial algorithms representing distinct approaches to SBI (Fig. 1; cranmer2019). Second, we analyzed multiple performance metrics which have been used in the SBI literature. Third, we implemented ten tasks including tasks popular in the field. The shortcomings of commonly used metrics led us to focus on tasks for which a likelihood can be evaluated, which allowed us to calculate reference (‘ground-truth’) posteriors. These reference posteriors are made available to allow rapid evaluation of SBI algorithms. Code for the framework is available at github.com/sbi-benchmark/sbibm and we maintain an interactive version of all results at sbi-benchmark.github.io.
The full potential of the benchmark will be realized when it is populated with additional community-contributed algorithms and tasks. However, our initial version already provides useful insights: 1) the choice of performance metric is critical; 2) the performance of the algorithms on some tasks leaves substantial room for improvement; 3) sequential estimation generally improves sample efficiency; 4) for small and moderate simulation budgets, neural-network based approaches outperform classical ABC algorithms, confirming recent progress in the field; and 5) there is no algorithm to rule them all. The performance ranking of algorithms is task-dependent, pointing to a need for better guidance or automated procedures for choosing which algorithm to use when. We highlight examples of how the benchmark can be used to diagnose shortcomings of algorithms and facilitate improvements. We end with a discussion of the limitations of the benchmark.
The benchmark consists of a set of algorithms, performance metrics and tasks. Given a prior over parameters $p(\theta)$, a simulator to sample $x \sim p(x \mid \theta)$, and an observation $x_o$, an algorithm returns an approximate posterior $q(\theta \mid x_o)$, or samples from it, $\theta \sim q(\theta \mid x_o)$. The approximate solution is tested, according to a performance metric, against a reference posterior $p(\theta \mid x_o)$.
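To make this setup concrete, the following toy sketch instantiates the interface with a 1-d Gaussian prior and simulator, using plain rejection sampling as a stand-in algorithm; all names and settings are illustrative and not part of the benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance of the benchmark interface (all names and numbers illustrative):
# a 1-d Gaussian prior, a Gaussian simulator, and a single observation x_o.
def prior_sample(n):
    return rng.normal(0.0, 1.0, size=n)                 # theta ~ p(theta) = N(0, 1)

def simulator(theta):
    return theta + 0.5 * rng.normal(size=theta.shape)   # x | theta ~ N(theta, 0.5^2)

theta_o = 0.8                                           # 'true' parameter
x_o = theta_o + 0.5 * rng.normal()                      # observation

# An 'algorithm' maps (prior, simulator, x_o) to samples from an approximate
# posterior q(theta | x_o); here plain rejection, purely for illustration.
def infer(num_simulations=50_000, eps=0.05):
    theta = prior_sample(num_simulations)
    x = simulator(theta)
    return theta[np.abs(x - x_o) < eps]

posterior_samples = infer()
# For this conjugate model the exact posterior is N(0.8 * x_o, 0.2),
# so the accepted samples should concentrate around 0.8 * x_o.
```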
Following the classification introduced in the review by cranmer2019, we selected algorithms addressing SBI in four distinct ways, as schematically depicted in Fig. 1. An important difference between algorithms is how new simulations are acquired: Sequential algorithms adaptively choose informative simulations to increase sample efficiency. While this can be crucial for expensive simulators, it can require non-trivial algorithmic steps and hyperparameter choices. To evaluate whether this potential is realized empirically and justifies the algorithmic burden, we included sequential and non-sequential counterparts for algorithms of each category.
Keeping our initial selection focused allowed us to carefully consider implementation details and hyperparameters: We extensively explored performance and sensitivity to different choices in more than 10k runs, all results and details of which can be found in Appendix H. Our selection is briefly described below, full algorithm details are in Appendix A.
REJ-ABC and SMC-ABC. Approximate Bayesian Computation (ABC, sisson2018_chapter1) is centered around the idea of Monte Carlo rejection sampling (tavere1997; pritchard1999). Parameters $\theta$ are sampled from a proposal distribution, simulation outcomes $x$ are compared with observed data $x_o$, and parameters are accepted or rejected depending on a (user-specified) distance function and rejection criterion. While rejection ABC (REJ-ABC) uses the prior as the proposal distribution, efficiency can be improved by using sequentially refined proposal distributions (SMC-ABC, beaumont2002; marjoram2006; sisson2007; toni2009; beaumont2009). We implemented REJ-ABC with quantile-based rejection and used the scheme of beaumont2009 for SMC-ABC. We extensively varied hyperparameters and compared the implementation of an ABC toolbox (klinger2018) against our own (Appendix H). We investigated linear regression adjustment (blum2010) and the summary statistics approach by prangle2014semi (Suppl. Fig. 1).
NLE and SNLE. Likelihood estimation (or ‘synthetic likelihood’) algorithms learn an approximation to the intractable likelihood; for an overview see sisson2018_chapter12. While early incarnations focused on Gaussian approximations (SL; wood2010), recent versions utilize deep neural networks (papamakarios2019a; lueckmann2019) to approximate a density over data conditioned on parameters, $q(x \mid \theta)$, followed by MCMC to obtain posterior samples. Since we primarily focused on these latter versions, we refer to them as neural likelihood estimation (NLE) algorithms, and denote the sequential variant with proposals as SNLE. In particular, we used the scheme proposed by papamakarios2019a which uses masked autoregressive flows (MAFs, papamakarios2017) for density estimation. We improved MCMC sampling for (S)NLE and compared MAFs against Neural Spline Flows (NSFs; durkan2019neural), see Appendix H.
NPE and SNPE. Instead of approximating the likelihood, these approaches directly target the posterior. Their origins date back to regression adjustment approaches (blum2010). Modern variants (papamakarios2016; lueckmann2017; greenberg2019) use neural networks for density estimation, approximating a density over parameters conditioned on data, $q(\theta \mid x)$. Here, we used the recent algorithmic approach proposed by greenberg2019 for sequential acquisitions. We report performance using NSFs for density estimation, which outperformed MAFs (Appendix H).
NRE and SNRE. Ratio Estimation approaches to SBI use classifiers to approximate density ratios (izbicki2014high; pham2014note; cranmer2015; dutta2016likelihood; durkan2020; thomas2020). Here, we used the recent approach proposed by hermans2019 as implemented in durkan2020: A neural network-based classifier approximates probability ratios and MCMC is used to obtain samples from the posterior. SNRE denotes the sequential variant of neural ratio estimation (NRE). In Appendix H we compare different classifier architectures for (S)NRE.
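As a rough illustration of the ratio-estimation idea (not the neural implementation used in the benchmark), the sketch below trains a linear classifier with a hand-crafted interaction feature to discriminate joint from shuffled parameter-data pairs on a toy 1-d model; its logit then estimates the log probability ratio that MCMC would use:

```python
import numpy as np

rng = np.random.default_rng(5)

# A classifier trained to discriminate joint pairs (theta, x) ~ p(theta, x)
# from shuffled pairs (theta, x) ~ p(theta) p(x) has a logit that estimates
# log[p(theta, x) / (p(theta) p(x))] = log[p(x | theta) / p(x)].
n = 5_000
theta = rng.normal(0.0, 1.0, n)
x = theta + 0.5 * rng.normal(size=n)                      # toy simulator
joint = np.stack([theta, x], axis=1)
marginal = np.stack([theta, rng.permutation(x)], axis=1)  # dependence broken

X = np.concatenate([joint, marginal])
X = np.concatenate([X, X[:, :1] * X[:, 1:]], axis=1)      # theta * x feature
y = np.concatenate([np.ones(n), np.zeros(n)])             # 1 = joint, 0 = marginal

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):                                      # logistic regression (GD)
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * X.T @ (p - y) / len(X)
    b -= 0.5 * np.mean(p - y)

def logit(t, xv):
    return np.array([t, xv, t * xv]) @ w + b

# Consistent (theta, x) pairs receive higher estimated log-ratios
# than inconsistent ones, e.g. logit(1, 1) > logit(1, -1).
```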
In addition to these eight approaches, we benchmarked Random Forest ABC (RF-ABC; raynal2018), a recent ABC variant, and Synthetic Likelihood (SL; wood2010), mentioned above. However, RF-ABC only targets individual parameters (i.e. assumes posteriors to factorize), and SL requires new simulations for every MCMC step, thus requiring orders of magnitude more simulations than other algorithms. Therefore, we report results for these algorithms separately, in Suppl. Fig. 2 and Suppl. Fig. 3, respectively.
Algorithms can be grouped with respect to how their output is represented: 1) some return samples from the posterior (REJ-ABC, SMC-ABC); 2) others return samples and allow evaluation of unnormalized posteriors ((S)NLE, (S)NRE); and 3) for some, the posterior density can be evaluated and sampled directly, without MCMC ((S)NPE). As discussed below, these properties constrain the metrics that can be used for comparison.
2.2 Performance metrics
Choice of a suitable performance metric is central to any benchmark. As the goal of SBI algorithms is to perform full inference, the ‘gold standard’ would be to quantify the similarity between the true posterior and the inferred one with a suitable distance (or divergence) measure on probability distributions. This would require both access to the ground-truth posterior and a reliable means of estimating similarity between (potentially) richly structured distributions. Several performance metrics have been used in past research, depending on the constraints imposed by knowledge about the ground truth and the inference algorithm (see Table 1). In real-world applications, typically only the observation $x_o$ is known. However, in a benchmarking setting, it is reasonable to assume that one has at least access to the ground-truth parameters $\theta_o$. There are two commonly used metrics which only require $\theta_o$ and $x_o$, but suffer severe drawbacks for our purposes:
Probability of $\theta_o$. The negative log probability of true parameters averaged over different observations, $-\mathbb{E}\left[\log q(\theta_o \mid x_o)\right]$, has been used extensively in the literature (papamakarios2016; durkan2018; greenberg2019; papamakarios2019a; durkan2020; hermans2019). Its appeal lies in the fact that one does not need access to the ground-truth posterior. However, using it for only a small set of observations is highly problematic: It is only a valid performance measure if averaged over a large set of observations $x_o$ generated from parameters sampled from the prior (talts2018; detailed discussion including the connection to simulation-based calibration in Appendix M). For reliable results, one would require inference for hundreds of observations, which is only feasible if inference is rapid (amortized) and the density can be evaluated directly (among the algorithms used here this applies only to NPE).
Table 1: Applicable metrics (legend below), depending on the representation of the algorithm’s posterior (rows) and the available ground-truth information (columns).

| Algorithm returns | $x_o$ only | + $\theta_o$ | + posterior samples | + unnormalized posterior | + posterior density |
| samples | 1 | 1 | 1, 3 | 1, 3, 4 | 1, 3, 4 |
| samples + unnormalized density | 1 | 1 | 1, 3 | 1, 3, 4 | 1, 3, 4 |
| samples + density | 1 | 1, 2 | 1, 2, 3 | 1, 2, 3, 4 | 1, 2, 3, 4, 5 |

1 = PPCs, 2 = Probability of $\theta_o$, 3 = 2-sample tests, 4 = 1-sample tests, 5 = $f$-divergences.
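The metric above can be illustrated numerically: the sketch below scores an exact and an overconfident Gaussian posterior by the negative log probability of the true parameters, on a hypothetical conjugate 1-d model (all settings illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: theta_o ~ N(0, 1) and x_o | theta_o ~ N(theta_o, 1),
# so the exact posterior is N(x_o / 2, 1/2).
M = 500
theta_o = rng.normal(0.0, 1.0, M)
x_o = theta_o + rng.normal(0.0, 1.0, M)

def gauss_logpdf(theta, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (theta - mean) ** 2 / (2 * std**2)

# Negative log probability of the true parameters under q(theta | x_o),
# averaged over M observations; small M makes this estimate unreliable.
def neg_log_prob(post_mean, post_std):
    return -gauss_logpdf(theta_o, post_mean, post_std).mean()

exact = neg_log_prob(x_o / 2, np.sqrt(0.5))
overconfident = neg_log_prob(x_o / 2, 0.1)   # posterior that is far too narrow
# The overconfident approximation is penalized heavily relative to the exact
# posterior, but only the average over many observations is reliable.
```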
Posterior-Predictive Checks (PPCs). As the name implies, PPCs should be considered a mere check rather than a metric, although the median distance between predictive samples and $x_o$ has been reported in the SBI literature (papamakarios2019a; greenberg2019; durkan2020). A failure mode of such a metric is that an algorithm obtaining a good MAP point estimate could pass this check perfectly even if the estimated posterior is poor. Empirically, we found median distances (MEDDIST) to be in disagreement with other metrics (see 3 Results).
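This failure mode can be made explicit with a small sketch (toy model, all settings illustrative): a ‘posterior’ collapsed onto a good point estimate achieves a median distance at least as good as a well-calibrated one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: x | theta ~ N(theta, 1), with observed x_o = 0.
x_o = 0.0

def simulate(theta):
    return theta + rng.normal(size=theta.shape)

def meddist(posterior_samples):
    """Median distance between posterior-predictive samples and x_o."""
    x_pred = simulate(posterior_samples)
    return np.median(np.abs(x_pred - x_o))

good_posterior = rng.normal(0.0, 0.7, 10_000)   # roughly correct spread
collapsed = np.zeros(10_000)                    # point mass on the MAP

md_good = meddist(good_posterior)
md_collapsed = meddist(collapsed)
# The collapsed 'posterior' scores better, although it is a far worse
# approximation: a good point estimate is enough to pass this check.
```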
The shortcomings of these commonly-used metrics led us to focus on tasks for which it is possible to obtain samples from the ground-truth posterior $p(\theta \mid x_o)$, thus allowing us to use metrics based on two-sample tests:
Maximum Mean Discrepancy (MMD). MMD (gretton2012; sutherland2017) is a kernel-based 2-sample test. Recent papers (papamakarios2019a; greenberg2019; hermans2019) reported MMD using translation-invariant Gaussian kernels with length scales determined by the median heuristic (ramdas2015). We empirically found that MMD can be sensitive to hyperparameter choices, in particular on posteriors with multiple modes and multiple length scales (see 3 Results and liu2020).
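For intuition, a minimal 1-d implementation of MMD with a Gaussian kernel and the median heuristic might look as follows (an illustrative sketch, not the benchmark's estimator):

```python
import numpy as np

rng = np.random.default_rng(3)

def mmd2(x, y, lengthscale):
    """Biased MMD^2 estimate with a Gaussian kernel, for 1-d samples."""
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * lengthscale**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def median_heuristic(x, y):
    """Median of all pairwise distances, used to set the kernel length scale."""
    z = np.concatenate([x, y])
    return np.median(np.abs(z[:, None] - z[None, :]))

reference = rng.normal(0.0, 1.0, 500)   # stand-in for reference posterior samples
q_good = rng.normal(0.0, 1.0, 500)      # matching approximation
q_bad = rng.normal(2.0, 1.0, 500)       # shifted approximation

mmd_good = mmd2(reference, q_good, median_heuristic(reference, q_good))
mmd_bad = mmd2(reference, q_bad, median_heuristic(reference, q_bad))
# mmd_bad >> mmd_good here; on multimodal posteriors, however, the median
# heuristic can pick a length scale at which distinct posteriors look similar.
```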
Classifier 2-Sample Tests (C2ST). C2STs (friedman2004; lopez-paz2018) train a classifier to discriminate samples from the true and inferred posteriors, which makes them simple to apply and easy to interpret. We therefore prefer to report and compare algorithms in terms of accuracy in classification-based tests. In the context of SBI, C2ST has been used, e.g., in gutmann2018likelihood and dalmasso2019a.
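The idea can be sketched with plain logistic regression standing in for the MLP classifier used in the benchmark (1-d samples, illustrative settings): accuracy near 0.5 means the two sample sets are indistinguishable, accuracy near 1.0 means they are easily separated.

```python
import numpy as np

def c2st(p_samples, q_samples, folds=5, epochs=200, lr=0.5):
    """Classifier 2-sample test on 1-d samples via cross-validated accuracy,
    using logistic regression as a simple stand-in for an MLP classifier."""
    x = np.concatenate([p_samples, q_samples])[:, None]
    y = np.concatenate([np.zeros(len(p_samples)), np.ones(len(q_samples))])
    idx = np.random.default_rng(0).permutation(len(x))
    x, y = x[idx], y[idx]
    accuracies = []
    for f in range(folds):
        test = np.arange(len(x)) % folds == f
        xtr, ytr, xte, yte = x[~test], y[~test], x[test], y[test]
        w, b = np.zeros(x.shape[1]), 0.0
        for _ in range(epochs):  # plain gradient descent
            prob = 1.0 / (1.0 + np.exp(-(xtr @ w + b)))
            w -= lr * xtr.T @ (prob - ytr) / len(xtr)
            b -= lr * np.mean(prob - ytr)
        accuracies.append(np.mean(((xte @ w + b) > 0) == (yte == 1)))
    return float(np.mean(accuracies))

rng = np.random.default_rng(4)
acc_same = c2st(rng.normal(0, 1, 1000), rng.normal(0, 1, 1000))       # near 0.5
acc_shifted = c2st(rng.normal(0, 1, 1000), rng.normal(1.5, 1, 1000))  # clearly above 0.5
```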
Other metrics that could be used include:
Kernelized Stein Discrepancy (KSD). KSD (liu2016; chwialkowski2016) is a 1-sample test which requires access to $\nabla_\theta \log \tilde{p}(\theta \mid x_o)$ rather than samples from $p(\theta \mid x_o)$, where $\tilde{p}$ denotes the unnormalized posterior. Like MMD, current estimators use translation-invariant kernels.
$f$-Divergences. Divergences such as the Total Variation (TV) divergence and the KL divergence can only be computed when the densities of both the true and the approximate posterior can be evaluated (Table 1). Thus, we did not use $f$-divergences for the benchmark.
A full discussion and details of all metrics are provided in Appendix M.
The preceding considerations guided our selection of inference tasks: We focused on tasks for which reference posterior samples can be obtained, to allow calculation of 2-sample tests. We focused on eight purely statistical problems and two problems relevant in applied domains, with diverse dimensionalities of parameters and data (details in Appendix T):
Gaussian Linear/Gaussian Linear Uniform. We included two versions of simple, linear, 10-d Gaussian models, in which the parameter is the mean, and the covariance is fixed. The first version has a Gaussian (conjugate) prior, the second one a uniform prior. These tasks allow us to test how algorithms deal with trivial scaling of dimensionality, as well as truncated support.
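For the conjugate version, the reference posterior is available in closed form; a sketch of the standard Gaussian conjugacy computation (with illustrative covariances, not necessarily the benchmark's exact settings) is:

```python
import numpy as np

# Analytic posterior for a conjugate Gaussian model:
# prior theta ~ N(m0, S0), likelihood x | theta ~ N(theta, S).
def gaussian_posterior(x_o, m0, S0, S):
    precision = np.linalg.inv(S0) + np.linalg.inv(S)
    cov = np.linalg.inv(precision)
    mean = cov @ (np.linalg.inv(S0) @ m0 + np.linalg.inv(S) @ x_o)
    return mean, cov

d = 10
m0, S0, S = np.zeros(d), np.eye(d), 0.1 * np.eye(d)
x_o = np.ones(d)
mean, cov = gaussian_posterior(x_o, m0, S0, S)
# Here the posterior mean is (10/11) * x_o and the covariance is I/11:
# the posterior concentrates near x_o because the likelihood is much
# tighter than the prior.
```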
SLCP/SLCP Distractors. A challenging inference task designed to have a simple likelihood and a complex posterior (papamakarios2019a; greenberg2019): The prior is uniform over five parameters $\theta$, and the data are a set of four two-dimensional points sampled from a Gaussian likelihood whose mean and variance are nonlinear functions of $\theta$. This induces a complex posterior with four symmetrical modes and vertical cut-offs. We included a second version with 92 additional, non-informative output dimensions (distractors) to investigate the ability of SBI algorithms to detect informative features.
Bernoulli GLM/Bernoulli GLM Raw. 10-parameter Generalized Linear Model (GLM) with Bernoulli observations. Inference was either performed on sufficient statistics (10-d) or raw data (100-d).
Gaussian Mixture. This inference task, introduced by sisson2007, has become common in the ABC literature (beaumont2009; toni2009; simola2020). It consists of a mixture of two two-dimensional Gaussian distributions, one with much broader covariance than the other.
Two Moons. A two-dimensional task with a posterior that exhibits both global (bimodality) and local (crescent shape) structure (greenberg2019) to illustrate how algorithms deal with multimodality.
SIR. Dynamical systems represent paradigmatic use cases for SBI. SIR is an influential epidemiological model describing the dynamics of the number of individuals in three possible states: susceptible $S$, infectious $I$, and recovered or deceased $R$. We infer the contact rate $\beta$ and the mean recovery rate $\gamma$, given observed infection counts at 10 evenly-spaced time points.
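A minimal version of such a simulator might look as follows (Euler integration; population size, rates, horizon, and step size are illustrative rather than the benchmark's exact settings):

```python
import numpy as np

# Minimal SIR sketch with Euler integration (all settings illustrative).
def simulate_sir(beta, gamma, N=1_000_000, I0=1.0, T=160.0, n_obs=10, dt=0.1):
    S, I = N - I0, I0
    steps = int(T / dt)
    obs_every = steps // n_obs
    infections = []
    for t in range(1, steps + 1):
        new_infections = beta * S * I / N
        S += dt * (-new_infections)
        I += dt * (new_infections - gamma * I)
        # R = N - S - I (recovered/deceased; not observed here)
        if t % obs_every == 0:
            infections.append(I)  # infection counts at n_obs time points
    return np.array(infections)

x = simulate_sir(beta=0.4, gamma=0.125)  # contact rate beta, recovery rate gamma
# x rises, peaks, and declines: the characteristic epidemic curve that the
# inferred (beta, gamma) must reproduce.
```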
Lotka-Volterra. An influential model in ecology describing the dynamics of two interacting species, widely used in SBI studies. We infer four parameters related to species interaction, given the number of individuals in both populations at 10 evenly-spaced points in time.
2.4 Experimental Setup
For each task, we sampled 10 sets of true parameters $\theta_o$ from the prior and generated corresponding observations $x_o$. For each observation, we generated 10k samples from the reference posterior. Some reference posteriors required a customised (likelihood-based) approach (Appendix B).
In SBI, it is typically assumed that total computation cost is dominated by simulation time. We therefore report performance at different simulation budgets. For each observation, each algorithm was run with a simulation budget ranging from 1k to 100k simulations.
For each run, we calculated the metrics described above. To estimate C2ST accuracy, we trained a multilayer perceptron to tell apart approximate and reference posterior samples and performed five-fold cross-validation. We used two hidden layers, each with 10 times as many ReLU units as the dimensionality of the data. We also measured and report runtimes (Appendix R).
Code. All code is released publicly at github.com/sbi-benchmark/sbibm. Our framework includes tasks, reference posteriors, metrics, plotting, and infrastructure tooling and is designed to be 1) easily extensible, 2) used with external toolboxes implementing algorithms. All tasks are implemented as probabilistic programs in Pyro (bingham2018), so that likelihoods and gradients for reference posteriors can be extracted automatically. To make this possible for tasks that use ODEs, we developed a new interface between DifferentialEquations.jl (rackauckas2017; bezanson2017julia) and PyTorch (paszke2019). In addition, specifying simulators in a probabilistic programming language has the advantage that ‘gray-box’ algorithms (brehmer2020a; cranmer2019) can be added in the future. We here evaluated algorithms implemented in pyABC (klinger2018), pyabcranger (collin2020), and sbi (sbi). See Appendix B for additional details on our code and existing SBI toolboxes.
Reproducibility. Instructions to reproduce experiments on cloud-based infrastructure are in Appendix B.
We first consider empirical results on a single task, Two Moons, according to different metrics, which illustrate the following important insight:
#1: Choice of performance metric is key. While C2ST results on Two Moons show that performance increases with higher simulation budgets and that sequential algorithms outperform non-sequential ones for low to medium budgets, these results were not reflected in MMD and MEDDIST (Fig. 2): In our analyses, we found MMD to be sensitive to hyperparameter choices, in particular on tasks with complex posterior structure. When using the commonly employed median heuristic to set the kernel length scale on a task with multi-modal posteriors (like Two Moons), MMD had difficulty discerning markedly different posteriors. This can be ‘fixed’ by using hyperparameters adapted to the task (Suppl. Fig. 4). As discussed above, the median distance (though commonly used) can be ‘gamed’ by a good point estimate even if the estimated posterior is poor and is thus not a suitable performance metric. Computation of KSD showed numerical problems on Two Moons, due to the gradient calculation.
We assessed relationships between metrics empirically via the correlations across tasks (Suppl. Fig. 5). As discussed above, the log-probability of ground-truth parameters can be problematic when averaged over too few observations (e.g., 10, as is common in the literature): indeed, this metric had a correlation of only 0.3 with C2ST on Two Moons and 0.6 on the SLCP task. Based on these considerations, we used C2ST for reporting performance (Fig. 3; results for MMD, KSD and median distance on the website).
Based on the comparison of the performance across all tasks, we highlight the following main points:
#2: These are not solved problems. C2ST uses an interpretable scale (1 to 0.5), which makes it possible to conclude that, for several tasks, no algorithm could solve them with the specified budget (e.g., SLCP, Lotka-Volterra). This highlights that our problems—though conceptually simple—are challenging, and there is room for development of more powerful algorithms.
#3: Sequential estimation improves sample efficiency. Our results show that sequential algorithms outperform non-sequential ones (Fig. 3). The difference was small on simple tasks (i.e. linear Gaussian cases), yet pronounced on most others. However, we also found these methods to exhibit diminishing returns as the simulation budget grows, which points to an opportunity for future improvements.
#4: Density or ratio estimation-based algorithms generally outperform classical techniques. REJ-ABC and SMC-ABC were generally outperformed by more recent techniques which use neural networks for density or ratio estimation, and which can therefore efficiently interpolate between different simulations (Fig. 3). Without such model-based interpolation, even a simple 10-d Gaussian task can be challenging. However, classical rejection-based methods have a computational footprint that is orders of magnitude smaller, as no network training is involved (Appendix R). Thus, on low-dimensional problems and for cheap simulators, these methods can still be competitive. See Suppl. Fig. 1 for results with additional ABC variants (blum2010; prangle2014semi) and Suppl. Fig. 2 for results on RF-ABC.
#5: No one algorithm to rule them all. Although sequential density or ratio estimation-based algorithms performed better than their non-sequential counterparts, there was no clear-cut answer as to which sequential method (SNLE, SNRE, and SNPE) should be preferred. To some degree, this is to be expected: these algorithms have distinct strengths that can play out differently depending on the problem structure (see discussions e.g., in greenberg2019; durkan2018; durkan2020). However, this has not been shown systematically before. We formulate some practical guidelines for choosing appropriate algorithms in Appendix P.
#6: The benchmark can be used to diagnose implementation issues and improve algorithms. For example, (S)NLE and (S)NRE rely on MCMC sampling to compute posteriors, and this sampling step can limit the performance. Access to a reference posterior can help identify and improve such issues: We found that single chains initialized by sampling from the prior with axis-aligned slice sampling (as introduced in papamakarios2019a) frequently got stuck in single modes. Based on this observation, we changed the MCMC strategy (details in Appendix A), which, though simple, yielded significant performance and speed improvements on the benchmark tasks. Similarly, (S)NLE and (S)NRE improved by transforming parameters to be unbounded: Without transformations, runs on some tasks can get stuck during MCMC sampling (e.g., Lotka-Volterra). While this is common advice for MCMC (hogg2017), it has been lacking in code and papers of SBI approaches.
We used the benchmark to systematically compare hyperparameters: For example, as density estimators for (S)NLE and (S)NPE, we used NSFs (durkan2020) which were developed after these algorithms were published. This revealed that higher capacity density estimators were beneficial for posterior but not likelihood estimation (detailed analysis in Appendix H).
These examples show how the benchmark makes it possible to diagnose problems and improve algorithms.
Our benchmark, in its current form, has several limitations. First, the algorithms considered here do not cover the entire spectrum of SBI algorithms: We did not include sequential algorithms using active learning or Bayesian Optimization (gutmann2015; jarvenpaa2019; lueckmann2019; aushev2020), or ‘gray-box’ algorithms, which use additional information about or from the simulator (e.g., baydin2019etalumis; brehmer2020a). We focused on approaches using neural networks for density estimation and did not compare to alternatives using Gaussian Processes (e.g., meeds2014; wilkinson2014). There are many other algorithms which the benchmark is currently lacking (e.g., nott2014; ong2018; clarte2019; prangle2019distilling; priddle2019; picchini2020; radev2020bayesflow; rodrigues2020). Keeping our initial selection small allowed us to carefully investigate hyperparameter choices. We focused on sequential algorithms with less sophisticated acquisition schemes and the black-box scenario, since we think these are important baselines for future comparisons.
Second, the tasks we considered do not cover the variety of possible challenges. Notably, while we have tasks with high dimensional data with and without structure, we have not included tasks with high-dimensional spatial structure, e.g., images. Such tasks would require algorithms that automatically learn summary statistics while exploring the structure of the data (e.g., dinev2018; greenberg2019; hermans2019; chen2020neural), an active research area.
Third, while we extensively investigated tuning choices and compared implementations, the results might nevertheless reflect our own areas of expertise.
Fourth, in line with common practice in SBI, results presented in the paper focused on performance as a function of the number of simulation calls. It is important to remember that differences in computation time can be substantial (see Appendix R): For example, the ABC algorithms were much faster than approaches requiring network training. Overall, sequential neural algorithms exhibited the longest runtimes.
Fifth, for reasons described above, we focused on problems for which reference posteriors can be computed. This raises the question of how insights on these problems will generalize to ‘real-world’ simulators. Notably, even these simple problems already identify clear differences between, and limitations of, different SBI approaches. Since it is not possible to rigorously compare the performance of different algorithms directly on ‘real-world’ simulators due to the lack of appropriate metrics, we see the benchmark as a necessary stepping stone towards the development of (potentially automated) selection strategies for practical problems.
Sixth, in practice, the choice of algorithm can depend on aspects that are difficult to quantify: It will depend on the available information about a problem, the inference goal, and the speed of the simulator, among other considerations. We included some practical considerations and recommendations in Appendix P.
Finally, benchmarking is an important tool, but not an end in itself—for example, conceptually new ideas might initially not yield competitive results but only reveal their true value later. Conversely, ‘overfitting’ on benchmarks can lead to the illusion of progress, and result in an undue focus on small implementation details which might not generalize beyond it. It would certainly be possible to cheat on this benchmark: In particular, as the simulators are available, one could use samples (or even likelihoods) to excessively tune hyperparameters for each task. This would hardly transfer to practice where such tuning is usually impossible (lack of metrics and expensive simulators). Therefore, we carefully compared choices and selected hyperparameters performing best across tasks (Appendix H).
Quantitatively evaluating, comparing and improving algorithms through benchmarking is at the core of progress in machine learning. We here provided an initial benchmark for simulation-based inference. If used sensibly, it will be an important tool for clarifying and expediting progress in SBI. We hope that the current results on multiple widely-used algorithms already provide insights into the state of the field, assist researchers with algorithm development, and that our recommendations for practitioners will help them in selecting appropriate algorithms.
We believe that the full potential of the benchmark will be revealed as more researchers participate and contribute. To facilitate this process, and allow users to quickly explore and compare algorithms, we are providing precomputed reference posteriors, a website (sbi-benchmark.github.io), and open-source code (github.com/sbi-benchmark/sbibm).
We thank Álvaro Tejero-Cantero, Auguste Schulz, Conor Durkan, François Lanusse, Leandra White, Marcel Nonnenmacher, Michael Deistler, Pedro Rodrigues, Poornima Ramesh, Sören Becker and Theofanis Karaletsos for discussions and comments on the manuscript. In addition, J.-M.L. would like to thank the organisers and participants of the Likelihood-Free Inference Workshop hosted by the Simons Foundation for discussions, in particular, Danley Hsu, François Lanusse, George Papamakarios, Henri Pesonen, Joeri Hermans, Johann Brehmer, Kyle Cranmer, Owen Thomas and Umberto Simola. We also acknowledge and thank the Python (van1995python) and Julia (bezanson2017julia) communities for developing the tools enabling this work, including Altair (altair), DifferentialEquations.jl (rackauckas2017), Hydra (hydra), kernel-gof (kgof), igms (igms), NumPy (harris2020), pandas (pandas), pyABC (klinger2018), pyabcranger (collin2020), Pyro (bingham2018), PyTorch (paszke2019), sbi (sbi), Scikit-learn (scikit-learn), torch-two-sample (tts), and vega-lite (vega-lite).
This work was supported by the German Research Foundation (DFG; SFB 1233 PN 276693517, SFB 1089, SPP 2041, Germany’s Excellence Strategy – EXC number 2064/1 PN 390727645) and the German Federal Ministry of Education and Research (BMBF; project ’ADIMEM’, FKZ 01IS18052 A-D).
Appendix A Algorithms
a.1 Rejection Approximate Bayesian Computation (Rej-Abc)
Classical Approximate Bayesian Computation (ABC) is based on Monte Carlo rejection sampling (tavere1997; pritchard1999): In rejection ABC, the evaluation of the likelihood is replaced by a comparison between observed data $x_o$ and simulated data $x$, based on a distance measure $d(x, x_o)$. Samples from the approximate posterior are obtained by collecting simulation parameters $\theta$ that result in simulated data close to the observed data.
More formally, given observed data $x_o$, a prior $p(\theta)$ over parameters $\theta$ of a simulation-based model, a distance measure $d$, and an acceptance threshold $\epsilon$, rejection ABC obtains parameter samples from the approximate posterior as outlined in Algorithm 1.
In theory, rejection ABC obtains samples from the true posterior in the limit $\epsilon \to 0$ and $N \to \infty$, where $N$ is the simulation budget. In practice, its accuracy depends on the trade-off between the simulation budget and the rejection criterion $\epsilon$. Rejection ABC suffers from the curse of dimensionality, i.e., a linear increase in the dimensionality of the data $x$ requires an exponential increase in simulation budget to maintain accurate results.
For the benchmark, we did not use a fixed $\epsilon$-threshold, but quantile-based rejection: Depending on the simulation budget (1k, 10k, 100k), we used a quantile of 0.1, 0.01, or 0.001, so that REJ-ABC returned the 100 samples with the smallest distance to $x_o$ in each of these cases (see Appendix H for different hyperparameter choices). In order to compute metrics on 10k samples, we sampled from a KDE fitted on the accepted parameters (details about KDE resampling in Appendix H). REJ-ABC requires the choice of the distance measure $d$: here, we used the $\ell_2$-norm.
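As an illustration, quantile-based rejection ABC can be sketched in a few lines. The 1-D Gaussian simulator, prior, and observation below are hypothetical stand-ins, chosen only so that the example runs end-to-end; they are not one of the benchmark tasks:

```python
import random

random.seed(0)

# Toy setup (hypothetical): simulator x ~ N(theta, 1), prior theta ~ U(-5, 5),
# and a single observation x_o.
x_o = 1.0
prior = lambda: random.uniform(-5.0, 5.0)
simulate = lambda theta: random.gauss(theta, 1.0)

# Draw the full simulation budget up front.
n_sims = 10_000
draws = []
for _ in range(n_sims):
    theta = prior()
    draws.append((theta, simulate(theta)))

# Quantile-based rejection: keep the fraction q of parameters whose simulated
# data lie closest to x_o; this implicitly defines the tolerance epsilon.
q = 0.01
draws.sort(key=lambda tx: abs(tx[1] - x_o))  # l2-norm reduces to |x - x_o| in 1-D
accepted = [theta for theta, _ in draws[: int(q * n_sims)]]

post_mean = sum(accepted) / len(accepted)
```

With a small enough quantile, the accepted parameters approximate the posterior; here they concentrate around the observation, as expected for this toy simulator.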
a.2 Sequential Monte Carlo Approximate Bayesian Computation (Smc-Abc)
Sequential Monte Carlo Approximate Bayesian Computation (SMC-ABC) algorithms (beaumont2002; marjoram2006; sisson2007; toni2009) are an extension of the classical rejection ABC approach, inspired by importance sampling and sequential Monte Carlo sampling. Central to SMC-ABC is the idea of approaching the final set of approximate posterior samples gradually, by constructing a series of intermediate sample sets that move towards the target distribution through successive perturbation and re-weighting steps.
Several variants have been developed (e.g., sisson2007; beaumont2009; toni2009; simola2020). Here, we used the ABC-PMC scheme of beaumont2009 and refer to it as SMC-ABC in the manuscript. More formally, the ABC-PMC algorithm is as follows: Given observed data $x_o$, a prior $p(\theta)$ over parameters of a simulation-based model, a distance measure $d$, a schedule of decreasing acceptance thresholds $\epsilon_1 > \epsilon_2 > \dots$, and a kernel to perturb intermediate samples, weighted samples from the approximate posterior are obtained as described in Algorithm 2.
SMC-ABC can improve the sampling efficiency compared to REJ-ABC and avoids severe inefficiencies due to a mismatch between initial sampling and the target distribution. However, it comes with more hyperparameters that can require careful tuning to the problem at hand, e.g., the choice of distance measure, kernel, and $\epsilon$-schedule. Like REJ-ABC, SMC-ABC suffers from the curse of dimensionality.
For the benchmark, we considered the popular toolbox pyABC (klinger2018). Additionally, to fully understand the details of the SMC-ABC approach, we also implemented our own version. In the main paper, we report results obtained with our implementation, because it yielded slightly better results. A careful comparison of the two implementations, and the optimization of hyperparameters such as the $\epsilon$-schedule, population size, and perturbation-kernel variance across different tasks, are shown in Appendix H. After optimization, the crucial parameters of SMC-ABC were set as follows: $\ell_2$-norm as distance metric, quantile-based epsilon decay with a 0.2 quantile, population size 100 for simulation budgets 1k and 10k, population size 1000 for simulation budget 100k, and a Gaussian perturbation kernel with the empirical covariance of the previous population scaled by 0.5. We obtained the 10k samples required for calculation of metrics as follows: If a population was not completed within the simulation budget, we completed it with accepted particles from the last population and recalculated all weights. We then fitted a KDE on all those particles and drew 10k samples from the KDE.
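The core SMC-ABC loop (quantile-based epsilon decay, a Gaussian perturbation kernel scaled from the previous population, and importance re-weighting) can be sketched on a hypothetical 1-D toy problem. This is a deliberately simplified single-parameter sketch, not the benchmark implementation:

```python
import math, random

random.seed(1)

# Toy setup (hypothetical): x ~ N(theta, 1), prior theta ~ U(-5, 5), x_o = 1.0.
x_o = 1.0
prior_lo, prior_hi = -5.0, 5.0
prior_logpdf = -math.log(prior_hi - prior_lo)
simulate = lambda theta: random.gauss(theta, 1.0)
dist = lambda x: abs(x - x_o)

pop_size, n_pops, eps_quantile, kernel_scale = 100, 4, 0.2, 0.5

# Initial population: plain rejection sampling with an infinite tolerance.
particles, weights, dists = [], [], []
for _ in range(pop_size):
    theta = random.uniform(prior_lo, prior_hi)
    particles.append(theta); weights.append(1.0); dists.append(dist(simulate(theta)))

for _ in range(n_pops - 1):
    # Next epsilon: quantile of the previous population's distances.
    eps = sorted(dists)[int(eps_quantile * pop_size)]
    # Gaussian perturbation kernel scaled from the previous population's variance.
    mean = sum(w * t for w, t in zip(weights, particles)) / sum(weights)
    var = sum(w * (t - mean) ** 2 for w, t in zip(weights, particles)) / sum(weights)
    sigma = math.sqrt(kernel_scale * var) + 1e-9
    new_p, new_w, new_d = [], [], []
    while len(new_p) < pop_size:
        base = random.choices(particles, weights=weights)[0]
        theta = random.gauss(base, sigma)  # perturb a resampled particle
        if not (prior_lo <= theta <= prior_hi):
            continue
        d = dist(simulate(theta))
        if d >= eps:
            continue
        # Importance weight: prior density over kernel mixture density.
        kern = sum(
            w * math.exp(-0.5 * ((theta - t) / sigma) ** 2)
            / (sigma * math.sqrt(2 * math.pi))
            for w, t in zip(weights, particles)
        ) / sum(weights)
        new_p.append(theta); new_w.append(math.exp(prior_logpdf) / kern); new_d.append(d)
    particles, weights, dists = new_p, new_w, new_d

post_mean = sum(w * t for w, t in zip(weights, particles)) / sum(weights)
```

Each population concentrates closer to the observation as epsilon shrinks; the final weighted particles approximate the posterior.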
a.3 Neural Likelihood Estimation (Nle)
Likelihood estimation approaches to SBI use density estimation to approximate the likelihood $p(x|\theta)$. After learning a surrogate $q_\psi(x|\theta)$ (with $\psi$ denoting the parameters of the estimator) for the likelihood function, one can, for example, use Markov Chain Monte Carlo (MCMC) based sampling algorithms to obtain samples from the approximate posterior. This idea dates back to using Gaussian approximations of the likelihood (wood2010; sisson2018_chapter12) and was more recently extended to density estimation with neural networks (papamakarios2019a; lueckmann2019).
We refer to the single-round version of the (sequential) neural likelihood approach by papamakarios2019a as NLE, and outline it in Algorithm 3: Given a set of samples $\{(\theta_i, x_i)\}_{i=1}^N$ obtained by sampling $\theta_i$ from the prior and simulating $x_i$, we train a conditional neural density estimator $q_\psi(x|\theta)$ modelling the conditional of data given parameters on this set. Training proceeds by maximizing the log likelihood $\sum_i \log q_\psi(x_i|\theta_i)$. Given enough simulations, a sufficiently flexible conditional neural density estimator approximates the likelihood in the support of the prior (papamakarios2019a). Once $q_\psi$ is trained, samples from the approximate posterior are obtained using MCMC sampling based on the approximate likelihood $q_\psi(x_o|\theta)$ and the prior $p(\theta)$.
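To make the two-step structure of NLE concrete (fit a surrogate likelihood on prior simulations, then run MCMC against it), here is a deliberately simplified sketch in which a linear-Gaussian model stands in for the conditional neural density estimator and Metropolis-Hastings stands in for Slice Sampling; the toy simulator and all numbers are hypothetical:

```python
import math, random

random.seed(2)

# Toy setup (hypothetical): simulator x ~ N(2*theta, 1), prior theta ~ U(-3, 3).
x_o = 1.0
simulate = lambda theta: random.gauss(2.0 * theta, 1.0)

# Step 1: simulate a training set {(theta_i, x_i)} from the prior.
thetas = [random.uniform(-3.0, 3.0) for _ in range(2000)]
xs = [simulate(t) for t in thetas]

# Step 2: fit a surrogate likelihood q(x | theta). A conditional normalizing
# flow plays this role in NLE; here a linear-Gaussian model x ~ N(a*theta + b, s2)
# fitted by least squares (equivalently, maximum likelihood) stands in for it.
n = len(thetas)
mt, mx = sum(thetas) / n, sum(xs) / n
a = sum((t - mt) * (x - mx) for t, x in zip(thetas, xs)) / sum((t - mt) ** 2 for t in thetas)
b = mx - a * mt
s2 = sum((x - (a * t + b)) ** 2 for t, x in zip(thetas, xs)) / n

def surrogate_loglik(theta):
    return -0.5 * math.log(2 * math.pi * s2) - (x_o - (a * theta + b)) ** 2 / (2 * s2)

# Step 3: MCMC on surrogate likelihood times prior (uniform prior -> bounds check).
samples, theta = [], 0.0
for _ in range(5000):
    prop = theta + random.gauss(0.0, 0.5)
    if -3.0 <= prop <= 3.0 and math.log(random.random()) < surrogate_loglik(prop) - surrogate_loglik(theta):
        theta = prop
    samples.append(theta)

post_mean = sum(samples[1000:]) / len(samples[1000:])
```

For this toy, the analytic posterior mean is 0.5 (the likelihood peaks where 2*theta = x_o), which the sketch recovers.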
For MCMC sampling, papamakarios2019a suggest using Slice Sampling (neal2003) with a single chain. However, we observed that the accuracy of the obtained posterior samples can be substantially improved by changing the Slice Sampling scheme as follows: 1) Instead of a single chain, we used 100 parallel MCMC chains; 2) for initialization of the chains, we sampled 10k candidate parameters from the prior, evaluated them under the unnormalized approximate posterior, and used these values as weights to resample initial locations; 3) we transformed parameters to be unbounded as suggested e.g. in bingham2018; carpenter2017; hogg2017. In addition, we reimplemented the slice sampler to allow vectorized evaluations of the likelihood, which yielded significant computational speed-ups.
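The chain-initialization scheme in step 2 can be sketched as follows; the unnormalized target and the prior bounds are hypothetical, chosen to make the effect visible:

```python
import math, random

random.seed(3)

# Resampling-based initialization of parallel MCMC chains, on a hypothetical
# 1-D example: unnormalized target exp(-(theta - 2)^2 / 0.02) under a broad
# uniform prior U(-10, 10).
def unnorm_log_posterior(theta):
    return -((theta - 2.0) ** 2) / 0.02

n_candidates, n_chains = 10_000, 100

# 1) Draw candidate parameters from the prior.
candidates = [random.uniform(-10.0, 10.0) for _ in range(n_candidates)]

# 2) Evaluate each candidate under the unnormalized approximate posterior ...
log_w = [unnorm_log_posterior(c) for c in candidates]
m = max(log_w)
weights = [math.exp(lw - m) for lw in log_w]  # subtract max for numerical stability

# 3) ... and use these values as weights to resample initial chain locations.
init_locations = random.choices(candidates, weights=weights, k=n_chains)
```

All chains start in the high-probability region instead of being scattered across the prior, which matters when the posterior occupies a tiny fraction of the prior volume.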
For the benchmark, we used a Masked Autoregressive Flow (MAF, papamakarios2017) as density estimator, with five flow transforms, each with two blocks and 50 hidden units, a non-linearity, and batch normalization after each layer. For the MCMC step, we used the scheme outlined above with 200 warm-up steps and ten-fold thinning, to obtain 10k samples from the approximate posterior (1k samples from each chain). In Appendix H, we show results for all tasks obtained with a Neural Spline Flow (NSF, durkan2019neural) for density estimation, using five flow transforms, two residual blocks of 50 hidden units each, ReLU non-linearity, and 10 bins.
a.4 Sequential Neural Likelihood Estimation (Snle)
Sequential Neural Likelihood estimation (SNLE or SNL, papamakarios2019a) extends the neural likelihood estimation approach described in the previous section to be sequential.
The idea behind sequential SBI algorithms is based on the following intuition: If, for a particular inference problem, there is only a single observation $x_o$ one is interested in, then simulating data using parameters drawn from the entire prior space might be inefficient, leading to a training set containing data that carry little information about the posterior $p(\theta|x_o)$. Instead, to increase sample efficiency, one may draw training data points from a proposal distribution $\tilde p(\theta)$, ideally obtaining simulated data $x$ that are close to $x_o$. One candidate commonly used in the literature for such a proposal is the approximate posterior distribution itself.
SNLE is a multi-round version of NLE, in which new training samples are drawn in each round from a proposal $\tilde p(\theta)$. The proposal is chosen to be the posterior estimate at $x_o$ from the previous round, and its samples are obtained using MCMC. The proposal controls where the likelihood estimate is learned most accurately. Thus, by iterating over multiple rounds, a good approximation to the posterior can be learned more efficiently than by sampling all training data from the prior. SNLE is summarized in Algorithm 4.
For the benchmark, we used a Masked Autoregressive Flow (papamakarios2017) as density estimator, and MCMC to obtain posterior samples after every round, both with the same settings as described for NLE. The simulation budget was equally split across 10 rounds. In Appendix H, we show results for all tasks obtained with a Neural Spline Flow (NSF, durkan2019neural) for density estimation, using five flow transforms, two residual blocks of 50 hidden units each, ReLU non-linearity, and 10 bins.
a.5 Neural Posterior Estimation (Npe)
NPE uses conditional density estimation to directly estimate the posterior $p(\theta|x)$. This idea dates back to regression adjustment approaches (blum2010) and was more recently extended to density estimators using neural networks (papamakarios2016).
As outlined in Algorithm 5, the approach is as follows: Given a prior over parameters and a simulator, a set of training data points $\{(\theta_i, x_i)\}_{i=1}^N$ is generated. This training data is used to learn the parameters $\phi$ of a conditional density estimator $q_\phi(\theta|x)$ parameterized by a neural network. The loss function is given by the negative log probability $-\sum_i \log q_\phi(\theta_i|x_i)$. If the density estimator is flexible enough and training data is infinite, this loss function leads to perfect recovery of the ground-truth posterior (papamakarios2016).
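A minimal sketch of this loss and the resulting amortization, with a linear-Gaussian conditional estimator standing in for the neural density estimator: for the hypothetical toy prior and simulator below, minimizing the negative log probability has a closed-form least-squares solution, and the analytic posterior is known for checking.

```python
import random

random.seed(4)

# Toy setup (hypothetical): prior theta ~ N(0, 1), simulator x ~ N(theta, 1).
# Analytic posterior: theta | x ~ N(x / 2, 1 / 2).
n = 5000
thetas = [random.gauss(0.0, 1.0) for _ in range(n)]
xs = [random.gauss(t, 1.0) for t in thetas]

# NPE fits a conditional density estimator q(theta | x) by minimizing the
# negative log probability -sum_i log q(theta_i | x_i). For the linear-Gaussian
# family q(theta | x) = N(a*x + b, s2), standing in for the neural flow, this
# minimization has the closed-form least-squares solution:
mx, mt = sum(xs) / n, sum(thetas) / n
a = sum((x - mx) * (t - mt) for x, t in zip(xs, thetas)) / sum((x - mx) ** 2 for x in xs)
b = mt - a * mx
s2 = sum((t - (a * x + b)) ** 2 for x, t in zip(xs, thetas)) / n

# The estimator is amortized: for any observation x_o it returns a posterior
# estimate without further simulation or training.
x_o = 2.0
post_mean, post_var = a * x_o + b, s2
```

The fitted coefficients recover the analytic posterior map (a near 0.5, variance near 0.5), illustrating why the loss targets the posterior directly.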
For the benchmark, we used the approach by papamakarios2016 with a Neural Spline Flow (NSF, durkan2019neural) as density estimator, using five flow transforms, two residual blocks of 50 hidden units each, ReLU non-linearity, and 10 bins. We drew 10k samples from the approximate posterior. In Appendix H, we compare NSFs to Masked Autoregressive Flows (MAFs, papamakarios2017), as used in greenberg2019; durkan2020, with five flow transforms, each with two blocks and 50 hidden units, a non-linearity, and batch normalization after each layer.
a.6 Sequential Neural Posterior Estimation (Snpe)
Sequential Neural Posterior Estimation (SNPE) is the sequential analog of NPE, and is meant to increase sample efficiency (see also subsection A.4). When the posterior is targeted directly, using a proposal distribution $\tilde p(\theta)$ different from the prior requires a correction step: without it, the posterior under the proposal distribution would be inferred (papamakarios2016). This so-called proposal posterior is denoted by $\tilde p(\theta \mid x)$:
\[
\tilde p(\theta \mid x) = p(\theta \mid x)\,\frac{\tilde p(\theta)\,p(x)}{p(\theta)\,\tilde p(x)},
\]
where $\tilde p(x) = \int p(x \mid \theta)\,\tilde p(\theta)\,\mathrm{d}\theta$. Note that for $\tilde p(\theta) = p(\theta)$, it directly follows that $\tilde p(\theta \mid x) = p(\theta \mid x)$.
There have been three different approaches to this correction step so far, leading to three versions of SNPE (papamakarios2016; lueckmann2017; greenberg2019). All three algorithms have in common that they train a neural network to learn the parameters of a family of densities to estimate the posterior. They differ in what is targeted by and which loss is used for .
SNPE-A (papamakarios2016) trains the network to target the proposal posterior by minimizing a log-likelihood loss, and then post-hoc solves for the posterior. The analytical post-hoc step places restrictions on the density estimator, the proposal, and the prior: papamakarios2016 used Gaussian mixture density networks, single-Gaussian proposals, and Gaussian or uniform priors. SNPE-B (lueckmann2017) trains the network with an importance-weighted loss to directly recover the posterior without the need for a post-hoc correction, removing the restrictions on the density estimator, the proposal, and the prior. However, the importance weights can have high variance during training, leading to inaccurate inference for some tasks (greenberg2019). SNPE-C (APT) (greenberg2019) alleviates this issue by reparameterizing the problem such that it can infer the posterior by maximizing an estimated proposal posterior: it uses a loss defined on the approximate proposal posterior. greenberg2019 introduce ‘atomic’ proposals to allow for arbitrary choices of the density estimator, e.g., flows (papamakarios2019c): The loss is calculated as an expectation over proposal sets sampled from a so-called ‘hyperproposal’, as outlined in Algorithm 6 (see greenberg2019 for full details).
For the benchmark, we used the approach by greenberg2019 with ‘atomic’ proposals and referred to it as SNPE. As density estimator, we used a Neural Spline Flow (durkan2019neural) with the same settings as for NPE. For the ‘atomic’ proposals, we used a fixed number of atoms (a larger number was too demanding in terms of memory). The simulation budget was equally split across 10 rounds, and for the final round, we obtained 10k samples from the approximate posterior. In Appendix H, we compare NSFs to Masked Autoregressive Flows (MAFs, papamakarios2017), as used in greenberg2019; durkan2020, with five flow transforms, each with two blocks and 50 hidden units, a non-linearity, and batch normalization after each layer.
a.7 Neural Ratio Estimation (Nre)
Neural ratio estimation (NRE) uses neural network-based classifiers to approximate the posterior $p(\theta|x_o)$. While the neural network-based approaches described in the previous sections use density estimation to estimate either the likelihood ((S)NLE) or the posterior ((S)NPE), NRE algorithms ((S)NRE) use classification to estimate a ratio of likelihoods. The ratio can then be used for posterior evaluation or MCMC-based sampling.
Likelihood-ratio estimation can be used for SBI because it allows performing MCMC without evaluating the intractable likelihood: In MCMC, the transition probability from a current parameter $\theta$ to a proposed parameter $\theta'$ depends on the posterior ratio, and in turn on the likelihood ratio, between the two parameters:
\[
\frac{p(\theta' \mid x_o)}{p(\theta \mid x_o)} = \frac{p(x_o \mid \theta')\,p(\theta')}{p(x_o \mid \theta)\,p(\theta)}.
\]
Therefore, given a ratio estimator learned from simulations, one can perform MCMC to obtain samples from the posterior, even if evaluating the likelihood $p(x|\theta)$ itself is intractable.
hermans2019 proposed the following approach for MCMC with classifiers that approximate density ratios: A classifier is trained to distinguish dependent pairs $(\theta, x)$, sampled from the joint distribution, from independent pairs, sampled from the product of the marginals. This results in a likelihood-to-evidence estimator that needs to be trained only once to be evaluated for any combination of $\theta$ and $x$. The training of the classifier proceeds by minimizing the binary cross-entropy loss (BCE), as outlined in Algorithm 7. Once the classifier is trained, it can be used to perform MCMC to obtain samples from the posterior. The authors name their approach Amortized Approximate Likelihood Ratio MCMC (AALR-MCMC): It is amortized because, once the likelihood-ratio estimator is trained, it is possible to run MCMC for any observation $x_o$.
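The classifier-based ratio estimation idea can be sketched with plain logistic regression on a toy problem where the exact likelihood-to-evidence ratio is known; the hand-crafted features below stand in for the representation a neural network would learn, and the whole setup is hypothetical:

```python
import math, random

random.seed(5)

# Toy setup (hypothetical): theta ~ N(0, 1), x ~ N(theta, 1). Here the exact
# log likelihood-to-evidence ratio log[p(x|theta)/p(x)] is known analytically:
# 0.5*log(2) + theta*x - theta^2/2 - x^2/4.
n = 2000
thetas = [random.gauss(0.0, 1.0) for _ in range(n)]
xs = [random.gauss(t, 1.0) for t in thetas]
xs_shuffled = random.sample(xs, n)  # breaks the pairing -> samples from p(theta)p(x)

# Class 1: dependent pairs (theta, x) from the joint distribution.
# Class 0: independent pairs from the product of marginals.
# Features chosen so a linear logit can represent the true log ratio (which is
# quadratic in theta and x); a neural network learns such features itself.
feats = lambda t, x: [1.0, t * x, t * t, x * x]
data = [(feats(t, x), 1.0) for t, x in zip(thetas, xs)] + \
       [(feats(t, x), 0.0) for t, x in zip(thetas, xs_shuffled)]

# Logistic regression trained by full-batch gradient descent on the BCE loss.
w = [0.0] * 4
lr = 0.1
for _ in range(200):
    grad = [0.0] * 4
    for f, y in data:
        p = 1.0 / (1.0 + math.exp(-sum(wi * fi for wi, fi in zip(w, f))))
        for i in range(4):
            grad[i] += (p - y) * f[i]
    w = [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]

# The trained classifier's logit estimates log r(x, theta) = log[p(x|theta)/p(x)],
# exactly the quantity needed to run MCMC without the likelihood.
log_ratio = lambda t, x: sum(wi * fi for wi, fi in zip(w, feats(t, x)))
```

After training, matched pairs receive higher estimated ratios than mismatched ones, which is what drives the MCMC acceptance probabilities.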
Earlier ratio estimation algorithms for SBI (e.g., izbicki2014high; pham2014note; cranmer2015; dutta2016likelihood) and their connections to recent methods are discussed in thomas2020, as well as in durkan2020. AALR-MCMC is closely related to LFIRE (dutta2016likelihood) but trains an amortized classifier rather than a separate one per posterior evaluation. durkan2020 showed that the loss of AALR-MCMC is closely related to the atomic SNPE-C/APT approach of greenberg2019 (SNPE) and that both can be combined in a unified framework. durkan2020 changed the formulation of the loss function for training the classifier from binary to multi-class.
For the benchmark, we used neural ratio estimation (NRE) as formulated by durkan2020 and implemented in the sbi toolbox (sbi). As a classifier, we used a residual network architecture (ResNet) with two hidden layers of 50 units and ReLU non-linearity, trained with Adam (kingma2014adam). Following the notation of durkan2020, we used a fixed size $K$ for the contrasting set. For the MCMC step, we followed the same procedure as described for NLE, i.e., using Slice Sampling with 100 chains, to obtain 10k samples from each approximate posterior. In Appendix H, we show results for all tasks obtained with a multi-layer perceptron (MLP) architecture with two hidden layers of 50 ReLU units, and batch normalization.
a.8 Sequential Neural Ratio Estimation (Snre)
Sequential Neural Ratio Estimation (SNRE) is the sequential version of NRE, meant to increase sample efficiency at the cost of needing to train a new classifier for each observation $x_o$.
A sequential version of neural ratio estimation was proposed by hermans2019. As with other sequential algorithms, the idea is to replace the prior by a proposal distribution $\tilde p(\theta)$ that is focused on $x_o$, in the sense that the sampled parameters result in simulated data that are informative about $x_o$. The proposal for the next round is the posterior estimate from the previous round. The ratio estimator is refined over rounds by training the underlying classifier with positive examples drawn from the proposal joint distribution and negative examples drawn from the product of its marginals. Exact posterior evaluation is no longer possible, but samples can be obtained as before via MCMC. These steps are outlined in Algorithm 8.
For the benchmark, we used SNRE as formulated by durkan2020 and implemented in the sbi toolbox (sbi). The classifier had the same architecture as described for NRE. For the MCMC step, we followed the same procedure as described for NLE. The simulation budget was equally split across 10 rounds. In Appendix H, we show results for all tasks obtained with a multi-layer perceptron (MLP) architecture with two hidden layers of 50 ReLU units, and batch normalization.
a.9 Random Forest Approximate Bayesian Computation (Rf-Abc)
Random forest Approximate Bayesian Computation (RF-ABC, pudlo2016; raynal2018) is a more recently developed ABC algorithm based on a regression approach. Similar to previous regression approaches to ABC (beaumont2002; blum2010), RF-ABC aims at improving classical ABC inference (REJ-ABC, SMC-ABC) in the setting of high-dimensional data.
The idea of the RF-ABC algorithm is to use random forests (RF, breiman2001) to run a non-parametric regression of a set of potential summary statistics of the data on the corresponding parameters. That is, the RF regression is trained on data simulated from the model, such that the covariates are the summary statistics and the response variable is a parameter. For a detailed description of the algorithm, we refer to raynal2018.
The only hyperparameters of the RF-ABC algorithm are the number of trees and the minimum node size for the RF regression. Following raynal2018, we chose the default of 500 trees and a minimum node size of 5. The output of the algorithm is an RF weight for each of the simulated parameters. This set of weights can be used to calculate posterior quantiles or to obtain an approximate posterior density as described in raynal2018. We obtained 10k posterior samples for the benchmark by using the random forest weights to sample from the simulated parameters. We used the implementation in the abcranger toolbox (collin2020).
One important property of RF-ABC is that it can only be applied in the unidimensional setting, i.e., to one-dimensional parameter spaces, or to multidimensional parameter spaces under the assumption that the posterior factorizes over parameters (thus ignoring potential posterior correlations). This assumption holds only for a few tasks in our benchmark (Gaussian Linear, Gaussian Linear Uniform, Gaussian Mixture). Due to this inherent limitation, we report RF-ABC results in the supplement (see Suppl. Fig. 2).
a.10 Synthetic Likelihood (Sl)
The Synthetic Likelihood (SL) approach circumvents the evaluation of the intractable likelihood by estimating a synthetic one from simulated data or summary statistics. This approach was introduced by wood2010. Its main motivation is that the classical ABC approach of comparing simulated and observed data with a distance metric can be problematic if parts of the differences are entirely noise-driven. wood2010 instead approximated the distribution of the summary statistics (the likelihood) of a nonlinear ecological dynamic system as a Gaussian distribution, thereby capturing the underlying noise as well. The approximation of the likelihood can then be used to obtain posterior samples via Markov Chain Monte Carlo (MCMC) (wood2010).
The SL approach can be seen as a predecessor of the (S)NLE approaches, which replace the Gaussian approximation of the likelihood with a much more flexible one based on neural networks and normalizing flows (see A.3). Moreover, there are modern approaches from the classical ABC field that have further developed SL with a Gaussian approximation (e.g., sisson2018_chapter12; priddle2019).
For the benchmark, we implemented our own version of the algorithm proposed by wood2010. We used Slice Sampling MCMC (neal2003) and estimated the Gaussian likelihood from 100 simulations at each sampling step. To ensure a positive definite covariance matrix, we added a small value $\epsilon$ to the diagonal of the estimated covariance matrix for some of the tasks: this was required for the SIR and Bernoulli GLM Raw tasks, whereas for Lotka-Volterra and SLCP with distractors we did not find a value that worked; for all remaining tasks, no such regularization was needed. For Slice Sampling, we used a single chain initialized with sampling/importance resampling as described for NLE, 1k warm-up steps, and no thinning, in order to keep the number of required simulations tractable. Since new simulations are required for every MCMC step, generating 10k posterior samples still resulted in an overall simulation budget per run that was orders of magnitude larger than the largest budget used in the benchmark.
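The per-step estimation of a Gaussian synthetic likelihood can be sketched on a hypothetical 1-D toy, with Metropolis-Hastings in place of Slice Sampling and a scalar jitter in place of the diagonal covariance regularization:

```python
import math, random

random.seed(6)

# Gaussian synthetic likelihood (wood2010) on a hypothetical 1-D toy: at each
# evaluation, the likelihood of theta is approximated by a Gaussian whose mean
# and variance are estimated from fresh simulations at that theta.
x_o = 1.0
simulate = lambda theta: random.gauss(theta, 0.5)

def synthetic_loglik(theta, n_sims=100, jitter=1e-6):
    sims = [simulate(theta) for _ in range(n_sims)]
    mu = sum(sims) / n_sims
    var = sum((s - mu) ** 2 for s in sims) / (n_sims - 1) + jitter  # jitter keeps var positive
    return -0.5 * math.log(2 * math.pi * var) - (x_o - mu) ** 2 / (2 * var)

# MH sampling with the synthetic likelihood under a uniform prior U(-5, 5).
# Note that every MCMC step consumes n_sims fresh simulations, which is why
# the overall simulation budget of SL is so large.
samples, theta, ll = [], 0.0, synthetic_loglik(0.0)
for _ in range(2000):
    prop = theta + random.gauss(0.0, 0.5)
    if -5.0 <= prop <= 5.0:
        ll_prop = synthetic_loglik(prop)
        if math.log(random.random()) < ll_prop - ll:
            theta, ll = prop, ll_prop
    samples.append(theta)

post_mean = sum(samples[500:]) / len(samples[500:])
```

This short run already uses 200k simulator calls, illustrating the cost argument made above.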
The high simulation budget makes it problematic to directly compare SL to the other algorithms in the benchmark. Therefore, we report SL results in the supplement (see Suppl. Fig. 3).
Appendix B Benchmark
b.1 Reference posteriors
We generated 10k reference posterior samples for each observation. For the T.1 Gaussian Linear task, reference samples were obtained by using the analytic solution for the true posterior. Similarly, for T.2 Gaussian Linear Uniform and T.7 Gaussian Mixture, the analytic solution was used, combined with an additional rejection step, in order to account for the bounded support of the posterior due to the use of a uniform prior. For the T.8 Two Moons task, we devised a custom scheme based on the model equations, which samples both modes and rejects samples outside the prior bounds.
For T.3 SLCP, T.9 SIR, and T.10 Lotka-Volterra, we devised a likelihood-based procedure to ensure obtaining a valid set of reference posterior samples: First, we used either Sampling/Importance Resampling (rubin1988using) (for T.3 SLCP and T.9 SIR) or Slice Sampling MCMC (neal2003) (for T.10 Lotka-Volterra) to obtain a set of 10k proposal samples from the unnormalized posterior $p(x_o|\theta)\,p(\theta)$. We used these proposal samples to train a density estimator, for which we used a neural spline flow (NSF) (durkan2019neural). Next, we created a mixture of the NSF and the prior with weights 0.9 and 0.1, respectively, as a proposal distribution $q(\theta)$ for rejection sampling (martino2018accept). Rejection sampling relies on finding a constant $M$ such that $p(x_o|\theta)\,p(\theta) \le M\,q(\theta)$ for all values of $\theta$: To find this constant, we initialized $M$, sampled $\theta \sim q(\theta)$, and updated $M \leftarrow p(x_o|\theta)\,p(\theta)/q(\theta)$ whenever this ratio exceeded the current value of $M$. This loop stopped only once at least 100k consecutive samples had produced no update. We then used $M$, $q(\theta)$, and the unnormalized posterior to generate 10k reference posterior samples. We found that the NSF-based proposal distribution resulted in high acceptance rates. We used this custom scheme rather than relying on MCMC directly, since we found that standard MCMC approaches (Slice Sampling, HMC, and NUTS) all struggled with multi-modal posteriors, and we wanted to avoid bias in the reference samples, e.g., due to correlations in MCMC chains.
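A minimal 1-D sketch of this bound-finding and rejection step, with a hypothetical unnormalized target and a broad Gaussian standing in for the NSF/prior mixture proposal (everything in log space for numerical stability):

```python
import math, random

random.seed(7)

# Hypothetical 1-D example. Unnormalized target: exp(-(theta - 1)^2);
# proposal q: a broad Gaussian N(0, 2^2) standing in for the NSF/prior mixture.
log_target = lambda t: -((t - 1.0) ** 2)
log_q = lambda t: -0.5 * math.log(2 * math.pi * 4.0) - t * t / 8.0
sample_q = lambda: random.gauss(0.0, 2.0)

# Step 1: estimate the constant M (here log M) by sampling from q and tracking
# the maximum target-to-proposal ratio, stopping only after a long run of
# samples produces no further update.
log_M, run = -math.inf, 0
while run < 100_000:
    t = sample_q()
    ratio = log_target(t) - log_q(t)
    if ratio > log_M:
        log_M, run = ratio, 0
    else:
        run += 1

# Step 2: standard rejection sampling against the envelope M * q(theta).
accepted = []
while len(accepted) < 5000:
    t = sample_q()
    if math.log(random.random()) < log_target(t) - (log_M + log_q(t)):
        accepted.append(t)

post_mean = sum(accepted) / len(accepted)
```

A proposal that closely matches the target (like the NSF in the text) keeps M small and hence the acceptance rate high; the broad Gaussian here is intentionally loose to show the scheme still works.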
As a sanity check, we ran this scheme twice on all tasks and observations and found that the resulting reference posterior samples were indistinguishable in terms of C2ST.
We provide sbibm, a benchmarking framework that implements all tasks, reference posteriors, different metrics, and tooling to run and analyse benchmark results at scale. The framework is available at github.com/sbi-benchmark/sbibm.
We make benchmarking new algorithms maximally easy by providing an open, modular framework for integration with SBI toolboxes. We here evaluated algorithms implemented in pyABC (klinger2018), pyabcranger (collin2020), and sbi (sbi). We emphasize that the goal of sbibm is orthogonal to any toolbox: It could easily be used with other toolboxes, or even be used to compare results for the same algorithm implemented by different ones. There are currently several SBI toolboxes available or under active development. elfi (elfi2018) is a general purpose toolbox, including ABC algorithms as well as BOLFI (gutmann2015). There are many toolboxes for ABC algorithms, e.g., abcpy (abcpy-repo), astroABC (jennings2017astroabc), CosmoABC (ishida2015cosmoabc), see also sisson2018_chapter13 for an overview. carl (louppe2016) implements the algorithm by cranmer2015. hypothesis (hypothesis-repo), and pydelfi (pydelfi-repo) are SBI toolboxes under development.
To ensure reproducibility of our results, we publicly released all code including instructions on how to run the benchmark on cloud-based infrastructure.
Appendix F Figures
Appendix H Hyperparameter Choices
In this section, we address two central questions for any benchmark: (1) how hyperparameters are chosen and (2) how sensitive results are to the respective choices.
Rather than tuning hyperparameters on a per-task basis, we changed hyperparameters on multiple or all tasks at once and selected configurations that worked best across tasks. We wanted to avoid overfitting on individual benchmark tasks and were instead interested in settings that generalize across multiple tasks. In practice, tuning an algorithm on a given task would typically be impossible, due to the lack of suitable metrics that can be computed without reference posteriors, as well as the high computational demands of many SBI tasks.
To find good general settings, we performed more than 10 000 individual runs. We explored hyperparameter choices that had not previously been reported, and uncovered substantial improvements. The benchmark offers the possibility to systematically compare different choices and to design better and more robust SBI algorithms.
h.1 Rej-Abc
Classical ABC algorithms have crucial hyperparameters, most importantly the distance metric and the acceptance tolerance $\epsilon$. We used our own implementation of REJ-ABC, as it is straightforward to implement (see A.1). The distance metric was fixed to the $\ell_2$-norm for all tasks, and we varied the acceptance tolerance across tasks on which REJ-ABC performed sufficiently well. Our implementation of REJ-ABC is quantile-based, i.e., we select a quantile of the samples with the smallest distance to the observed data, which implicitly defines an $\epsilon$. The 10k samples needed for the comparison to the reference posterior samples are then resampled from the selected samples. In order to check whether this resampling significantly impaired performance, we alternatively fitted a KDE to obtain the 10k samples.
Below, we show results for different schedules of quantiles across simulation budgets; e.g., a schedule of 0.1, 0.01, 0.001 corresponds to the 10, 1, and 0.1 percent quantiles, i.e., the top 100 samples for each simulation budget. Across tasks and budgets, the 0.1, 0.01, 0.001 quantile schedule performed best (Fig. 6). Performance improved with the KDE fit, especially on the Gaussian tasks. We therefore report the version using the top 100 samples and KDE in the main paper.
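The KDE resampling step, i.e., turning e.g. 100 accepted parameters into 10k samples for two-sample metrics, can be sketched as follows (1-D Gaussian KDE with a Scott's-rule bandwidth; the accepted samples below are simulated stand-ins for actual ABC output):

```python
import random, statistics

random.seed(8)

# Stand-in for the output of quantile-based rejection ABC: 100 accepted
# parameters (hypothetical, drawn here from a Gaussian for illustration).
accepted = [random.gauss(1.0, 0.5) for _ in range(100)]

# Scott's rule bandwidth for a 1-D Gaussian KDE.
n = len(accepted)
bw = statistics.stdev(accepted) * n ** (-1.0 / 5.0)

# Sampling from the KDE = pick a random accepted particle, add Gaussian noise
# with the bandwidth as standard deviation.
kde_samples = [random.choice(accepted) + random.gauss(0.0, bw) for _ in range(10_000)]
```

The KDE smooths the discrete set of accepted particles, so the 10k resampled points are not just repeats of the 100 accepted ones.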
h.2 Smc-Abc
SMC-ABC has several hyperparameters, including the population size, the perturbation kernel, the epsilon schedule, and the distance metric. In order to ensure that we report the best possible SMC-ABC results for a fair comparison, we swept over three hyperparameters that are especially critical: the population size, the quantile used to select epsilon from the distances of the particles of the previous population, and the scaling factor of the covariance of the Gaussian perturbation kernel. The remaining hyperparameters were fixed to values common in the literature: a Gaussian perturbation kernel and the $\ell_2$-norm distance metric.
Additionally, we compared our implementation against the one from the popular pyABC toolbox (klinger2018), referring to them as versions A and B, respectively. We swept over these hyperparameters and optionally added a post-hoc KDE fit for drawing the samples needed for two-sample-based performance metrics.
Overall, the setting with a population size of 100, a kernel covariance scale of 0.5, and an epsilon quantile of 0.2 performed best. Although the results of the two implementations were qualitatively very similar (compare Fig. 7 and Fig. 8), version A was slightly better on the Gaussian tasks. Although we tried to match the implementations and the exact settings, there are small differences between the two, which might explain the difference in the results: Implementation B constructs the Gaussian perturbation kernel using kernel density estimation on the weighted samples of the previous population, whereas A constructs it using the mean and covariance estimated from samples of the previous population. The latter could be advantageous for Gaussian-like (high-dimensional) posteriors (the Gaussian Mixture and Gaussian Linear tasks) and disadvantageous for non-Gaussian posteriors (e.g., Two Moons). We decided to report results for SMC-ABC in the main paper using implementation A (ours) with population size 100 for simulation budgets 1k and 10k, population size 1000 for simulation budget 100k, a kernel covariance scale of 0.5, and an epsilon quantile of 0.2. This choice of kernel covariance scale differs from recommendations in the literature (sisson2007; beaumont2009); we found only very small performance differences between scales and note that our choice is in line with the recommendation of the pyABC toolbox (pyabcKernel), i.e., a scale between 0 and 1. Performance improved with the KDE fit, especially on the Gaussian tasks. We therefore report the version with KDE in the main paper.
h.3 MCMC for (S)Nle and (S)Nre
(S)NLE and (S)NRE both rely on MCMC sampling, which has several hyperparameters. In line with papamakarios2019a and durkan2020, we used Slice Sampling (neal2003). However, we modified the MCMC schemes used in these papers and obtained significant improvements in performance and speed.
Number of chains and initialization. While papamakarios2019a; durkan2020 used a single chain with axis-aligned updates, we found that on tasks with multi-modal posteriors, it can be essential to run multiple MCMC chains in order to sample all modes. Performance on Two Moons, for example, was poor with a single chain, since usually only one of the crescent shapes was sampled. Rather than initialising chains by drawing initial locations from the prior, we found the resampling scheme as described in A.3 to work better for initialisation, and used 100 chains instead of a single one.
Transformation of variables. When implementing MCMC, it is common advice to transform problems to have unbounded support (hogg2017), although this has not been discussed in SBI papers or implemented in accompanying code. We found that without this transformation, MCMC sampling could get stuck in endless loops, e.g., on the Lotka-Volterra task. Apart from the transformation to unbounded space, we found z-scoring of parameters and data to be crucial for some tasks.
Vectorization of MCMC sampling. We reimplemented Slice Sampling so that all chains could perform likelihood evaluations in parallel. Evaluating likelihoods, e.g., in the case of (S)NLE, requires passes through a flow-based density estimator, which is significantly faster when batched. This allowed us to sample all chains in parallel rather than sequentially and yielded huge speed-ups: For example, SNLE on Gaussian Linear took more than 36 hours on average for 100k simulations without vectorization, and less than 2 hours with vectorization.
h.4 Density estimator for (S)NLE
Approaches based on neural networks (NNs) tend to have many hyperparameters, including the concrete type of NN architecture and the hyperparameters for training. We strove to keep our choices close to durkan2020, which are the defaults in the toolbox we used (sbi).
While papamakarios2019a; durkan2020 used Masked Autoregressive Flows (MAFs, papamakarios2017) for density estimation, we explored how results change when using Neural Spline Flows (NSFs, durkan2019neural) for density estimation. These results are shown in Fig. 9.
h.5 Density estimator for (S)NPE
We performed the analogous experiments for (S)NPE as for (S)NLE: here, we found NSFs to increase performance relative to MAFs (Fig. 10). When directly estimating the posterior distribution, especially on tasks with complex multi-modal structure like Two Moons or SLCP, the additional flexibility offered by NSFs improved performance. With NSFs, artifacts from the density transformation that were visible, e.g., in Two Moons posteriors, vanished. To our knowledge, results on (S)NPE with NSFs have not been previously published.
h.6 Classifier choice for (S)NRE
For (S)NRE, we compared two different choices of classifier architectures: an MLP and a ResNet architecture, as described in A.7. While results were similar for most tasks (Fig. 11), we decided to use the ResNet architecture in the main paper due to the better performance on Two Moons and SIR for low to medium simulation budgets.
Appendix M Metrics
m.1 Negative log probability of true parameters (NLTP)
In simulation-based inference, the average negative log probability of true parameters (NLTP) is commonly reported as a performance metric in the literature (papamakarios2016; durkan2018; papamakarios2019a; greenberg2019; hermans2019; durkan2020). An attractive property of this metric is that access to the ground-truth posterior is not required.
It is important to point out, however, that calculating this metric on a single pair $(\theta_o, x_o)$, or on a small number of pairs, is problematic. To illustrate the issue, consider the following example (as discussed in talts2018): suppose the true posterior $p(\theta|x_o)$ is Gaussian with standard deviation 0.5, and the true parameter $\theta_o$ is implausible (but possible) in that it lies more than two standard deviations away from the posterior mean, so that $\theta_o$ has low probability under the true posterior. If an algorithm fitted a wrong posterior, e.g., by overestimating the standard deviation as 1 instead of 0.5, the probability of $\theta_o$ under the estimated posterior would be higher than under the true posterior.
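This effect is easy to verify numerically. The following small illustration uses made-up numbers consistent with the example above (a mean of 0 and a $\theta_o$ of 1.2, i.e., 2.4 true-posterior standard deviations from the mean, are our own choices):

```python
import numpy as np

def gauss_logpdf(theta, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at theta."""
    return (-0.5 * ((theta - mu) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2 * np.pi))

mu, theta_o = 0.0, 1.2                     # theta_o: 2.4 posterior std devs out
lp_true = gauss_logpdf(theta_o, mu, 0.5)   # true posterior std = 0.5
lp_wide = gauss_logpdf(theta_o, mu, 1.0)   # overestimated std = 1
# The broader (wrong) posterior assigns theta_o higher probability.
assert lp_wide > lp_true
```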
Therefore, a large number of pairs should be used. Indeed, in the limit of infinitely many pairs $(\theta_o, x_o) \sim p(\theta)\,p(x|\theta)$, the metric converges to the expected Kullback-Leibler divergence between true and approximate posteriors, plus an entropy term:
\begin{align}
-\mathbb{E}_{(\theta, x) \sim p(\theta)p(x|\theta)} \left[ \log q(\theta|x) \right]
= \mathbb{E}_{x \sim p(x)} \left[ D_{\mathrm{KL}}\!\left( p(\theta|x) \,\|\, q(\theta|x) \right) \right]
+ \mathbb{E}_{x \sim p(x)} \left[ \mathcal{H}\!\left( p(\theta|x) \right) \right].
\end{align}
The first term in this equation is the average Kullback-Leibler divergence between true and approximate posteriors over all observations that can be generated when sampling parameters from the prior. The second term, the entropy term, is the same for all algorithms compared.
In the context of this benchmark, we decided against using the probability of $\theta_o$ as a metric: for all algorithms that are not amortized (all but one), evaluating posteriors at different observations $x_o$ would require rerunning inference. As the computational requirements for running the benchmark at 10 observations per task are already high, running tasks for hundreds of observations would be prohibitively expensive.
m.2 Simulation-based calibration (SBC)
In simulation-based calibration (SBC), samples $\tilde{\theta}$ are drawn from the data-averaged posterior, i.e., the posterior obtained by running inference for many observations. When the posterior approximation is exact, $\tilde{\theta}$ is distributed according to the prior (talts2018).
Let us briefly illustrate this: in SBC, we draw $\theta \sim p(\theta)$, $x \sim p(x|\theta)$, and $\tilde{\theta} \sim q(\tilde{\theta}|x)$, which implies a joint distribution $p(\theta)\, p(x|\theta)\, q(\tilde{\theta}|x)$. The marginal of $\tilde{\theta}$ is then:
\begin{align}
q(\tilde{\theta}) = \int\!\!\int q(\tilde{\theta}|x)\, p(x|\theta)\, p(\theta)\, \mathrm{d}\theta\, \mathrm{d}x.
\end{align}
If the approximate posterior is the true posterior, this marginal is equal to the prior: if $q(\tilde{\theta}|x) = p(\tilde{\theta}|x)$, then $q(\tilde{\theta}) = \int p(\tilde{\theta}|x)\, p(x)\, \mathrm{d}x = p(\tilde{\theta})$, i.e., one can set up a consistency test based on the distribution of samples $\tilde{\theta}$. talts2018 do this by using frequentist tests per dimension.
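In practice, these per-dimension tests are often based on rank statistics: for each pair, one computes the rank of the true parameter among posterior samples, which should be uniform if the posterior is exact. A minimal sketch (the function names and the conjugate-Gaussian usage below are illustrative assumptions, not the benchmark's code):

```python
import numpy as np

def sbc_ranks(prior_sample, simulate, posterior_sample,
              n_sbc=200, n_post=100, seed=0):
    """For each of n_sbc (theta, x) pairs, compute the rank of the true
    theta among n_post posterior samples. Ranks are uniform on
    {0, ..., n_post} if the posterior is exact."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_sbc):
        theta = prior_sample(rng)
        x = simulate(rng, theta)
        post = posterior_sample(rng, x, n_post)
        ranks.append(int((post < theta).sum()))
    return np.array(ranks)
```

For the conjugate model $\theta \sim \mathcal{N}(0, 1)$, $x|\theta \sim \mathcal{N}(\theta, 1)$, the exact posterior is $\mathcal{N}(x/2, 1/2)$, and the resulting ranks are close to uniform.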
Note that SBC as described above is merely a consistency check. For example, if the approximate posterior were the prior, a calibration test as described above would not be able to detect this. This is a realistic failure mode in simulation-based inference: it could happen with rejection ABC in the limit $\epsilon \to \infty$, or when learned summary statistics carry no information about $\theta$. One way around this issue is proposed by prangle2014diag, who restrict observations to a subset of all possible $x$.
SBC is similar to the NLTP metric described above, in that inference needs to be carried out for many observations generated by sampling from the prior. Running inference for hundreds of observations would be prohibitively expensive in terms of compute for most algorithms, which is why we do not rely on SBC in the benchmark.
m.3 Median distance (MEDDIST)
Posterior predictive checks (PPCs) use the posterior predictive distribution $p(\tilde{x}|x_o) = \int p(\tilde{x}|\theta)\, q(\theta|x_o)\, \mathrm{d}\theta$ to predict new data $\tilde{x}$. The observed data $x_o$ should look plausible under the posterior predictive distribution (gelman2004, chapter 6). A particular PPC, used for example in papamakarios2019a; greenberg2019; durkan2020, is to assess the median L2 distance between posterior predictive samples $\tilde{x}$ and $x_o$. The median is used because the mean would be more sensitive to outliers.
In the benchmark, we refer to this metric as the median distance (MEDDIST) and drew samples from each posterior predictive distribution to compute it. In contrast with the other metrics considered here, the median distance is computed in the space of data and requires additional simulations (which could be expensive, depending on the simulator). The median distance should be considered a mere check rather than a metric, as it does not necessarily test the structure of the estimated posterior.
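The computation itself is straightforward; a minimal sketch (the function name is ours):

```python
import numpy as np

def median_distance(predictive_samples, x_o):
    """Median L2 distance between posterior predictive samples and x_o.

    predictive_samples: (n_samples, dim) array; x_o: (dim,) array.
    """
    return np.median(np.linalg.norm(predictive_samples - x_o, axis=-1))
```

Note that this reduces each sample to a single scalar distance, which is why it cannot detect, e.g., missing posterior modes.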
m.4 Maximum Mean Discrepancy (MMD)
Maximum Mean Discrepancy (MMD) is an Integral Probability Metric (IPM); linear- and quadratic-time estimates for using MMD as a two-sample test were derived in gretton2012. MMD has been commonly used in the SBI literature with Gaussian kernels (papamakarios2019a; greenberg2019; hermans2019), setting a single length-scale hyperparameter via the median heuristic (ramdas2015). We follow the same procedure, i.e., we use Gaussian kernels with the length-scale determined by the median heuristic on reference samples. MMDs are calculated using 10k samples each from the reference and approximate posteriors.
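This procedure can be sketched as follows (a minimal quadratic-time sketch with our own function names, using the biased estimator rather than the exact estimator of the benchmark):

```python
import numpy as np

def median_heuristic(x):
    """Length-scale: median pairwise Euclidean distance of reference samples."""
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    return np.median(d[np.triu_indices_from(d, k=1)])

def mmd_squared(x, y, lengthscale):
    """Biased quadratic-time estimate of squared MMD with a Gaussian kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * lengthscale ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

Since the kernel is translation-invariant with a single global length-scale, it cannot adapt to local structure, which is the limitation discussed below.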
If simple kernels are used to compare distributions with complex, multimodal structure, distinct distributions can be mapped to nearby mean embeddings, resulting in low test power. On SLCP and Two Moons, for example, we found a translation-invariant kernel to be limiting, since it cannot adapt to the local structure (see Suppl. Fig. 4). This is reflected in the low correlation between MMD and C2ST (Suppl. Fig. 5). We emphasize that these issues are specific to the simple kernels and hyperparameter choices commonly used in the literature. Posteriors of the Two Moons task have a structure similar to the blobs example of liu2020, who argue for using learned kernels to overcome this problem.
m.5 Classifier-based tests (C2ST)
In classifier-based testing, a classifier is trained to distinguish samples of the true posterior $p(\theta|x_o)$ from samples of the estimated posterior $q(\theta|x_o)$. If the samples are indistinguishable, the classification performance should be at chance level, 0.5. Practical use and properties of classifier-based 2-sample testing (C2ST) are discussed in lopez-paz2018 (see gutmann2018likelihood; dalmasso2019a, for examples in the context of SBI).
To compute C2ST, we trained a two-layer neural network with ten times as many ReLU units per layer as the dimensionality of the parameters, optimized with Adam (kingma2014adam). Classifiers were trained on 10k z-scored samples each from the reference and approximate posteriors. Classification accuracy was reported using 5-fold cross-validation.
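The procedure can be sketched with scikit-learn (a sketch under our own assumptions about training details such as the iteration budget, not the benchmark's exact setup):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def c2st(x, y, seed=0):
    """Classifier 2-sample test: mean 5-fold CV accuracy of a two-layer
    MLP trained to tell z-scored samples x from samples y (0.5 = chance)."""
    data = np.concatenate([x, y])
    data = (data - data.mean(0)) / (data.std(0) + 1e-12)  # joint z-scoring
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    clf = MLPClassifier(
        hidden_layer_sizes=(10 * x.shape[1], 10 * x.shape[1]),  # 10x dim per layer
        max_iter=300,
        random_state=seed,
    )
    return cross_val_score(clf, data, labels, cv=5, scoring="accuracy").mean()
```

Accuracies near 0.5 indicate indistinguishable sample sets; accuracies near 1.0 indicate a poor posterior approximation.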
m.6 Kernelized Stein Discrepancy (KSD)
Kernelized Stein Discrepancy (KSD) is a 1-sample goodness-of-fit test proposed independently by chwialkowski2016 and liu2016. KSD tests samples from algorithms against the gradient of the unnormalized true posterior density, $\nabla_\theta \log \tilde{p}(\theta|x)$. We used KSD with Gaussian kernels, setting the length-scale through the median heuristic, and 10k samples from each algorithm.
Appendix P Practical Advice
Based on our current results and understanding, we provide advice for practitioners seeking to apply SBI. Note that our benchmark contains an initial selection from a wide variety of SBI algorithms; see, e.g., the recent review by cranmer2019 for an overview and additional advice. There is no one-size-fits-all solution: which algorithm to use in practice will depend on the problem at hand. With this advice, we seek to provide a few useful questions to help identify which classes of algorithms may be applicable.
Do we need the Bayesian posterior, or is a point estimate sufficient?
Our focus here was on SBI algorithms that target the Bayesian posterior distribution, that is, the set of parameters consistent with both the data and the prior distribution. If one only aims for a single point estimate, optimization methods might be more efficient.
Is the simulator really ‘black-box’?
The SBI algorithms presented in the benchmark can be applied to any ‘black-box’ simulator (access to likelihoods, access to internal random number generators, and differentiability are not required). If additional information is available, it is worth using algorithms that can leverage it. For example, if the likelihood is available, methods exploiting it (e.g., MCMC, variational inference) will generally be more efficient. Similarly, if one has access to the internal random numbers, probabilistic programming approaches such as inference compilation (le2016inference; baydin2019etalumis; wood2020) might be preferable. If additional quantities that characterize the latent process are available, i.e., the simulator is ‘gray-box’, they can be used to augment training data and improve inference (brehmer2020a; cranmer2019).
What domain knowledge do we have about the problem?
For any practical application of SBI, it is worth thinking carefully about domain knowledge. First, knowledge about plausible parameters should inform the choice of the prior (e.g., positive-valued parameters in SIR and Lotka-Volterra tasks). Second, domain knowledge can help design appropriate distance functions required, for example, for classical ABC algorithms. When using model-based approaches, domain knowledge can potentially be built into the SBI algorithm itself, for example, by incorporating neural network layers with appropriate inductive biases, or hard-coding known invariances into the architecture. Finally, domain knowledge may also guide the design of summary statistics.
Do we have, or can we learn summary statistics?
Summary statistics are especially important when facing problems with high-dimensional data. It is important to point out that the posterior given summary statistics, $p(\theta|s(x))$, is only equivalent to $p(\theta|x)$ if the summary statistics $s(x)$ are sufficient. The problem at hand can guide the manual design of summary statistics that are regarded as particularly important or informative. Alternatively, many automatic approaches to constructing summary features exist (e.g., prangle2014semi; charnock2018; dinev2018), and this is an active area of research. For instance, chen2020neural recently proposed a new approach to learn approximately sufficient statistics for (S)ABC and (S)NLE approaches, and discuss that many existing approaches do not guarantee (global) sufficiency for all $x$. When using (S)NPE or (S)NRE, high-dimensional data can be reduced as part of the network architecture.
Do we have low-dimensional data and parameters, and a cheap simulator?
If both the parameters and the data (or suitable summary-statistics thereof) are low-dimensional, and a very large number of simulations can be generated, model-free algorithms such as classical ABC algorithms can be competitive. These have the benefit of adding little computational overhead, especially compared to neural network-based approaches.
Are the data or the parameters higher dimensional?
For a limited simulation budget and/or higher dimensionalities, approaches that train a model of the likelihood, posterior, or likelihood ratio will generally be preferable to model-free algorithms such as (S)ABC. Overall, we did not find strong differences between these three model-based approaches in terms of estimation accuracy across the current benchmark tasks. For tasks in which the data dimensionality was much higher than that of the parameters, we found algorithms that can automatically learn informative summary statistics of the data, e.g., (S)NPE or (S)NRE, to be more efficient. Conversely, if the data are of lower dimensionality, likelihood-targeting methods (e.g., (S)NLE) might perform better.
Are simulations expensive? Can we simulate online?
For time-intensive and complex simulators, it can be beneficial to use sequential methods to increase sample efficiency: we found that sequential schemes generally outperformed non-sequential ones. While we focused on simple strategies which use the previous estimate of the posterior to propose new parameters, more sophisticated schemes (e.g., gutmann2015; lueckmann2019; jarvenpaa2019) may increase sample efficiency if only a few simulations can be obtained. For some applications, inference is performed on a fixed dataset, and one cannot resort to sequential algorithms.
Do we want to do inference once, or repeatedly?
If we want to do SBI separately for many different observations, methods that allow ‘amortization’ (NPE) are likely preferable. While NLE and NRE allow amortization of the neural network, MCMC sampling is still required for each observation, which takes additional time. Conversely, if we want to run SBI conditioned on an observation consisting of many i.i.d. data points, methods based on likelihood estimation (NLE), likelihood-ratio estimation (NRE), or NPE with exchangeable neural networks (chan2018) would be appropriate.
Appendix R Runtimes
In applications of SBI, simulations are commonly assumed to be the dominant cost. In order to make the benchmark feasible at this scale, we focused on simple simulators and optimized runtimes, e.g. we developed a new package bridging DifferentialEquations.jl (rackauckas2017; bezanson2017julia) and PyTorch (paszke2019) so that generating simulations for all implemented tasks is extremely fast. This differs from many cases in practice, where the runtime costs for an algorithm are often negligible compared to the cost of simulations. Having said that, algorithms show significant differences in runtime costs, which we measured and report here.
We recorded runtimes for all algorithms on all tasks. In principle, runtimes could be reduced by employing multi-CPU architectures; however, we opted for a single-CPU setup to accurately compare runtimes across all algorithms and tasks. We did not employ GPUs for training neural networks (NNs), because the types of NNs used by the algorithms currently in the benchmark do not benefit much from GPU over CPU training (e.g., no CNN architectures, and rather shallow and narrow networks). In fact, running SNPE on SLCP using a GeForce GTX 1080 showed slightly longer runtimes than on CPU, due to the overhead of copying data back and forth to the device. It was therefore more economical, and more comparable, to run the benchmark on CPUs.
All neural network-based algorithms were run on single 3.6 GHz CPU cores of AWS C5-instances. ABC algorithms were run on single CPU cores of an internal cluster with 2.4 GHz CPUs. We observed a difference in runtimes of less than 100ms when running ABC algorithms on the same hardware as used for neural network-based algorithms.
Figure 12 shows the recorded runtimes in minutes. We observed short runtimes for REJ-ABC and SMC-ABC, as these do not require NN training or MCMC. The sequential versions of all three NN-based algorithms yielded longer runtimes than the non-sequential versions because they involve 10 rounds of NN training. Among the sequential algorithms, SNPE showed the longest runtimes. Runtimes with MAFs instead of NSFs tended to be shorter, e.g., the difference between MAFs and NSFs using SNPE on SLCP at 100k simulations was about 50 minutes on average. We also emphasize that the speed of (S)NLE reported here was only obtained after vectorizing MCMC sampling: without vectorization, the runtime for SNLE on Gaussian Linear was more than 36 hours instead of less than 2 hours (see Appendix H).
Appendix T Tasks
t.1 Gaussian Linear
Inference of the mean of a 10-d Gaussian model, in which the covariance is fixed. The (conjugate) prior is Gaussian:
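For a conjugate Gaussian model of this form, the reference posterior is available in closed form. A minimal sketch (the function name and the covariance values in the test are illustrative, not the benchmark's exact settings):

```python
import numpy as np

def gaussian_linear_posterior(x_o, prior_cov, lik_cov):
    """Conjugate posterior for x | theta ~ N(theta, lik_cov) with
    prior theta ~ N(0, prior_cov). Returns posterior mean and covariance."""
    prior_prec = np.linalg.inv(prior_cov)
    lik_prec = np.linalg.inv(lik_cov)
    # Precisions add; the mean is the precision-weighted observation.
    post_cov = np.linalg.inv(prior_prec + lik_prec)
    post_mean = post_cov @ lik_prec @ x_o
    return post_mean, post_cov
```

Such closed-form reference posteriors are what makes this task useful for computing exact performance metrics.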
t.2 Gaussian Linear Uniform
Inference of the mean of a 10-d Gaussian model, in which the covariance is fixed. The prior is uniform:
t.3 SLCP
A challenging inference task designed to have a simple likelihood and a complex posterior. The prior is uniform over five parameters and the data are a set of four two-dimensional points sampled from a Gaussian likelihood whose mean and variance are nonlinear functions of $\theta$:
t.4 SLCP with Distractors
This task is similar to T.3, with the difference that we add uninformative dimensions (distractors) to the observation:
t.5 Bernoulli GLM
Inference of a 10-parameter Generalized linear model (GLM) with Bernoulli observations, and Gaussian prior with covariance matrix which encourages smoothness by penalizing the second-order differences in the vector of parameters (DeNicolao1997). The observations are the sufficient statistics for this GLM:
t.6 Bernoulli GLM Raw
This task is similar to T.5, the sole difference being that the observations are not the sufficient statistics for the Bernoulli GLM process but the raw observations:
Fixed parameters: duration of the task.
t.7 Gaussian Mixture
This task is common in the ABC literature. It consists of inferring the common mean of a mixture of two two-dimensional Gaussian distributions, one with much broader covariance than the other:
References: sisson2007; beaumont2009; toni2009; simola2020.
t.8 Two Moons
A two-dimensional task with a posterior that exhibits both global (bimodality) and local (crescent shape) structure to illustrate how algorithms deal with multimodality: