Abstract
Recent advances in probabilistic modelling have led to a large number of simulation-based inference algorithms which do not require numerical evaluation of likelihoods. However, a public benchmark with appropriate performance metrics for such ‘likelihood-free’ algorithms has been lacking. This has made it difficult to compare algorithms and identify their strengths and weaknesses. We set out to fill this gap: We provide a benchmark with inference tasks and suitable performance metrics, with an initial selection of algorithms including recent approaches employing neural networks and classical Approximate Bayesian Computation methods. We found that the choice of performance metric is critical, that even state-of-the-art algorithms have substantial room for improvement, and that sequential estimation improves sample efficiency. Neural-network-based approaches generally exhibit better performance, but there is no uniformly best algorithm. We provide practical advice and highlight the potential of the benchmark to diagnose problems and improve algorithms. The results can be explored interactively on a companion website. All code is open source, making it possible to contribute further benchmark tasks and inference algorithms.
JanMatthis Lueckmann, Jan Boelts, David S. Greenberg, Pedro J. Gonçalves, Jakob H. Macke
1 Introduction
Many domains of science, engineering, and economics make extensive use of models implemented as stochastic numerical simulators (gourieroux1993indirect; ratmann2007using; alsing2018; brehmer2018constraining; karabatsos2018approximate; gonccalves2019training). A key challenge when studying and validating such simulation-based models is the statistical identification of parameters which are consistent with observed data. In many cases, calculation of the likelihood is intractable or impractical, rendering conventional statistical approaches inapplicable. The goal of simulation-based inference (SBI), also known as ‘likelihood-free inference’, is to perform Bayesian inference without requiring numerical evaluation of the likelihood function (sisson2018_chapter1; cranmer2019). In SBI, it is generally not required that the simulator is differentiable, nor that one has access to its internal random variables.
In recent years, several new SBI algorithms have been developed (e.g., gutmann2015; papamakarios2016; lueckmann2017; chan2018; greenberg2019; papamakarios2019a; prangle2019distilling; brehmer2020a; hermans2019; jarvenpaa2020; picchini2020; rodrigues2020; thomas2020), energized, in part, by advances in probabilistic machine learning (rezende2016; papamakarios2017; papamakarios2019c). Despite, or possibly because of, these rapid and exciting developments, it is currently difficult to assess how different approaches relate to each other theoretically and empirically: First, different studies often use different tasks and metrics for comparison, and comprehensive comparisons on multiple tasks and simulation budgets are rare. Second, some commonly employed metrics might not be appropriate or might be biased through the choice of hyperparameters. Third, the absence of a benchmark has made it necessary to reimplement tasks and algorithms for each new study. This practice is wasteful, and makes it hard to rapidly evaluate the potential of new algorithms. Overall, it is difficult to discern the most promising approaches and decide on which algorithm to use when. These problems are exacerbated by the interdisciplinary nature of research on SBI, which has led to independent development and coexistence of closely related algorithms in different disciplines.
There are many exciting challenges and opportunities ahead, such as the scaling of these algorithms to high-dimensional data, active selection of simulations, and gray-box settings, as outlined in cranmer2019. To tackle such challenges, researchers will need an extensible framework to compare existing algorithms and test novel ideas. Carefully curated, a benchmark framework will make it easier for researchers to enter SBI research, and will fuel the development of new algorithms through community involvement, exchange of expertise and collaboration. Furthermore, benchmarking results could help practitioners to decide which algorithm to use on a given problem of interest, and thereby contribute to the dissemination of SBI.
The catalyzing effect of benchmarks has been evident, e.g., in computer vision (russakovsky2015imagenet), speech recognition (hirsch2000aurora; wang2018glue), reinforcement learning (bellemare2013arcade; duan2016benchmarking), Bayesian deep learning (filos2019systematic; wenzel2020), and many other fields drawing on machine learning. Open benchmarks can be an important component of transparent and reproducible computational research. Surprisingly, a benchmark framework for SBI has been lacking, possibly due to the challenging endeavor of designing benchmarking tasks and defining suitable performance metrics.
Here, we begin to address this challenge, and provide a benchmark framework for SBI to allow rapid and transparent comparisons of current and future SBI algorithms: First, we selected a set of initial algorithms representing distinct approaches to SBI (Fig. 1; cranmer2019). Second, we analyzed multiple performance metrics which have been used in the SBI literature. Third, we implemented ten tasks including tasks popular in the field. The shortcomings of commonly used metrics led us to focus on tasks for which a likelihood can be evaluated, which allowed us to calculate reference (‘ground-truth’) posteriors. These reference posteriors are made available to allow rapid evaluation of SBI algorithms. Code for the framework is available at github.com/sbi-benchmark/sbibm and we maintain an interactive version of all results at sbi-benchmark.github.io.
The full potential of the benchmark will be realized when it is populated with additional community-contributed algorithms and tasks. However, our initial version already provides useful insights: 1) the choice of performance metric is critical; 2) the performance of the algorithms on some tasks leaves substantial room for improvement; 3) sequential estimation generally improves sample efficiency; 4) for small and moderate simulation budgets, neural-network-based approaches outperform classical ABC algorithms, confirming recent progress in the field; and 5) there is no algorithm to rule them all. The performance ranking of algorithms is task-dependent, pointing to a need for better guidance or automated procedures for choosing which algorithm to use when. We highlight examples of how the benchmark can be used to diagnose shortcomings of algorithms and facilitate improvements. We end with a discussion of the limitations of the benchmark.
2 Benchmark
The benchmark consists of a set of algorithms, performance metrics and tasks. Given a prior over parameters p(θ), a simulator to sample x ∼ p(x|θ), and an observation x_o, an algorithm returns an approximate posterior q(θ|x_o), or samples from it, θ ∼ q(θ|x_o). The approximate solution is tested, according to a performance metric, against a reference posterior p(θ|x_o).
2.1 Algorithms
Following the classification introduced in the review by cranmer2019, we selected algorithms addressing SBI in four distinct ways, as schematically depicted in Fig. 1. An important difference between algorithms is how new simulations are acquired: Sequential algorithms adaptively choose informative simulations to increase sample efficiency. While crucial for expensive simulators, sequential acquisition can require non-trivial algorithmic steps and hyperparameter choices. To evaluate whether the potential gain is realized empirically and justifies the algorithmic burden, we included sequential and non-sequential counterparts for algorithms of each category.
Keeping our initial selection focused allowed us to carefully consider implementation details and hyperparameters: We extensively explored performance and sensitivity to different choices in more than 10k runs, all results and details of which can be found in Appendix H. Our selection is briefly described below, full algorithm details are in Appendix A.
REJABC and SMCABC. Approximate Bayesian Computation (ABC, sisson2018_chapter1) is centered around the idea of Monte Carlo rejection sampling (tavere1997; pritchard1999). Parameters are sampled from a proposal distribution, simulation outcomes x are compared with observed data x_o, and parameters are accepted or rejected depending on a (user-specified) distance function and rejection criterion. While rejection ABC (REJABC) uses the prior as a proposal distribution, the efficiency can be improved by using sequentially refined proposal distributions (SMCABC, beaumont2002; marjoram2006; sisson2007; toni2009; beaumont2009). We implemented REJABC with quantile-based rejection and used the scheme of beaumont2009 for SMCABC. We extensively varied hyperparameters and compared the implementation of an ABC toolbox (klinger2018) against our own (Appendix H). We investigated linear regression adjustment (blum2010) and the summary statistics approach by prangle2014semi (Suppl. Fig. 1).
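The core loop of quantile-based rejection ABC can be sketched in a few lines. The 1-d Gaussian simulator, uniform prior, and observation below are illustrative stand-ins for this sketch, not one of the benchmark tasks:

```python
import random
import statistics

random.seed(0)

def simulator(theta):
    """Toy simulator: Gaussian observation centred on theta (illustrative)."""
    return theta + random.gauss(0.0, 1.0)

def rej_abc(x_o, num_simulations=100_000, quantile=0.001):
    """Quantile-based rejection ABC: keep the parameters whose simulations
    land closest to the observation x_o (smallest distance quantile)."""
    thetas = [random.uniform(-10.0, 10.0) for _ in range(num_simulations)]  # prior samples
    dists = [abs(simulator(t) - x_o) for t in thetas]  # distance (1-d: absolute difference)
    num_accept = max(1, int(quantile * num_simulations))
    ranked = sorted(zip(dists, thetas))[:num_accept]
    return [t for _, t in ranked]

posterior_samples = rej_abc(x_o=1.5)
print(statistics.mean(posterior_samples))
```

With a 0.001 quantile and a 100k budget, this returns the 100 best parameter samples, mirroring the acceptance scheme described above.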
NLE and SNLE. Likelihood estimation (or ‘synthetic likelihood’) algorithms learn an approximation to the intractable likelihood, for an overview see sisson2018_chapter12. While early incarnations focused on Gaussian approximations (SL; wood2010), recent versions utilize deep neural networks (papamakarios2019a; lueckmann2019) to approximate a density over the data x given parameters θ, followed by MCMC to obtain posterior samples. Since we primarily focused on these latter versions, we refer to them as neural likelihood estimation (NLE) algorithms, and denote the sequential variant with proposals as SNLE. In particular, we used the scheme proposed by papamakarios2019a which uses masked autoregressive flows (MAFs, papamakarios2017) for density estimation. We improved MCMC sampling for (S)NLE and compared MAFs against Neural Spline Flows (NSFs; durkan2019neural), see Appendix H.
NPE and SNPE. Instead of approximating the likelihood, these approaches directly target the posterior. Their origins date back to regression adjustment approaches (blum2010). Modern variants (papamakarios2016; lueckmann2017; greenberg2019) use neural networks for density estimation (approximating a density over the parameters θ given data x). Here, we used the recent algorithmic approach proposed by greenberg2019 for sequential acquisitions. We report performance using NSFs for density estimation, which outperformed MAFs (Appendix H).
NRE and SNRE. Ratio Estimation approaches to SBI use classifiers to approximate density ratios (izbicki2014high; pham2014note; cranmer2015; dutta2016likelihood; durkan2020; thomas2020). Here, we used the recent approach proposed by hermans2019 as implemented in durkan2020: A neural-network-based classifier approximates probability ratios and MCMC is used to obtain samples from the posterior. SNRE denotes the sequential variant of neural ratio estimation (NRE). In Appendix H we compare different classifier architectures for (S)NRE.
In addition to these eight approaches, we benchmarked Random Forest ABC (RFABC; raynal2018), a recent ABC variant, and Synthetic Likelihood (SL; wood2010), mentioned above. However, RFABC only targets individual parameters (i.e., it assumes the posterior factorizes), and SL requires new simulations for every MCMC step, thus requiring orders of magnitude more simulations than the other algorithms. Therefore, we report results for these algorithms separately, in Suppl. Fig. 2 and Suppl. Fig. 3, respectively.
Algorithms can be grouped with respect to how their output is represented: 1) some return samples from the posterior (REJABC, SMCABC); 2) others return samples and allow evaluation of unnormalized posteriors ((S)NLE, (S)NRE); and 3) for some, the posterior density can be evaluated and sampled directly, without MCMC ((S)NPE). As discussed below, these properties constrain the metrics that can be used for comparison.
2.2 Performance metrics
Choice of a suitable performance metric is central to any benchmark. As the goal of SBI algorithms is to perform full inference, the ‘gold standard’ would be to quantify the similarity between the true posterior and the inferred one with a suitable distance (or divergence) measure on probability distributions. This would require both access to the ground-truth posterior, and a reliable means of estimating similarity between (potentially) richly structured distributions. Several performance metrics have been used in past research, depending on the constraints imposed by knowledge about ground truth and the inference algorithm (see Table 1). In real-world applications, typically only the observation x_o is known. However, in a benchmarking setting, it is reasonable to assume that one has at least access to the ground-truth parameters θ_o. There are two commonly used metrics which only require θ_o and x_o, but suffer severe drawbacks for our purposes:
Probability of θ_o. The negative log probability of the true parameters under the inferred posterior, −log q(θ_o|x_o), averaged over different (θ_o, x_o) pairs, has been used extensively in the literature (papamakarios2016; durkan2018; greenberg2019; papamakarios2019a; durkan2020; hermans2019). Its appeal lies in the fact that one does not need access to the ground-truth posterior. However, using it for only a small set of observations is highly problematic: It is only a valid performance measure if averaged over a large set of observations sampled from the prior (talts2018; detailed discussion, including the connection to simulation-based calibration, in Appendix M). For reliable results, one would require inference for hundreds of observations x_o, which is only feasible if inference is rapid (amortized) and the density can be evaluated directly (among the algorithms used here, this applies only to NPE).
Table 1: Metrics applicable for a given combination of available ground-truth information (columns) and algorithm output (rows).

Algorithm output \ Ground truth | x_o only | θ_o      | posterior samples | unnormalized posterior | posterior density
samples                         | 1        | 1        | 1, 3              | 1, 3, 4                | 1, 3, 4
samples + unnormalized density  | 1        | 1        | 1, 3              | 1, 3, 4                | 1, 3, 4
density                         | 1        | 1, 2     | 1, 2, 3           | 1, 2, 3, 4             | 1, 2, 3, 4, 5

1 = PPCs, 2 = Probability of θ_o, 3 = 2-sample tests, 4 = 1-sample tests, 5 = divergences.
Posterior-Predictive Checks (PPCs). As the name implies, PPCs should be considered a mere check rather than a metric, although the median distance between predictive samples and x_o has been reported in the SBI literature (papamakarios2019a; greenberg2019; durkan2020). A failure mode of such a metric is that an algorithm obtaining a good MAP point estimate could pass this check perfectly even if the estimated posterior is poor. Empirically, we found median distances (MEDDIST) to be in disagreement with other metrics (see 3 Results).
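This failure mode can be demonstrated on a toy 1-d Gaussian model (illustrative, not a benchmark task): a ‘posterior’ collapsed onto a single good point estimate achieves a smaller median distance than the correct, broader posterior does.

```python
import random
import statistics

random.seed(1)
x_o = 0.0
noise_sd = 1.0  # simulator noise scale (illustrative toy model)

def simulate(theta):
    return theta + random.gauss(0.0, noise_sd)

def median_distance(theta_samples):
    """MEDDIST: median distance between posterior-predictive simulations and x_o."""
    return statistics.median(abs(simulate(t) - x_o) for t in theta_samples)

n = 20_000
point_mass = [x_o] * n  # collapsed 'posterior': all mass on one point estimate
true_posterior = [random.gauss(x_o, noise_sd) for _ in range(n)]  # correct posterior (flat prior)

m_point = median_distance(point_mass)
m_true = median_distance(true_posterior)
print(m_point < m_true)  # the badly wrong point-mass 'posterior' scores better
```

The collapsed distribution wins on MEDDIST precisely because predictive samples from the correct posterior carry the extra spread of genuine parameter uncertainty.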
The shortcomings of these commonly-used metrics led us to focus on tasks for which it is possible to obtain samples from the ground-truth posterior p(θ|x_o), thus allowing us to use metrics based on two-sample tests:
Maximum Mean Discrepancy (MMD). MMD (gretton2012; sutherland2017) is a kernel-based two-sample test. Recent papers (papamakarios2019a; greenberg2019; hermans2019) reported MMD using translation-invariant Gaussian kernels with length scales determined by the median heuristic (ramdas2015). We empirically found that MMD can be sensitive to hyperparameter choices, in particular on posteriors with multiple modes and multiple length scales (see 3 Results and liu2020).
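A minimal sketch of a (biased) squared-MMD estimate with a Gaussian kernel and the median heuristic, on 1-d toy samples (the distributions below are illustrative, not benchmark posteriors):

```python
import math
import random
from itertools import product

random.seed(2)

def median_heuristic(xs, ys):
    """Kernel length scale set to the median pairwise distance of the pooled samples."""
    zs = xs + ys
    dists = sorted(abs(a - b) for i, a in enumerate(zs) for b in zs[i + 1:])
    return max(dists[len(dists) // 2], 1e-12)

def mmd2(xs, ys):
    """Biased estimate of squared MMD with a translation-invariant Gaussian kernel (1-d)."""
    ell = median_heuristic(xs, ys)
    k = lambda a, b: math.exp(-(a - b) ** 2 / (2 * ell ** 2))
    m, n = len(xs), len(ys)
    kxx = sum(k(a, b) for a, b in product(xs, xs)) / m ** 2
    kyy = sum(k(a, b) for a, b in product(ys, ys)) / n ** 2
    kxy = sum(k(a, b) for a, b in product(xs, ys)) / (m * n)
    return kxx + kyy - 2 * kxy

p = [random.gauss(0, 1) for _ in range(200)]
q_same = [random.gauss(0, 1) for _ in range(200)]
q_far = [random.gauss(3, 1) for _ in range(200)]
print(mmd2(p, q_same), mmd2(p, q_far))
```

Note that the length scale is recomputed per comparison; on multimodal targets the pooled median distance can be dominated by cross-mode distances, which is one way the sensitivity discussed above arises.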
Classifier 2-Sample Tests (C2ST). C2STs (friedman2004; lopezpaz2018) train a classifier to discriminate samples from the true and inferred posteriors, which makes them simple to apply and easy to interpret. We therefore prefer to report and compare algorithms in terms of accuracy in classification-based tests. In the context of SBI, C2ST has been used, e.g., in gutmann2018likelihood and dalmasso2019a.
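The idea can be sketched with a logistic-regression classifier and a simple train/test split in place of the MLP and five-fold cross-validation used in the benchmark (1-d toy samples; all choices here are illustrative):

```python
import math
import random

random.seed(3)

def c2st(p_samples, q_samples, epochs=300, lr=0.5):
    """Classifier 2-sample test: accuracy near 0.5 means the two sample sets
    are indistinguishable, accuracy near 1.0 means they are easily told apart."""
    data = [(x, 0) for x in p_samples] + [(x, 1) for x in q_samples]
    random.shuffle(data)
    split = len(data) // 2
    train, test = data[:split], data[split:]
    w, b = 0.0, 0.0
    for _ in range(epochs):  # full-batch gradient descent on logistic loss
        gw = gb = 0.0
        for x, y in train:
            p = 1 / (1 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / len(train)
        b -= lr * gb / len(train)
    correct = sum((1 / (1 + math.exp(-(w * x + b))) > 0.5) == (y == 1) for x, y in test)
    return correct / len(test)

same = c2st([random.gauss(0, 1) for _ in range(500)], [random.gauss(0, 1) for _ in range(500)])
diff = c2st([random.gauss(0, 1) for _ in range(500)], [random.gauss(4, 1) for _ in range(500)])
print(same, diff)
```

The interpretable 0.5-to-1 scale is what makes C2ST attractive for reporting: identical distributions give chance-level accuracy, clearly different ones approach perfect accuracy.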
Other metrics that could be used include:
Kernelized Stein Discrepancy (KSD). KSD (liu2016; chwialkowski2016) is a 1-sample test, which requires access to the unnormalized posterior p̃(θ|x_o) rather than samples from p(θ|x_o). Like MMD, current estimators use translation-invariant kernels.
Divergences. Divergences such as Total Variation (TV) divergence and KL divergences can only be computed when the densities of true and approximate posteriors can be evaluated (Table 1). Thus, we did not use divergences for the benchmark.
Full discussion and details of metrics in Appendix M.
2.3 Tasks
The preceding considerations guided our selection of inference tasks: We focused on tasks for which reference posterior samples can be obtained, to allow calculation of two-sample tests. We focused on eight purely statistical problems and two problems relevant in applied domains, with diverse dimensionalities of parameters and data (details in Appendix T):
Gaussian Linear/Gaussian Linear Uniform. We included two versions of simple, linear, 10d Gaussian models, in which the parameter θ is the mean, and the covariance is fixed. The first version has a Gaussian (conjugate) prior, the second one a uniform prior. These tasks allow us to test how algorithms deal with trivial scaling of dimensionality, as well as truncated support.
SLCP/SLCP Distractors. A challenging inference task designed to have a simple likelihood and a complex posterior (papamakarios2019a; greenberg2019): The prior is uniform over five parameters θ, and the data are a set of four two-dimensional points sampled from a Gaussian likelihood whose mean and variance are nonlinear functions of θ. This induces a complex posterior with four symmetrical modes and vertical cutoffs. We included a second version with 92 additional, non-informative output dimensions (distractors) to investigate the ability of SBI algorithms to detect informative features.
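A sketch of the commonly used SLCP parameterization (following papamakarios2019a; details may differ slightly from the benchmark implementation):

```python
import math
import random

random.seed(4)

def slcp_simulator(theta):
    """SLCP: four i.i.d. 2-d points from a Gaussian whose mean and covariance
    are nonlinear functions of theta (five parameters, prior U(-3, 3)^5)."""
    t1, t2, t3, t4, t5 = theta
    m = (t1, t2)
    s1, s2, rho = t3 ** 2, t4 ** 2, math.tanh(t5)
    # Cholesky factor of the 2x2 covariance [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]]
    a = s1
    b = rho * s2
    c = s2 * math.sqrt(max(1 - rho ** 2, 0.0))
    x = []
    for _ in range(4):  # four 2-d observations -> 8-d data
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        x += [m[0] + a * z1, m[1] + b * z1 + c * z2]
    return x

theta = [random.uniform(-3, 3) for _ in range(5)]
x = slcp_simulator(theta)
print(len(x))  # 8
```

Squaring θ3 and θ4 makes the likelihood invariant to their signs, which is what produces the four symmetrical posterior modes.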
Bernoulli GLM/Bernoulli GLM Raw. A 10-parameter Generalized Linear Model (GLM) with Bernoulli observations. Inference was performed either on sufficient statistics (10d) or on raw data (100d).
Gaussian Mixture. This inference task, introduced by sisson2007, has become common in the ABC literature (beaumont2009; toni2009; simola2020). It consists of a mixture of two two-dimensional Gaussian distributions, one with much broader covariance than the other.
Two Moons. A two-dimensional task with a posterior that exhibits both global (bimodality) and local (crescent shape) structure (greenberg2019), to illustrate how algorithms deal with multimodality.
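One common parameterization of the Two Moons simulator (following greenberg2019; details may differ from the benchmark implementation). The absolute value |θ1 + θ2| is what induces the global bimodality:

```python
import math
import random

random.seed(5)

def two_moons_simulator(theta):
    """Two Moons: a noisy crescent, shifted as a function of the two parameters."""
    t1, t2 = theta
    a = random.uniform(-math.pi / 2, math.pi / 2)   # angle along the crescent
    r = random.gauss(0.1, 0.01)                     # radial noise
    p = (r * math.cos(a) + 0.25, r * math.sin(a))   # local crescent structure
    # |t1 + t2| makes two parameter settings map to the same data: global bimodality
    return (p[0] - abs(t1 + t2) / math.sqrt(2),
            p[1] + (-t1 + t2) / math.sqrt(2))

x = two_moons_simulator((0.3, -0.2))
print(x)
```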
SIR. Dynamical systems represent paradigmatic use cases for SBI. SIR is an influential epidemiological model describing the dynamics of the numbers of individuals in three possible states: susceptible S, infectious I, and recovered or deceased R. We infer the contact rate β and the mean recovery rate γ, given observed infection counts at 10 evenly-spaced time points.
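The deterministic core of the SIR dynamics can be sketched with Euler steps (the step size, parameter values, horizon, and noiseless observations below are illustrative simplifications, not the benchmark task's exact setup):

```python
def simulate_sir(beta, gamma, n=1_000_000, i0=1, days=160, num_obs=10):
    """Deterministic SIR via Euler integration:
    dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = n - i0, i0, 0.0
    dt = 1.0
    observations = []
    for day in range(1, days + 1):
        new_infections = beta * s * i / n * dt
        recoveries = gamma * i * dt
        s, i, r = s - new_infections, i + new_infections - recoveries, r + recoveries
        if day % (days // num_obs) == 0:
            observations.append(i)  # infectious count at evenly spaced time points
    return observations

obs = simulate_sir(beta=0.4, gamma=0.125)
print(len(obs))  # 10
```

The benchmark task adds an observation model on top of such trajectories; this sketch only shows the latent dynamics being inferred.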
Lotka-Volterra. An influential model in ecology describing the dynamics of two interacting species, widely used in SBI studies. We infer four parameters related to species interaction, given the number of individuals in both populations at 10 evenly-spaced points in time.
2.4 Experimental Setup
For each task, we sampled 10 sets of true parameters θ_o from the prior and generated corresponding observations x_o. For each observation, we generated 10k samples from the reference posterior. Some reference posteriors required a customised (likelihood-based) approach (Appendix B).
In SBI, it is typically assumed that total computation cost is dominated by simulation time. We therefore report performance at different simulation budgets. For each observation, each algorithm was run with a simulation budget ranging from 1k to 100k simulations.
For each run, we calculated the metrics described above. To estimate C2ST accuracy, we trained a multilayer perceptron to tell apart approximate and reference posterior samples, and performed five-fold cross-validation. We used two hidden layers, each with ten times as many ReLU units as the dimensionality of the data. We also measured and report runtimes (Appendix R).
2.5 Software
Code. All code is released publicly at github.com/sbi-benchmark/sbibm. Our framework includes tasks, reference posteriors, metrics, plotting, and infrastructure tooling, and is designed to be 1) easily extensible, and 2) usable with external toolboxes implementing algorithms. All tasks are implemented as probabilistic programs in Pyro (bingham2018), so that likelihoods and gradients for reference posteriors can be extracted automatically. To make this possible for tasks that use ODEs, we developed a new interface between DifferentialEquations.jl (rackauckas2017; bezanson2017julia) and PyTorch (paszke2019). In addition, specifying simulators in a probabilistic programming language has the advantage that ‘gray-box’ algorithms (brehmer2020a; cranmer2019) can be added in the future. We here evaluated algorithms implemented in pyABC (klinger2018), pyabcranger (collin2020), and sbi (sbi). See Appendix B for additional details on our code and existing SBI toolboxes.
Reproducibility. Instructions to reproduce experiments on cloudbased infrastructure are in Appendix B.
Website. Along with the code, we provide a web interface which allows interactive exploration of all the results (sbi-benchmark.github.io; Appendix W).
3 Results
We first consider empirical results on a single task, Two Moons, according to different metrics, which illustrate the following important insight:
#1: Choice of performance metric is key. While C2ST results on Two Moons show that performance increases with higher simulation budgets and that sequential algorithms outperform non-sequential ones for low to medium budgets, these results were not reflected in MMD and MEDDIST (Fig. 2): In our analyses, we found MMD to be sensitive to hyperparameter choices, in particular on tasks with complex posterior structure. When using the commonly employed median heuristic to set the kernel length scale on a task with multimodal posteriors (like Two Moons), MMD had difficulty discerning markedly different posteriors. This can be ‘fixed’ by using hyperparameters adapted to the task (Suppl. Fig. 4). As discussed above, the median distance (though commonly used) can be ‘gamed’ by a good point estimate even if the estimated posterior is poor, and is thus not a suitable performance metric. Computation of KSD showed numerical problems on Two Moons due to the gradient calculation.
We assessed relationships between metrics empirically via the correlations across tasks (Suppl. Fig. 5). As discussed above, the log probability of ground-truth parameters can be problematic when averaged over too few observations (e.g., 10, as is common in the literature): indeed, this metric had a correlation of only 0.3 with C2ST on Two Moons and 0.6 on the SLCP task. Based on these considerations, we used C2ST for reporting performance (Fig. 3; results for MMD, KSD and median distance on the website).
Based on the comparison of the performance across all tasks, we highlight the following main points:
#2: These are not solved problems. C2ST uses an interpretable scale (from 0.5, chance level, to 1), which makes it possible to conclude that, for several tasks, no algorithm could solve them within the specified budget (e.g., SLCP, Lotka-Volterra). This highlights that our problems, though conceptually simple, are challenging, and there is room for development of more powerful algorithms.
#3: Sequential estimation improves sample efficiency. Our results show that sequential algorithms outperform non-sequential ones (Fig. 3). The difference was small on simple tasks (i.e., the linear Gaussian cases), yet pronounced on most others. However, we also found these methods to exhibit diminishing returns as the simulation budget grows, which points to an opportunity for future improvements.
#4: Density or ratio estimation-based algorithms generally outperform classical techniques. REJABC and SMCABC were generally outperformed by more recent techniques which use neural networks for density or ratio estimation, and which can therefore efficiently interpolate between different simulations (Fig. 3). Without such model-based interpolation, even a simple 10d Gaussian task can be challenging. However, classical rejection-based methods have a computational footprint that is orders of magnitude smaller, as no network training is involved (Appendix R). Thus, on low-dimensional problems and for cheap simulators, these methods can still be competitive. See Suppl. Fig. 1 for results with additional ABC variants (blum2010; prangle2014semi) and Suppl. Fig. 2 for results on RFABC.
#5: No one algorithm to rule them all. Although sequential density or ratio estimation-based algorithms performed better than their non-sequential counterparts, there was no clear-cut answer as to which sequential method (SNLE, SNRE, and SNPE) should be preferred. To some degree, this is to be expected: these algorithms have distinct strengths that can play out differently depending on the problem structure (see discussions e.g., in greenberg2019; durkan2018; durkan2020). However, this has not been shown systematically before. We formulate some practical guidelines for choosing appropriate algorithms in Appendix P.
#6: The benchmark can be used to diagnose implementation issues and improve algorithms. For example, (S)NLE and (S)NRE rely on MCMC sampling to compute posteriors, and this sampling step can limit the performance. Access to a reference posterior can help identify and improve such issues: We found that single chains initialized by sampling from the prior with axis-aligned slice sampling (as introduced in papamakarios2019a) frequently got stuck in single modes. Based on this observation, we changed the MCMC strategy (details in Appendix A), which, though simple, yielded significant performance and speed improvements on the benchmark tasks. Similarly, (S)NLE and (S)NRE improved by transforming parameters to be unbounded: Without transformations, runs on some tasks can get stuck during MCMC sampling (e.g., Lotka-Volterra). While this is common advice for MCMC (hogg2017), it has been lacking in code and papers of SBI approaches.
We used the benchmark to systematically compare hyperparameters: For example, as density estimators for (S)NLE and (S)NPE, we used NSFs (durkan2019neural), which were developed after these algorithms were published. This revealed that higher capacity density estimators were beneficial for posterior but not likelihood estimation (detailed analysis in Appendix H).
These examples show how the benchmark makes it possible to diagnose problems and improve algorithms.
4 Limitations
Our benchmark, in its current form, has several limitations. First, the algorithms considered here do not cover the entire spectrum of SBI algorithms: We did not include sequential algorithms using active learning or Bayesian Optimization (gutmann2015; jarvenpaa2019; lueckmann2019; aushev2020), or ‘gray-box’ algorithms, which use additional information about or from the simulator (e.g., baydin2019etalumis; brehmer2020a). We focused on approaches using neural networks for density estimation and did not compare to alternatives using Gaussian Processes (e.g., meeds2014; wilkinson2014). There are many other algorithms which the benchmark is currently lacking (e.g., nott2014; ong2018; clarte2019; prangle2019distilling; priddle2019; picchini2020; radev2020bayesflow; rodrigues2020). Keeping our initial selection small allowed us to carefully investigate hyperparameter choices. We focused on sequential algorithms with less sophisticated acquisition schemes and the black-box scenario, since we think these are important baselines for future comparisons.
Second, the tasks we considered do not cover the variety of possible challenges. Notably, while we have tasks with high-dimensional data with and without structure, we have not included tasks with high-dimensional spatial structure, e.g., images. Such tasks would require algorithms that automatically learn summary statistics while exploring the structure of the data (e.g., dinev2018; greenberg2019; hermans2019; chen2020neural), an active research area.
Third, while we extensively investigated tuning choices and compared implementations, the results might nevertheless reflect our own areas of expertise.
Fourth, in line with common practice in SBI, results presented in the paper focused on performance as a function of the number of simulation calls. It is important to remember that differences in computation time can be substantial (see Appendix R): For example, (S)ABC was much faster than approaches requiring network training. Overall, sequential neural algorithms exhibited the longest runtimes.
Fifth, for reasons described above, we focused on problems for which reference posteriors can be computed. This raises the question of how insights on these problems will generalize to ‘realworld’ simulators. Notably, even these simple problems already identify clear differences between, and limitations of, different SBI approaches. Since it is not possible to rigorously compare the performance of different algorithms directly on ‘realworld’ simulators due to the lack of appropriate metrics, we see the benchmark as a necessary stepping stone towards the development of (potentially automated) selection strategies for practical problems.
Sixth, in practice, the choice of algorithm can depend on aspects that are difficult to quantify: It will depend on the available information about a problem, the inference goal, and the speed of the simulator, among other considerations. We included some practical considerations and recommendations in Appendix P.
Finally, benchmarking is an important tool, but not an end in itself—for example, conceptually new ideas might initially not yield competitive results but only reveal their true value later. Conversely, ‘overfitting’ on benchmarks can lead to the illusion of progress, and result in an undue focus on small implementation details which might not generalize beyond it. It would certainly be possible to cheat on this benchmark: In particular, as the simulators are available, one could use samples (or even likelihoods) to excessively tune hyperparameters for each task. This would hardly transfer to practice where such tuning is usually impossible (lack of metrics and expensive simulators). Therefore, we carefully compared choices and selected hyperparameters performing best across tasks (Appendix H).
5 Discussion
Quantitatively evaluating, comparing and improving algorithms through benchmarking is at the core of progress in machine learning. We here provided an initial benchmark for simulation-based inference. If used sensibly, it will be an important tool for clarifying and expediting progress in SBI. We hope that the current results on multiple widely-used algorithms already provide insights into the state of the field, assist researchers with algorithm development, and that our recommendations for practitioners will help them in selecting appropriate algorithms.
We believe that the full potential of the benchmark will be revealed as more researchers participate and contribute. To facilitate this process, and allow users to quickly explore and compare algorithms, we are providing precomputed reference posteriors, a website (sbi-benchmark.github.io), and open-source code (github.com/sbi-benchmark/sbibm).
Acknowledgements
We thank Álvaro Tejero-Cantero, Auguste Schulz, Conor Durkan, François Lanusse, Leandra White, Marcel Nonnenmacher, Michael Deistler, Pedro Rodrigues, Poornima Ramesh, Sören Becker and Theofanis Karaletsos for discussions and comments on the manuscript. In addition, J.M.L. would like to thank the organisers and participants of the Likelihood-Free Inference Workshop hosted by the Simons Foundation for discussions, in particular, Danley Hsu, François Lanusse, George Papamakarios, Henri Pesonen, Joeri Hermans, Johann Brehmer, Kyle Cranmer, Owen Thomas and Umberto Simola. We also acknowledge and thank the Python (van1995python) and Julia (bezanson2017julia) communities for developing the tools enabling this work, including Altair (altair), DifferentialEquations.jl (rackauckas2017), Hydra (hydra), kernelgof (kgof), igms (igms), NumPy (harris2020), pandas (pandas), pyABC (klinger2018), pyabcranger (collin2020), Pyro (bingham2018), PyTorch (paszke2019), sbi (sbi), Scikit-learn (scikitlearn), torchtwosample (tts), and vegalite (vegalite).
This work was supported by the German Research Foundation (DFG; SFB 1233 PN 276693517, SFB 1089, SPP 2041, Germany’s Excellence Strategy – EXC number 2064/1 PN 390727645) and the German Federal Ministry of Education and Research (BMBF; project ’ADIMEM’, FKZ 01IS18052 AD).
Appendix A Algorithms
a.1 Rejection Approximate Bayesian Computation (REJ-ABC)
Classical Approximate Bayesian Computation (ABC) is based on Monte Carlo rejection sampling (tavere1997; pritchard1999): In rejection ABC, the evaluation of the likelihood is replaced by a comparison between observed data $x_o$ and simulated data $x$, based on a distance measure $d(x, x_o)$. Samples from the approximate posterior are obtained by collecting simulation parameters that result in simulated data close to the observed data.
More formally, given observed data $x_o$, a prior $p(\theta)$ over the parameters $\theta$ of a simulation-based model, a distance measure $d$, and an acceptance threshold $\epsilon$, rejection ABC obtains parameter samples from the approximate posterior as outlined in Algorithm 1.
In theory, rejection ABC obtains samples from the true posterior in the limit $\epsilon \to 0$ and $N \to \infty$, where $N$ is the simulation budget. In practice, its accuracy depends on the trade-off between the simulation budget and the rejection criterion $\epsilon$. Rejection ABC suffers from the curse of dimensionality, i.e., with a linear increase in the dimensionality of $x$, an exponential increase in simulation budget is required to maintain accurate results.
For the benchmark, we did not use a fixed threshold, but quantile-based rejection. Depending on the simulation budget (1k, 10k, 100k), we used a quantile of 0.1, 0.01, or 0.001, so that REJ-ABC returned the 100 samples with the smallest distance to $x_o$ in each of these cases (see Appendix H for different hyperparameter choices). In order to compute metrics on 10k samples, we sampled from a KDE fitted on the accepted parameters (details about KDE resampling in Appendix H). REJ-ABC requires the choice of the distance measure $d$: here we used the $\ell_2$ norm.
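The quantile-based scheme can be sketched in a few lines of NumPy. The simulator, prior, and observation below are toy stand-ins of our own choosing, not benchmark tasks:

```python
import numpy as np

def rej_abc(simulator, prior_sample, x_o, n_sim=10_000, quantile=0.01):
    """Quantile-based rejection ABC: keep the parameters whose simulated
    data fall closest to x_o under the l2 distance (implicit epsilon)."""
    thetas = np.stack([prior_sample() for _ in range(n_sim)])
    xs = np.stack([simulator(t) for t in thetas])
    dists = np.linalg.norm(xs - x_o, axis=-1)     # l2 distance to x_o
    n_keep = max(1, int(quantile * n_sim))        # e.g. top 1% of samples
    keep = np.argsort(dists)[:n_keep]
    return thetas[keep]

# Toy example: infer the mean of a 2D Gaussian with unit observation noise.
rng = np.random.default_rng(0)
simulator = lambda t: t + rng.normal(size=2)
prior_sample = lambda: rng.uniform(-5, 5, size=2)
x_o = np.array([1.0, -1.0])
samples = rej_abc(simulator, prior_sample, x_o)
```

In the benchmark pipeline, a KDE would then be fitted to `samples` to draw the 10k samples needed for the metrics.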
a.2 Sequential Monte Carlo Approximate Bayesian Computation (SMC-ABC)
Sequential Monte Carlo Approximate Bayesian Computation (SMC-ABC) algorithms (beaumont2002; marjoram2006; sisson2007; toni2009) are an extension of the classical rejection ABC approach, inspired by importance sampling and sequential Monte Carlo sampling. Central to SMC-ABC is the idea to approach the final set of samples from the approximate posterior by constructing a series of intermediate sets of samples which slowly approach the final set through perturbations.
Several variants have been developed (e.g., sisson2007; beaumont2009; toni2009; simola2020). Here, we used the ABC-PMC scheme of beaumont2009 and refer to it as SMC-ABC in the manuscript. More formally, the ABC-PMC algorithm proceeds as follows: Given observed data $x_o$, a prior $p(\theta)$ over the parameters of a simulation-based model, a distance measure $d$, a decreasing schedule of acceptance thresholds $\epsilon_1 > \epsilon_2 > \dots$, and a kernel to perturb intermediate samples, weighted samples from the approximate posterior are obtained as described in Algorithm 2.
SMC-ABC can improve the sampling efficiency compared to REJ-ABC and avoids severe inefficiencies due to a mismatch between initial sampling and the target distribution. However, it comes with more hyperparameters that can require careful tuning to the problem at hand, e.g., the choice of distance measure, kernel, and schedule. Like REJ-ABC, SMC-ABC suffers from the curse of dimensionality.
For the benchmark, we considered the popular toolbox pyABC (klinger2018). Additionally, to fully understand the details of the SMC-ABC approach, we also implemented our own version. In the main paper we report results obtained with our implementation because it yielded slightly better results. A careful comparison of the two implementations, and the optimization of hyperparameters such as schedule, population size and perturbation kernel variance across different tasks, are shown in Appendix H. After optimization, the crucial parameters of SMC-ABC were set as follows: $\ell_2$ norm as distance metric, quantile-based epsilon decay with 0.2 quantile, population size 100 for simulation budgets 1k and 10k, population size 1000 for simulation budget 100k, and a Gaussian perturbation kernel with the empirical covariance of the previous population scaled by 0.5. We obtained the 10k samples required for the calculation of metrics as follows: If a population was not completed within the simulation budget, we completed it with accepted particles from the last population and recalculated all weights. We then fitted a KDE on all these particles and drew 10k samples from the KDE.
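As a rough illustration of the ABC-PMC scheme with quantile-based epsilon decay and a scaled-covariance Gaussian perturbation kernel, the following sketch runs on a toy Gaussian simulator. All names and simplifications (e.g., a fixed number of populations) are ours, not the benchmark implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def smc_abc(simulator, prior_sample, prior_logpdf, x_o, pop_size=100,
            n_pops=4, quantile=0.2, kernel_scale=0.5, rng=None):
    """Minimal ABC-PMC sketch: epsilon from the distance quantile of the
    previous population; Gaussian kernel with scaled empirical covariance;
    importance weights in the style of beaumont2009."""
    rng = rng or np.random.default_rng(0)
    theta = np.stack([prior_sample() for _ in range(pop_size)])
    dist = np.array([np.linalg.norm(simulator(t) - x_o) for t in theta])
    w = np.full(pop_size, 1.0 / pop_size)
    for _ in range(n_pops - 1):
        eps = np.quantile(dist, quantile)     # next acceptance threshold
        cov = kernel_scale * np.cov(theta.T) + 1e-6 * np.eye(theta.shape[1])
        new_theta, new_dist = [], []
        while len(new_theta) < pop_size:
            t = theta[rng.choice(pop_size, p=w)]      # resample ...
            t = rng.multivariate_normal(t, cov)       # ... and perturb
            if not np.isfinite(prior_logpdf(t)):
                continue                              # outside prior support
            d = np.linalg.norm(simulator(t) - x_o)
            if d <= eps:
                new_theta.append(t)
                new_dist.append(d)
        new_theta = np.stack(new_theta)
        # Weight: prior density over kernel mixture density of the old population.
        kern = np.array([np.sum(w * multivariate_normal.pdf(theta, mean=t, cov=cov))
                         for t in new_theta])
        new_w = np.exp([prior_logpdf(t) for t in new_theta]) / kern
        theta, dist, w = new_theta, np.array(new_dist), new_w / new_w.sum()
    return theta, w

rng = np.random.default_rng(0)
simulator = lambda t: t + rng.normal(size=2)
prior_sample = lambda: rng.uniform(-5, 5, size=2)
prior_logpdf = lambda t: 0.0 if np.all(np.abs(t) < 5) else -np.inf
theta, w = smc_abc(simulator, prior_sample, prior_logpdf, np.array([1.0, -1.0]))
```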
a.3 Neural Likelihood Estimation (NLE)
Likelihood estimation approaches to SBI use density estimation to approximate the likelihood $p(x|\theta)$. After learning a surrogate $q_\psi(x|\theta)$ ($\psi$ denoting the parameters of the density estimator) for the likelihood function, one can, for example, use Markov Chain Monte Carlo (MCMC) based sampling algorithms to obtain samples from the approximate posterior. This idea dates back to using Gaussian approximations of the likelihood (wood2010; sisson2018_chapter12), and was more recently extended to density estimation with neural networks (papamakarios2019a; lueckmann2019).
We refer to the single-round version of the (sequential) neural likelihood approach by papamakarios2019a as NLE, and outline it in Algorithm 3: Given a set of samples $\{(\theta_n, x_n)\}_{n=1}^N$ obtained by sampling $\theta_n$ from the prior and simulating $x_n$, we train a conditional neural density estimator $q_\psi(x|\theta)$ modelling the conditional of data given parameters on this set. Training proceeds by maximizing the log likelihood $\sum_n \log q_\psi(x_n|\theta_n)$. Given enough simulations, a sufficiently flexible conditional neural density estimator approximates the likelihood in the support of the prior (papamakarios2019a). Once $q_\psi$ is trained, samples from the approximate posterior are obtained using MCMC sampling based on the approximate likelihood $q_\psi(x_o|\theta)$ and the prior $p(\theta)$.
For MCMC sampling, papamakarios2019a suggest using Slice Sampling (neal2003) with a single chain. However, we observed that the accuracy of the obtained posterior samples can be substantially improved by changing the Slice Sampling scheme as follows: 1) Instead of a single chain, we used 100 parallel MCMC chains; 2) for initialization of the chains, we sampled 10k candidate parameters from the prior, evaluated them under the unnormalized approximate posterior, and used these values as weights to resample initial locations; 3) we transformed parameters to have unbounded support, as suggested e.g. in bingham2018; carpenter2017; hogg2017. In addition, we reimplemented the slice sampler to allow vectorized evaluations of the likelihood, which yielded significant computational speedups.
For the benchmark, we used as density estimator a Masked Autoregressive Flow (MAF, papamakarios2017) with five flow transforms, each with two blocks and 50 hidden units, nonlinearity and batch normalization after each layer. For the MCMC step, we used the scheme as outlined above with 200 warmup steps and tenfold thinning, to obtain 10k samples from the approximate posterior (1k samples from each chain). In Appendix H we show results for all tasks obtained with a Neural Spline Flow (NSF, durkan2019neural) for density estimation, using five flow transforms, two residual blocks of 50 hidden units each, ReLU nonlinearity, and 10 bins.
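To make the likelihood-estimation idea concrete, the following sketch replaces the MAF with a much simpler conditional Gaussian density estimator in PyTorch; the trained `log_prob` then plays the role of the surrogate log-likelihood used in MCMC. The network architecture and toy simulator are illustrative assumptions, not the benchmark configuration:

```python
import torch
import torch.nn as nn

class GaussianLikelihoodNet(nn.Module):
    """Stand-in for the MAF: maps parameters theta to the mean and log-std
    of a diagonal Gaussian over data x, trained by maximum likelihood."""
    def __init__(self, theta_dim, x_dim, hidden=50):
        super().__init__()
        self.x_dim = x_dim
        self.net = nn.Sequential(nn.Linear(theta_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * x_dim))

    def log_prob(self, x, theta):
        mean, log_std = self.net(theta).split(self.x_dim, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp()).log_prob(x).sum(-1)

torch.manual_seed(0)
theta = torch.rand(2000, 2) * 10 - 5        # theta ~ U(-5, 5)
x = theta + torch.randn_like(theta)         # toy simulator: x ~ N(theta, I)

model = GaussianLikelihoodNet(theta_dim=2, x_dim=2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = -model.log_prob(x, theta).mean()  # maximize sum_n log q(x_n | theta_n)
    loss.backward()
    opt.step()

# model.log_prob(x_o, theta) now acts as a surrogate log-likelihood,
# to be combined with the prior inside an MCMC sampler.
```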
a.4 Sequential Neural Likelihood Estimation (SNLE)
Sequential Neural Likelihood estimation (SNLE or SNL, papamakarios2019a) extends the neural likelihood estimation approach described in the previous section to be sequential.
The idea behind sequential SBI algorithms is based on the following intuition: If, for a particular inference problem, there is only a single observation $x_o$ one is interested in, then simulating data using parameters drawn from the entire prior space might be inefficient, leading to a training set that contains data which carry little information about the posterior $p(\theta|x_o)$. Instead, to increase sample efficiency, one may draw training data points from a proposal distribution $\tilde{p}(\theta)$, ideally obtaining parameters $\theta$ for which $x$ is close to $x_o$. One candidate that has been commonly used in the literature for such a proposal is the approximate posterior distribution itself.
SNLE is a multi-round version of NLE, in which new training samples are drawn in each round from a proposal $\tilde{p}(\theta)$. The proposal is chosen to be the posterior estimate at $x_o$ from the previous round, and its samples are obtained using MCMC. The proposal controls where the likelihood surrogate is learned most accurately. Thus, by iterating over multiple rounds, a good approximation to the posterior can be learned more efficiently than by sampling all training data from the prior. SNLE is summarized in Algorithm 4.
For the benchmark, we used as density estimator a Masked Autoregressive Flow (papamakarios2017), and MCMC to obtain posterior samples after every round, both with the same settings as described for NLE. The simulation budget was equally split across 10 rounds. In Appendix H, we show results for all tasks obtained with a Neural Spline Flow (NSF, durkan2019neural) for density estimation, using five flow transforms, two residual blocks of 50 hidden units each, ReLU nonlinearity, and 10 bins.
a.5 Neural Posterior Estimation (NPE)
NPE uses conditional density estimation to directly estimate the posterior $p(\theta|x)$. This idea dates back to regression adjustment approaches (blum2010) and was more recently extended to density estimators using neural networks (papamakarios2016).
As outlined in Algorithm 5, the approach is as follows: Given a prior over parameters and a simulator, a set of training data points $\{(\theta_n, x_n)\}_{n=1}^N$ is generated. This training data is used to learn the parameters $\phi$ of a conditional density estimator $q_\phi(\theta|x)$ using a neural network. The loss function is given by the negative log probability $-\sum_n \log q_\phi(\theta_n|x_n)$. If the density estimator is flexible enough and the amount of training data is infinite, this loss function leads to perfect recovery of the ground-truth posterior (papamakarios2016).
For the benchmark, we used the approach by papamakarios2016 with a Neural Spline Flow (NSF, durkan2019neural) as density estimator, using five flow transforms, two residual blocks of 50 hidden units each, ReLU nonlinearity, and 10 bins. We drew 10k samples from the approximate posterior $q_\phi(\theta|x_o)$. In Appendix H, we compare NSFs to Masked Autoregressive Flows (MAFs, papamakarios2017), as used in greenberg2019; durkan2020, with five flow transforms, each with two blocks and 50 hidden units, nonlinearity and batch normalization after each layer.
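A minimal NPE-style sketch, with a diagonal-Gaussian conditional density estimator standing in for the NSF, shows the defining property of posterior estimation: after training with the negative log probability loss, samples for an observation are drawn directly from the network, without MCMC. The network and toy simulator are our own illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
theta = torch.rand(2000, 2) * 10 - 5      # theta ~ U(-5, 5)
x = theta + torch.randn_like(theta)       # toy simulator: x ~ N(theta, I)

# Conditional density estimator q_phi(theta | x): a diagonal Gaussian
# whose mean and log-std are produced by a small network.
net = nn.Sequential(nn.Linear(2, 50), nn.ReLU(), nn.Linear(50, 4))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(1000):
    mean, log_std = net(x).split(2, dim=-1)
    # Negative log probability loss: -sum_n log q(theta_n | x_n).
    loss = -torch.distributions.Normal(mean, log_std.exp()).log_prob(theta).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Amortized posterior: sample directly at the observation, no MCMC needed.
x_o = torch.tensor([[1.0, -1.0]])
with torch.no_grad():
    mean, log_std = net(x_o).split(2, dim=-1)
posterior_samples = torch.distributions.Normal(mean, log_std.exp()).sample((10_000,))
```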
a.6 Sequential Neural Posterior Estimation (SNPE)
Sequential Neural Posterior Estimation (SNPE) is the sequential analog of NPE, and is meant to increase sample efficiency (see also subsection A.4). When the posterior is targeted directly, using a proposal distribution $\tilde{p}(\theta)$ different from the prior requires a correction step: without it, the posterior under the proposal distribution would be inferred (papamakarios2016). This so-called proposal posterior is denoted by $\tilde{p}(\theta|x)$:
\[ \tilde{p}(\theta|x) = p(\theta|x)\,\frac{\tilde{p}(\theta)\,p(x)}{p(\theta)\,\tilde{p}(x)}, \]
where $\tilde{p}(x) = \int p(x|\theta)\,\tilde{p}(\theta)\,\mathrm{d}\theta$. Note that for $\tilde{p}(\theta) = p(\theta)$, it directly follows that $\tilde{p}(\theta|x) = p(\theta|x)$.
There have been three different approaches to this correction step so far, leading to three versions of SNPE (papamakarios2016; lueckmann2017; greenberg2019). All three algorithms have in common that they train a neural network $q_\phi$ to learn the parameters of a family of densities to estimate the posterior. They differ in what is targeted by $q_\phi$ and which loss is used to train it.
SNPE-A (papamakarios2016) trains $q_\phi$ to target the proposal posterior $\tilde{p}(\theta|x)$ by minimizing the negative log likelihood loss, and then post-hoc solves for $p(\theta|x)$. The analytical post-hoc step places restrictions on $q_\phi$, the proposal, and the prior: papamakarios2016 used Gaussian mixture density networks, single Gaussian proposals, and Gaussian or uniform priors. SNPE-B (lueckmann2017) trains $q_\phi$ with an importance-weighted loss to directly recover $p(\theta|x)$ without the need for a post-hoc correction, removing the restrictions with respect to $q_\phi$, the proposal, and the prior. However, the importance weights can have high variance during training, leading to inaccurate inference for some tasks (greenberg2019). SNPE-C (APT) (greenberg2019) alleviates this issue by reparameterizing the problem such that the posterior can be inferred by maximizing an estimated proposal posterior. It trains $q_\phi$ to approximate $p(\theta|x)$, using a loss defined on the approximate proposal posterior. greenberg2019 introduce ‘atomic’ proposals to allow for arbitrary choices of the density estimator, e.g., flows (papamakarios2019c): The loss is calculated as the expectation over proposal sets sampled from a so-called ‘hyper-proposal’, as outlined in Algorithm 6 (see greenberg2019 for full details).
For the benchmark, we used the approach by greenberg2019 with ‘atomic’ proposals and refer to it as SNPE. As density estimator, we used a Neural Spline Flow (durkan2019neural) with the same settings as for NPE. For the ‘atomic’ proposals, we used $M$ atoms (larger $M$ was too demanding in terms of memory). The simulation budget was equally split across 10 rounds, and for the final round, we obtained 10k samples from the approximate posterior $q_\phi(\theta|x_o)$. In Appendix H, we compare NSFs to Masked Autoregressive Flows (MAFs, papamakarios2017), as used in greenberg2019; durkan2020, with five flow transforms, each with two blocks and 50 hidden units, nonlinearity and batch normalization after each layer.
a.7 Neural Ratio Estimation (NRE)
Neural ratio estimation (NRE) uses neural network-based classifiers to approximate the posterior $p(\theta|x_o)$. While the neural network-based approaches described in the previous sections use density estimation to either estimate the likelihood ((S)NLE) or the posterior ((S)NPE), NRE algorithms ((S)NRE) use classification to estimate a ratio of likelihoods. The ratio can then be used for posterior evaluation or MCMC-based sampling.
Likelihood ratio estimation can be used for SBI because it makes it possible to perform MCMC without evaluating the intractable likelihood: In MCMC, the transition probability from a current parameter $\theta$ to a proposed parameter $\theta'$ depends on the posterior ratio and, in turn, on the likelihood ratio between the two parameters:
\[ \frac{p(\theta'|x_o)}{p(\theta|x_o)} = \frac{p(x_o|\theta')\,p(\theta')}{p(x_o|\theta)\,p(\theta)}. \]
Therefore, given a ratio estimator learned from simulations, one can perform MCMC to obtain samples from the posterior, even if evaluating the likelihood $p(x_o|\theta)$ is intractable.
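A short sketch of this idea: a Metropolis-Hastings sampler whose acceptance step needs only a log-ratio function. Here the exact analytic log-ratio of a toy Gaussian model stands in for the logit output of a trained classifier:

```python
import numpy as np

def ratio_mcmc(log_ratio, prior_logpdf, theta0, n_steps=5000, step=0.5, rng=None):
    """Metropolis-Hastings in which the intractable likelihood is replaced
    by a learned log ratio log r(x_o, theta): the acceptance probability
    depends only on ratios, so the evidence p(x_o) cancels."""
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    logp = log_ratio(theta) + prior_logpdf(theta)
    samples = []
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=theta.shape)
        logp_prop = log_ratio(prop) + prior_logpdf(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # MH acceptance step
            theta, logp = prop, logp_prop
        samples.append(theta)
    return np.array(samples)

# Toy stand-in for a trained classifier: the exact log-likelihood ratio
# for x_o ~ N(theta, I), up to a constant in theta.
x_o = np.array([1.0, -1.0])
log_ratio = lambda t: -0.5 * np.sum((x_o - t) ** 2)
prior_logpdf = lambda t: 0.0 if np.all(np.abs(t) < 5) else -np.inf
samples = ratio_mcmc(log_ratio, prior_logpdf, theta0=np.zeros(2))
```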
hermans2019 proposed the following approach for MCMC with classifiers approximating density ratios: A classifier is trained to distinguish samples $x \sim p(x|\theta)$ for an arbitrary $\theta$ from samples of the marginal model $x \sim p(x)$. This results in a likelihood-to-evidence estimator that needs to be trained only once in order to be evaluated for any combination of $x$ and $\theta$. The classifier is trained by minimizing the binary cross-entropy loss (BCE), as outlined in Algorithm 7. Once the classifier is trained, it can be used to perform MCMC to obtain samples from the posterior. The authors name their approach Amortized Approximate Likelihood Ratio MCMC (AALR-MCMC): It is amortized because, once the likelihood ratio estimator is trained, it is possible to run MCMC for any $x_o$.
Earlier ratio estimation algorithms for SBI (e.g., izbicki2014high; pham2014note; cranmer2015; dutta2016likelihood) and their connections to recent methods are discussed in thomas2020, as well as in durkan2020. AALR-MCMC is closely related to LFIRE (dutta2016likelihood) but trains an amortized classifier rather than a separate one per posterior evaluation. durkan2020 showed that the loss of AALR-MCMC is closely related to the atomic SNPE-C/APT approach of greenberg2019 (SNPE) and that both can be combined in a unified framework. durkan2020 changed the formulation of the loss function for training the classifier from binary to multi-class.
For the benchmark, we used neural ratio estimation (NRE) as formulated by durkan2020 and implemented in the sbi toolbox (sbi). As a classifier, we used a residual network architecture (ResNet) with two hidden layers of 50 units and ReLU nonlinearity, trained with Adam (kingma2014adam). Following the notation of durkan2020, we used $K$ as the size of the contrasting set. For the MCMC step, we followed the same procedure as described for NLE, i.e., using Slice Sampling with 100 chains, to obtain 10k samples from each approximate posterior. In Appendix H, we show results for all tasks obtained with a multi-layer perceptron (MLP) architecture with two hidden layers of 50 ReLU units, and batch normalization.
a.8 Sequential Neural Ratio Estimation (SNRE)
Sequential Neural Ratio Estimation (SNRE) is the sequential version of NRE, and is meant to increase sample efficiency, at the cost of needing to train new classifiers for different observations $x_o$.
A sequential version of neural ratio estimation was proposed by hermans2019. As with other sequential algorithms, the idea is to replace the prior by a proposal distribution that is focused on $x_o$, in the sense that the sampled parameters result in simulated data that are informative about $x_o$. The proposal for the next round is the posterior estimate from the previous round. The ratio estimator is refined over rounds by training the underlying classifier with positive examples drawn from the proposal and simulator, and negative examples drawn from the corresponding marginals. Exact posterior evaluation is no longer possible, but samples can be obtained as before via MCMC. These steps are outlined in Algorithm 8.
For the benchmark, we used SNRE as formulated by durkan2020 and implemented in the sbi toolbox (sbi). The classifier had the same architecture as described for NRE. For the MCMC step, we followed the same procedure as described for NLE. The simulation budget was equally split across 10 rounds. In Appendix H, we show results for all tasks obtained with a multi-layer perceptron (MLP) architecture with two hidden layers of 50 ReLU units, and batch normalization.
a.9 Random Forest Approximate Bayesian Computation (RF-ABC)
Random forest Approximate Bayesian Computation (RF-ABC, pudlo2016; raynal2018) is a more recently developed ABC algorithm based on a regression approach. Similar to previous regression approaches to ABC (beaumont2002; blum2010), RF-ABC aims at improving classical ABC inference (REJ-ABC, SMC-ABC) in the setting of high-dimensional data.
The idea of the RF-ABC algorithm is to use random forests (RF, breiman2001) to run a nonparametric regression of a set of potential summary statistics of the data on the corresponding parameters. That is, the RF regression is trained on data simulated from the model, such that the covariates are the summary statistics and the response variable is a parameter. For a detailed description of the algorithm, we refer to raynal2018.
The only hyperparameters of the RF-ABC algorithm are the number of trees and the minimum node size for the RF regression. Following raynal2018, we chose the default of 500 trees and a minimum node size of 5. The output of the algorithm is an RF weight for each of the simulated parameters. This set of weights can be used to calculate posterior quantiles or to obtain an approximate posterior density as described in raynal2018. We obtained 10k posterior samples for the benchmark by using the random forest weights to sample from the simulated parameters. We used the implementation in the abcranger toolbox (collin2020).
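A rough sketch of this weighting scheme in scikit-learn (our own toy 1D example, not the abcranger implementation): in the spirit of raynal2018, a simulated parameter's weight reflects how often its summary statistics share a leaf with the observed ones, normalized by leaf size:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
theta = rng.uniform(-5, 5, size=2000)           # 1D parameter
x = np.stack([theta + rng.normal(size=2000),    # two noisy summary statistics
              theta + rng.normal(size=2000)], axis=1)

rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5, random_state=0)
rf.fit(x, theta)

# Weight of training sample i: across trees, the frequency with which it
# falls into the same leaf as the observation, normalized by leaf size.
x_o = np.array([[1.0, 1.0]])
leaves_train = rf.apply(x)                      # (n_samples, n_trees) leaf ids
leaves_obs = rf.apply(x_o)                      # (1, n_trees)
same_leaf = leaves_train == leaves_obs
weights = (same_leaf / same_leaf.sum(axis=0, keepdims=True)).mean(axis=1)
weights = weights / weights.sum()

# Weighted resampling of the simulated parameters approximates the posterior.
posterior = rng.choice(theta, size=10_000, p=weights)
```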
One important property of RF-ABC is that it can only be applied in the unidimensional setting, i.e., to one-dimensional parameter spaces, or to multi-dimensional parameter spaces under the assumption that the posterior factorizes over parameters (thus ignoring potential posterior correlations). This assumption holds only for a few tasks in our benchmark (Gaussian Linear, Gaussian Linear Uniform, Gaussian Mixture). Due to this inherent limitation, we report RF-ABC in the supplement (see Suppl. Fig. 2).
a.10 Synthetic Likelihood (SL)
The Synthetic Likelihood (SL) approach circumvents the evaluation of the intractable likelihood by estimating a synthetic one from simulated data or summary statistics. This approach was introduced by wood2010. Its main motivation is that the classical ABC approach of comparing simulated and observed data with a distance metric can be problematic if parts of the differences are entirely noise-driven. wood2010 instead approximated the distribution of the summary statistics (the likelihood) of a nonlinear ecological dynamic system as a Gaussian distribution, thereby capturing the underlying noise as well. The approximation of the likelihood can then be used to obtain posterior samples via Markov Chain Monte Carlo (MCMC) (wood2010).
The SL approach can be seen as the predecessor of the (S)NLE approaches: these replace the Gaussian approximation of the likelihood with a much more flexible one based on neural networks and normalizing flows (see A.3). Moreover, there are modern approaches from the classical ABC field that have further developed SL with a Gaussian approximation (e.g., sisson2018_chapter12; priddle2019).
For the benchmark, we implemented our own version of the algorithm proposed by wood2010. We used Slice Sampling MCMC (neal2003) and estimated the Gaussian likelihood from 100 samples at each sampling step. To ensure a positive definite covariance matrix, we added a small value $\epsilon$ to the diagonal of the estimated covariance matrix for some of the tasks. In particular, we used for SIR and Bernoulli GLM Raw tasks, and we tried without success for Lotka-Volterra and SLCP with distractors. For all remaining tasks, we set . For Slice Sampling, we used a single chain initialized with the resampling scheme described for NLE, 1k warm-up steps and no thinning, in order to keep the number of required simulations tractable. This resulted in an overall simulation budget on the order of to simulations per run in order to generate 10k posterior samples, as new simulations are required for every MCMC step.
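The core of the Gaussian synthetic likelihood, including the diagonal jitter mentioned above, fits in a few lines; the toy simulator is our own illustrative choice, and inside MCMC this estimate would replace the true likelihood at every step:

```python
import numpy as np

def synthetic_loglik(simulator, theta, x_o, n_sim=100, eps=1e-6, rng=None):
    """Gaussian synthetic likelihood (wood2010-style): estimate the mean
    and covariance of the summary statistics from n_sim simulations at
    theta, jitter the diagonal, and evaluate x_o under that Gaussian."""
    rng = rng or np.random.default_rng(0)
    xs = np.stack([simulator(theta, rng) for _ in range(n_sim)])
    mu = xs.mean(axis=0)
    cov = np.cov(xs, rowvar=False) + eps * np.eye(len(mu))  # jitter diagonal
    diff = x_o - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + logdet + len(mu) * np.log(2 * np.pi))

# Toy simulator: the synthetic log-likelihood peaks near the true parameter.
simulator = lambda t, rng: t + rng.normal(size=2)
x_o = np.array([1.0, -1.0])
ll_near = synthetic_loglik(simulator, np.array([1.0, -1.0]), x_o)
ll_far = synthetic_loglik(simulator, np.array([4.0, 4.0]), x_o)
```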
The high simulation budget makes it problematic to directly compare SL and the other algorithms in the benchmark. Therefore, we report SL in the supplement (see Suppl. Fig. 3).
Appendix B Benchmark
b.1 Reference posteriors
We generated 10k reference posterior samples for each observation. For the T.1 Gaussian Linear task, reference samples were obtained by using the analytic solution for the true posterior. Similarly, for T.2 Gaussian Linear Uniform and T.7 Gaussian Mixture, the analytic solution was used, combined with an additional rejection step, in order to account for the bounded support of the posterior due to the use of a uniform prior. For the T.8 Two Moons task, we devised a custom scheme based on the model equations, which samples both modes and rejects samples outside the prior bounds.
For T.3 SLCP, T.9 SIR, and T.10 Lotka-Volterra, we devised a likelihood-based procedure to ensure obtaining a valid set of reference posterior samples: First, we either used Sampling/Importance Resampling (rubin1988using) (for T.3 SLCP, T.9 SIR) or Slice Sampling MCMC (neal2003) (for T.10 Lotka-Volterra) to obtain a set of 10k proposal samples from the unnormalized posterior $p(x_o|\theta)p(\theta)$. We used these proposal samples to train a density estimator, for which we used a neural spline flow (NSF) (durkan2019neural). Next, we created a mixture composed of the NSF and the prior with weights 0.9 and 0.1, respectively, as a proposal distribution $q(\theta)$ for rejection sampling (martino2018accept). Rejection sampling relies on finding a constant $M$ such that $p(x_o|\theta)p(\theta) \le M q(\theta)$ for all values of $\theta$: To find this constant, we initialized $M = 1$, sampled $\theta \sim q(\theta)$, and updated $M \leftarrow p(x_o|\theta)p(\theta)/q(\theta)$ whenever this ratio exceeded the current $M$. This loop stopped only after at least 100k samples without an update of $M$ were reached. We then used $M$, $q(\theta)$, and the unnormalized posterior to generate 10k reference posterior samples. We found that the NSF-based proposal distribution resulted in high acceptance rates. We used this custom scheme rather than relying on MCMC directly, since we found that standard MCMC approaches (Slice Sampling, HMC, and NUTS) all struggled with multimodal posteriors, and we wanted to avoid bias in the reference samples, e.g., due to correlations in MCMC chains.
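The accept/reject loop with an adaptively estimated constant $M$ can be sketched as follows, on a 1D toy target and proposal of our own choosing (with a smaller candidate batch than the 100k used for the reference posteriors):

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(log_target, proposal_sample, proposal_logpdf, n_samples):
    """Rejection sampling with an estimated constant M such that
    p~(theta) <= M q(theta): estimate log M as the largest log ratio over
    a batch of proposal draws, then accept theta with prob p~ / (M q)."""
    cand = [proposal_sample() for _ in range(10_000)]
    log_M = max(log_target(t) - proposal_logpdf(t) for t in cand)
    samples = []
    while len(samples) < n_samples:
        t = proposal_sample()
        if np.log(rng.uniform()) < log_target(t) - proposal_logpdf(t) - log_M:
            samples.append(t)
    return np.array(samples)

# 1D toy: unnormalized N(0, 0.5^2) target, N(0, 1.5^2) proposal.
log_target = lambda t: -0.5 * (t / 0.5) ** 2
proposal_sample = lambda: 1.5 * rng.normal()
proposal_logpdf = lambda t: -0.5 * (t / 1.5) ** 2 - np.log(1.5 * np.sqrt(2 * np.pi))
samples = rejection_sample(log_target, proposal_sample, proposal_logpdf, 2000)
```

In contrast to the reference-posterior scheme, this sketch estimates $M$ from a single batch; a rare candidate exceeding the estimate would simply be accepted with probability one.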
As a sanity check, we ran this scheme twice on all tasks and observations and found that the resulting reference posterior samples were indistinguishable in terms of C2ST.
b.2 Code
We provide sbibm, a benchmarking framework that implements all tasks, reference posteriors, different metrics and tooling to run and analyse benchmark results at scale. The framework is available at github.com/sbi-benchmark/sbibm.
We make benchmarking new algorithms maximally easy by providing an open, modular framework for integration with SBI toolboxes. We here evaluated algorithms implemented in pyABC (klinger2018), pyabcranger (collin2020), and sbi (sbi). We emphasize that the goal of sbibm is orthogonal to any toolbox: It could easily be used with other toolboxes, or even be used to compare results for the same algorithm implemented by different ones. There are currently several SBI toolboxes available or under active development. elfi (elfi2018) is a general purpose toolbox, including ABC algorithms as well as BOLFI (gutmann2015). There are many toolboxes for ABC algorithms, e.g., abcpy (abcpyrepo), astroABC (jennings2017astroabc), CosmoABC (ishida2015cosmoabc), see also sisson2018_chapter13 for an overview. carl (louppe2016) implements the algorithm by cranmer2015. hypothesis (hypothesisrepo), and pydelfi (pydelfirepo) are SBI toolboxes under development.
b.3 Reproducibility
To ensure reproducibility of our results, we publicly released all code including instructions on how to run the benchmark on cloudbased infrastructure.
Appendix F Figures
Appendix H Hyperparameter Choices
In this section, we address two central questions for any benchmark: (1) how hyperparameters are chosen and (2) how sensitive results are to the respective choices.
Rather than tuning hyperparameters on a pertask basis, we changed hyperparameters on multiple or all tasks at once and selected configurations that worked best across tasks. We wanted to avoid overfitting on individual benchmark tasks and were instead interested in settings that can generalize across multiple tasks. In practice, tuning an algorithm on a given task would typically be impossible, due to the lack of suitable metrics that can be computed without reference posteriors as well as high computational demands that SBI tasks often have.
To find good general settings, we performed more than 10 000 individual runs. We explored hyperparameter choices that have not been previously reported, and revealed substantial improvements. The benchmark offers the possibility to systematically compare different choices and design better and more robust SBI algorithms.
h.1 REJ-ABC
Classical ABC algorithms have crucial hyperparameters, most importantly the distance metric and the acceptance tolerance $\epsilon$. We used our own implementation of REJ-ABC, as it is straightforward to implement (see A.1). The distance metric was fixed to be the $\ell_2$ norm for all tasks, and we varied the acceptance tolerance across tasks on which REJ-ABC performed sufficiently well. Our implementation of REJ-ABC is quantile-based, i.e., we select the quantile of samples with the smallest distance to the observed data, which implicitly defines an $\epsilon$. The 10k samples needed for the comparison to the reference posterior samples are then resampled from the selected samples. In order to check whether this resampling significantly impaired performance, we alternatively fitted a KDE in order to obtain 10k samples.
Below, we show results for different schedules of quantiles for each simulation budget, e.g., a schedule of 0.1, 0.01, 0.001 corresponds to the 10, 1 and 0.1 percent quantiles, or the top 100 samples for each simulation budget. Across tasks and budgets, the 0.1, 0.01, 0.001 quantile schedule performed best (Fig. 6). Performance improved with the KDE fit, especially on the Gaussian tasks. We therefore report the version using the top 100 samples and KDE in the main paper.
h.2 SMC-ABC
SMC-ABC has several hyperparameters, including the population size, the perturbation kernel, the epsilon schedule and the distance metric. In order to ensure that we report the best possible SMC-ABC results for a fair comparison, we swept over three hyperparameters that are especially critical: the population size, the quantile used to select the epsilon from the distances of the particles of the previous population, and the scaling factor of the covariance of the Gaussian perturbation kernel. The remaining hyperparameters were fixed to values common in the literature: a Gaussian perturbation kernel and the $\ell_2$ norm as distance metric.
Additionally, we compared our implementation against one from the popular pyABC toolbox (klinger2018), to which we refer as versions A (ours) and B (pyABC), respectively. We swept over these hyperparameters and optionally added a post-hoc KDE fit for drawing the samples needed for two-sample based performance metrics.
Overall, the parameter setting with a population size of 100, a kernel covariance scale of 0.5, and an epsilon quantile of 0.2 performed best. Although the results of the two implementations were qualitatively very similar (compare Fig. 7 and Fig. 8, respectively), version A was slightly better on the Gaussian tasks. Although we tried to match the implementations and the exact settings, there are small differences between the two, which might explain the difference in the results: Implementation B constructs the Gaussian perturbation kernel using kernel density estimation on the weighted samples of the previous population, whereas A constructs it using the mean and covariance estimated from samples of the previous population. The latter could be advantageous in the case of a Gaussian-like (high-dimensional) posterior (Gaussian Mixture and Gaussian Linear tasks) and disadvantageous in the case of non-Gaussian-like posteriors (e.g., Two Moons). We decided to report results for SMC-ABC in the main paper using implementation A (ours) with population size 100 for simulation budgets 1k and 10k, population size 1000 for simulation budget 100k, a kernel covariance scale of 0.5, and an epsilon quantile of 0.2. This choice of kernel covariance scale differs from recommendations in the literature (sisson2007; beaumont2009): we found only very small performance differences between scales, and note that our choice is in line with the recommendation of the pyABC toolbox (pyabcKernel), i.e., using a scale between 0 and 1. Performance improved with the KDE fit, especially on the Gaussian tasks. We therefore report the version with KDE in the main paper.
h.3 MCMC for (S)NLE and (S)NRE
(S)NLE and (S)NRE both rely on MCMC sampling, which has several hyperparameters. In line with papamakarios2019a and durkan2020, we used Slice Sampling (neal2003). However, we modified the MCMC schemes used in these papers and obtained significant improvements in performance and speed.
Number of chains and initialization. While papamakarios2019a; durkan2020 used a single chain with axisaligned updates, we found that on tasks with multimodal posteriors, it can be essential to run multiple MCMC chains in order to sample all modes. Performance on Two Moons, for example, was poor with a single chain, since usually only one of the crescent shapes was sampled. Rather than initialising chains by drawing initial locations from the prior, we found the resampling scheme as described in A.3 to work better for initialisation, and used 100 chains instead of a single one.
Transformation of variables. When implementing MCMC, it is common advice to transform problems to have unbounded support (hogg2017), although this has not been discussed in SBI papers or implemented in the accompanying code. We found that without this transformation, MCMC sampling could get stuck in endless loops, e.g., on the Lotka-Volterra task. Apart from the transformation to unbounded space, we found z-scoring of parameters and data to be crucial for some tasks.
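A minimal version of such a transformation for box-constrained parameters is the logit map, together with the log-abs-det-Jacobian term that must be added to the log density when sampling in the unbounded space (our own sketch, not the benchmark code):

```python
import numpy as np

def to_unbounded(theta, lo, hi):
    """Logit map from the box (lo, hi) to the real line."""
    u = (theta - lo) / (hi - lo)
    return np.log(u) - np.log1p(-u)

def from_unbounded(z, lo, hi):
    """Inverse map: real line back into the box (lo, hi)."""
    u = 1.0 / (1.0 + np.exp(-z))
    return lo + (hi - lo) * u

def log_abs_det_jacobian(z, lo, hi):
    """Correction added to the log density when evaluating it in z-space."""
    u = 1.0 / (1.0 + np.exp(-z))
    return np.log(hi - lo) + np.log(u) + np.log1p(-u)
```

An MCMC chain running on $z$ can never step outside the prior box, which removes the endless-loop failure mode described above.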
Vectorization of MCMC sampling. We reimplemented Slice Sampling so that all chains could perform likelihood evaluations in parallel. Evaluating likelihoods, e.g., in the case of (S)NLE, requires passes through a flowbased density estimator, which is significantly faster when batched. This allowed us to sample all chains in parallel rather than sequentially and yielded huge speedups: For example, SNLE on Gaussian Linear took more than 36 hours on average for 100k simulations without vectorization, and less than 2 hours with vectorization.
H.4 Density estimator for (S)NLE
Approaches based on neural networks (NNs) tend to have many hyperparameters, including the concrete type of NN architecture and the hyperparameters used for training. We strove to keep our choices close to durkan2020, which are the defaults in the toolbox we used (sbi).
While papamakarios2019a; durkan2020 used Masked Autoregressive Flows (MAFs, papamakarios2017) for density estimation, we explored how results change when using Neural Spline Flows (NSFs, durkan2019neural) instead. These results are shown in Fig. 9.
H.5 Density estimator for (S)NPE
We performed the analogous experiments for (S)NPE as for (S)NLE: Here, we found NSFs to increase performance relative to MAFs (Fig. 10). When directly estimating the posterior distribution, especially on tasks with complex multimodal structure like Two Moons or SLCP, the additional flexibility offered by NSFs improved performance. With NSFs, artifacts from the density transformation that were visible, e.g., in Two Moons posteriors vanished. To our knowledge, results on (S)NPE with NSFs had not been previously published.
H.6 Classifier choice for (S)NRE
For (S)NRE, we compared two different choices of classifier architectures: an MLP and a ResNet architecture, as described in A.7. While results were similar for most tasks (Fig. 11), we decided to use the ResNet architecture in the main paper due to the better performance on Two Moons and SIR for low to medium simulation budgets.
Appendix M Metrics
M.1 Negative log probability of true parameters (NLTP)
In simulation-based inference, the average negative log likelihood of true parameters (NLTP) is commonly reported as a performance metric (papamakarios2016; durkan2018; papamakarios2019a; greenberg2019; hermans2019; durkan2020). An attractive property of this metric is that access to the ground-truth posterior is not required.
It is important to point out, however, that calculating this metric on a single pair or a small number of pairs $(\theta_o, x_o)$ is problematic. To illustrate the issue, consider the following example (as discussed in talts2018): a Gaussian prior and likelihood, and a single pair $(\theta_o, x_o)$ with an implausible (but possible) $\theta_o$. In this case, the true posterior is Gaussian with standard deviation 0.5, under which $\theta_o$ has low probability, since it is more than two standard deviations away from the posterior mean. If an algorithm fitted a wrong posterior, e.g., by overestimating the standard deviation as 1 instead of 0.5, the probability of $\theta_o$ under the estimated posterior would be higher than under the true posterior.
Therefore, a large number of pairs should be used. Indeed, in the limit of an infinite number of pairs $(\theta_o, x_o) \sim p(\theta, x)$, the metric converges to an expected Kullback-Leibler divergence plus an entropy term:
\[
-\mathbb{E}_{(\theta_o, x_o) \sim p(\theta, x)}\left[\log q(\theta_o \mid x_o)\right]
= \mathbb{E}_{x_o \sim p(x)}\left[D_{\mathrm{KL}}\!\left(p(\theta \mid x_o)\,\|\,q(\theta \mid x_o)\right)\right]
+ \mathbb{E}_{x_o \sim p(x)}\left[H\!\left(p(\theta \mid x_o)\right)\right],
\]
where $q$ denotes the approximate posterior.
The first term in the final equation is the average Kullback-Leibler divergence between true and approximate posteriors over all observations that can be generated when sampling parameters from the prior. The second term, the entropy term, would be the same for all algorithms compared.
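This decomposition follows from adding and subtracting $\log p(\theta \mid x_o)$ inside the expectation (writing $q$ for the approximate posterior):

```latex
\begin{align*}
-\mathbb{E}_{(\theta_o, x_o) \sim p(\theta, x)}\left[\log q(\theta_o \mid x_o)\right]
&= \mathbb{E}_{x_o \sim p(x)}\,\mathbb{E}_{\theta \sim p(\theta \mid x_o)}
   \left[\log \frac{p(\theta \mid x_o)}{q(\theta \mid x_o)} - \log p(\theta \mid x_o)\right] \\
&= \mathbb{E}_{x_o \sim p(x)}\left[D_{\mathrm{KL}}\!\left(p(\theta \mid x_o)\,\|\,q(\theta \mid x_o)\right)\right]
 + \mathbb{E}_{x_o \sim p(x)}\left[H\!\left(p(\theta \mid x_o)\right)\right].
\end{align*}
```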
In the context of this benchmark, we decided against using the probability of $\theta_o$ as a metric: For all algorithms that are not amortized (all but one), evaluating posteriors at different $x_o$ would require rerunning inference. As the computational requirements for running the benchmark at 10 observations per task are already high, running tasks for hundreds of observations would be prohibitively expensive.
M.2 Simulation-based calibration (SBC)
In simulation-based calibration (SBC), samples are drawn from the data-averaged posterior, i.e., the posterior obtained by running inference for many observations. When the posterior approximation is exact, these samples are distributed according to the prior (talts2018).
Let us briefly illustrate this: In SBC, we draw $\theta \sim p(\theta)$, $x \sim p(x \mid \theta)$, and $\theta' \sim q(\theta' \mid x)$, which implies a joint distribution over $(\theta, x, \theta')$. The marginal of $\theta'$ is then:
\[
q(\theta') = \iint q(\theta' \mid x)\, p(x \mid \theta)\, p(\theta)\, \mathrm{d}\theta\, \mathrm{d}x.
\]
If the approximate posterior is the true posterior, the marginal over $\theta'$ is equal to the prior: if $q(\theta' \mid x) = p(\theta' \mid x)$, then $q(\theta') = \int p(\theta' \mid x)\, p(x)\, \mathrm{d}x = p(\theta')$, i.e., one can set up a consistency test based on the distribution of samples $\theta'$. talts2018 do this by using frequentist tests per dimension.
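The procedure can be sketched for a scalar parameter as follows; the three function arguments are illustrative placeholders for prior, simulator, and approximate posterior:

```python
import numpy as np

def sbc_ranks(prior_sample, simulate, posterior_sample,
              n_runs=200, n_post=100, rng=None):
    """Simulation-based calibration for a scalar parameter: repeatedly draw
    theta from the prior, simulate data, sample from the (approximate)
    posterior, and record the rank of theta among the posterior samples.
    If the posterior were exact, the ranks would be uniform on {0, ..., n_post}."""
    rng = rng if rng is not None else np.random.default_rng(0)
    ranks = []
    for _ in range(n_runs):
        theta = prior_sample(rng)
        x = simulate(theta, rng)
        post = posterior_sample(x, n_post, rng)
        ranks.append(int(np.sum(post < theta)))
    return np.array(ranks)
```

For a 1-d conjugate Gaussian toy model (prior $\mathcal{N}(0, 1)$, likelihood $\mathcal{N}(\theta, 1)$, exact posterior $\mathcal{N}(x/2, 1/2)$), the resulting rank histogram is indeed close to uniform; deviations from uniformity indicate miscalibration.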
Note that SBC as described above is merely a consistency check. For example, if the approximate posterior were the prior, a calibration test as described above would not be able to detect this. This is a realistic failure mode in simulation-based inference: it could happen with rejection ABC in the limit of a very large acceptance threshold, or when learned summary statistics carry no information about $\theta$. One way around this issue is proposed by prangle2014diag, who propose to restrict observations to a subset of all possible $x$.
SBC is similar to the average negative log likelihood of true parameters described above, in that inference needs to be carried out for many observations generated by sampling from the prior. Running inference for hundreds of observations would become prohibitively expensive in terms of compute for most algorithms, which is why we do not rely on SBC in the benchmark.
M.3 Median distance (MEDDIST)
Posterior predictive checks (PPCs) use the posterior predictive distribution to predict new data $\tilde{x}$. The observed data $x_o$ should look plausible under the posterior predictive distribution (gelman2004, chapter 6). A particular PPC, used for example in papamakarios2019a; greenberg2019; durkan2020, is to assess the median L2 distance between posterior predictive samples and $x_o$. The median is used since the mean would be more sensitive to outliers.
In the benchmark, we refer to this metric as median distance (MEDDIST) and drew samples from each posterior predictive distribution to compute it. In contrast to the other metrics considered here, the median distance is computed in the space of data and requires additional simulations (which could be expensive, depending on the simulator). The median distance should be considered a mere check rather than a metric, as it does not necessarily test the structure of the estimated posterior.
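A minimal sketch of the computation (function names illustrative):

```python
import numpy as np

def median_distance(simulate, posterior_samples, x_o, rng):
    """Posterior predictive check: simulate once from each posterior sample
    and report the median L2 distance to the observation x_o. The median
    (rather than the mean) reduces sensitivity to outliers."""
    distances = [np.linalg.norm(simulate(theta, rng) - x_o)
                 for theta in posterior_samples]
    return float(np.median(distances))
```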
M.4 Maximum Mean Discrepancy (MMD)
Maximum Mean Discrepancy (MMD) is an Integral Probability Metric (IPM). Linear- and quadratic-time estimates for using MMD as a two-sample test were derived in gretton2012. MMD has commonly been used in the SBI literature with Gaussian kernels (papamakarios2019a; greenberg2019; hermans2019), setting a single lengthscale hyperparameter via the median heuristic (ramdas2015). We follow the same procedure, i.e., we use Gaussian kernels with the lengthscale determined by the median heuristic on reference samples. MMDs are calculated using 10k samples each from the reference and approximate posteriors.
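A quadratic-time sketch of the estimate, using the (biased) V-statistic and the median heuristic on the reference samples:

```python
import numpy as np

def mmd_gaussian(x, y, lengthscale=None):
    """Quadratic-time (biased, V-statistic) estimate of squared MMD with a
    Gaussian kernel. If no lengthscale is given, it is set via the median
    heuristic on the reference samples x."""
    def sq_dists(a, b):
        return (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T)
    if lengthscale is None:
        d = sq_dists(x, x)
        lengthscale = np.sqrt(np.median(d[d > 0]) / 2.0)
    k = lambda a, b: np.exp(-sq_dists(a, b) / (2.0 * lengthscale**2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
```

For samples from the same distribution the estimate is close to zero (up to the O(1/n) bias of the V-statistic), while shifted samples give a clearly positive value.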
If simple kernels are used to compare distributions with complex, multimodal structure, distinct distributions can be mapped to nearby mean embeddings, resulting in low test power. On SLCP and Two Moons, for example, we found a translation-invariant kernel to be limiting, since it cannot adapt to the local structure (see Suppl. Fig. 4). This is reflected in the low correlation between MMD and C2ST (Suppl. Fig. 5). We emphasize that these issues are strictly related to the simple kernels and hyperparameter choices commonly used in the literature. Posteriors of the Two Moons task have a structure similar to the blobs example of liu2020, who argue for using learned kernels to overcome the aforementioned problem.
M.5 Classifier-based tests (C2ST)
In classifier-based testing, a classifier is trained to distinguish samples of the true posterior from samples of the estimated posterior. If the samples are indistinguishable, classification performance should be at chance level, 0.5. Practical use and properties of classifier-based two-sample testing (C2ST) are discussed in lopezpaz2018 (see gutmann2018likelihood; dalmasso2019a, for examples in the context of SBI).
To compute C2ST, we trained a two-layer neural network with ten times as many ReLU units as the dimensionality of the parameters, optimized with Adam (kingma2014adam). Classifiers were trained on 10k z-scored samples each from the reference and approximate posteriors. Classification accuracy was reported using 5-fold cross-validation.
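The procedure can be sketched with scikit-learn's MLPClassifier (whose default solver is Adam); this is an illustration rather than the benchmark's exact implementation, and details such as z-scoring both sets jointly are simplifications:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def c2st(x, y, seed=0):
    """Classifier two-sample test: accuracy of a small MLP (two hidden
    layers of 10 * dim ReLU units) at telling apart the two sample sets,
    estimated by 5-fold cross-validation on z-scored data. An accuracy of
    0.5 means the samples are indistinguishable."""
    data = np.concatenate([x, y])
    data = (data - data.mean(axis=0)) / (data.std(axis=0) + 1e-12)
    labels = np.concatenate([np.zeros(len(x)), np.ones(len(y))])
    dim = x.shape[1]
    clf = MLPClassifier(hidden_layer_sizes=(10 * dim, 10 * dim),
                        activation="relu", max_iter=300, random_state=seed)
    return cross_val_score(clf, data, labels, cv=5, scoring="accuracy").mean()
```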
M.6 Kernelized Stein Discrepancy (KSD)
Kernelized Stein Discrepancy (KSD) is a one-sample goodness-of-fit test proposed independently by chwialkowski2016 and liu2016. KSD tests samples from algorithms against the gradient of the unnormalized true posterior density. We used KSD with Gaussian kernels, setting the lengthscale through the median heuristic, and 10k samples from each algorithm.
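A V-statistic sketch with a Gaussian kernel and a fixed lengthscale (the benchmark sets the lengthscale via the median heuristic); `score` stands for the gradient of the log unnormalized target density, so no normalizing constant is needed:

```python
import numpy as np

def ksd_gaussian(samples, score, lengthscale=1.0):
    """V-statistic estimate of the (squared) kernelized Stein discrepancy
    with a Gaussian kernel k(x, x') = exp(-|x - x'|^2 / (2 h^2)).
    `score` returns grad log p(x) of the unnormalized target density."""
    n, d = samples.shape
    s = score(samples)                                   # (n, d)
    diff = samples[:, None, :] - samples[None, :, :]     # x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    h2 = lengthscale**2
    k = np.exp(-sq / (2 * h2))
    term1 = (s @ s.T) * k                                # s_i . s_j k
    term2 = np.einsum("id,ijd->ij", s, diff) * k / h2    # s_i . grad_xj k
    term3 = -np.einsum("jd,ijd->ij", s, diff) * k / h2   # s_j . grad_xi k
    term4 = (d / h2 - sq / h2**2) * k                    # trace term
    return np.mean(term1 + term2 + term3 + term4)
```

Samples that match the target give values near zero, while mismatched samples give clearly positive values.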
Appendix P Practical Advice
Based on our current results and understanding, we provide advice for practitioners seeking to apply SBI. Note that our benchmark contains an initial selection out of a wide variety of SBI algorithms; see, e.g., the recent review by cranmer2019 for an overview and additional advice. There is no one-size-fits-all solution: which algorithm to use in practice will depend on the problem at hand. With this advice, we seek to provide a few useful questions to help identify which classes of algorithms may be applicable.
Do we need the Bayesian posterior, or is a point estimate sufficient?
Our focus here was on SBI algorithms that target the Bayesian posterior distribution, i.e., the set of parameters consistent with both the data and the prior distribution. If one only needs a single point estimate, optimization methods might be more efficient.
Is the simulator really a ‘black box’?
The SBI algorithms presented in the benchmark can be applied to any ‘black-box’ simulator (access to likelihoods, access to internal random number generators, and differentiability are not required). If additional information is available, it is worth using algorithms that can leverage it. For example, if the likelihood is available, methods exploiting it (e.g., MCMC, variational inference) will generally be more efficient. Similarly, if one has access to the internal random numbers, probabilistic programming approaches such as inference compilation (le2016inference; baydin2019etalumis; wood2020) might be preferable. If additional quantities that characterize the latent process are available, i.e., the simulator is ‘gray-box’, they can be used to augment training data and improve inference (brehmer2020a; cranmer2019).
What domain knowledge do we have about the problem?
For any practical application of SBI, it is worth thinking carefully about domain knowledge. First, knowledge about plausible parameters should inform the choice of the prior (e.g., positive-valued parameters in the SIR and Lotka-Volterra tasks). Second, domain knowledge can help design the appropriate distance functions required, for example, by classical ABC algorithms. When using model-based approaches, domain knowledge can potentially be built into the SBI algorithm itself, for example, by incorporating neural network layers with appropriate inductive biases, or by hard-coding known invariances into the architecture. Finally, domain knowledge may also guide the design of summary statistics.
Do we have, or can we learn, summary statistics?
Summary statistics are especially important when facing problems with high-dimensional data: the posterior given summary statistics is only equivalent to the posterior given the full data if the summary statistics are sufficient. The problem at hand can guide the manual design of summary statistics that are regarded as particularly important or informative. Alternatively, many automatic approaches to constructing summary features exist (e.g., prangle2014semi; charnock2018; dinev2018), and this is an active area of research. For instance, chen2020neural recently proposed a new approach to learn approximately sufficient statistics for (S)ABC and (S)NLE approaches, and discuss that many existing approaches do not guarantee (global) sufficiency. When using (S)NPE or (S)NRE, high-dimensional data can be reduced as part of the network architecture.
Do we have low-dimensional data and parameters, and a cheap simulator?
If both the parameters and the data (or suitable summary statistics thereof) are low-dimensional, and a very large number of simulations can be generated, model-free algorithms such as classical ABC can be competitive. These have the benefit of adding little computational overhead, especially compared to neural network-based approaches.
Are the data or the parameters higher-dimensional?
For a limited simulation budget and/or higher dimensionalities, approaches that train a model of the likelihood, posterior, or likelihood ratio will generally be preferable to model-free algorithms such as (S)ABC. Overall, we did not find strong differences between these three model-based approaches in terms of estimation accuracy across the current benchmark tasks. For tasks in which the data dimensionality was much higher than that of the parameters, we found algorithms that can automatically learn informative summary statistics of the data, e.g., (S)NPE or (S)NRE, to be more efficient. Conversely, if the data are of lower dimensionality, likelihood-targeting methods (e.g., (S)NLE) might perform better.
Are simulations expensive? Can we simulate online?
For time-intensive and complex simulators, it can be beneficial to use sequential methods to increase sample efficiency: We found that sequential schemes generally outperformed non-sequential ones. While we focused on simple strategies which use the previous estimate of the posterior to propose new parameters, more sophisticated schemes (e.g., gutmann2015; lueckmann2019; jarvenpaa2019) may further increase sample efficiency if only a few simulations can be obtained. For some applications, inference must be performed on a fixed dataset of simulations, in which case one cannot resort to sequential algorithms.
Do we want to do inference once, or repeatedly?
If we want to do SBI separately for many different data points (i.e., compute a posterior for each of many observations), methods that allow ‘amortization’ (NPE) are likely preferable. While NLE and NRE allow amortisation of the neural network, MCMC sampling is still required, which takes additional time. Conversely, if we want to run SBI conditioned on many i.i.d. data points, methods based on likelihood estimation (NLE), likelihood-ratio estimation (NRE), or NPE with exchangeable neural networks (chan2018) would be appropriate.
Appendix R Runtimes
In applications of SBI, simulations are commonly assumed to be the dominant cost. In order to make the benchmark feasible at this scale, we focused on simple simulators and optimized runtimes; e.g., we developed a new package bridging DifferentialEquations.jl (rackauckas2017; bezanson2017julia) and PyTorch (paszke2019) so that generating simulations for all implemented tasks is extremely fast. This differs from many cases in practice, where the runtime cost of an algorithm is often negligible compared to the cost of simulations. Having said that, algorithms show significant differences in runtime, which we measured and report here.
We recorded runtimes for all algorithms on all tasks. In principle, runtimes could be reduced by employing multi-CPU architectures; however, we opted for a single-CPU setup to accurately compare runtimes across all algorithms and tasks. We did not employ GPUs for training neural networks (NNs), because the types of NNs used by the algorithms currently in the benchmark do not benefit much from GPU over CPU training (e.g., no CNN architectures, and rather shallow and narrow networks). In fact, running SNPE on SLCP using a GeForce GTX 1080 showed slightly longer runtimes than on CPU, due to the overhead of copying data back and forth to the device. It was therefore more economical and comparable to run the benchmark on CPUs.
All neural networkbased algorithms were run on single 3.6 GHz CPU cores of AWS C5instances. ABC algorithms were run on single CPU cores of an internal cluster with 2.4 GHz CPUs. We observed a difference in runtimes of less than 100ms when running ABC algorithms on the same hardware as used for neural networkbased algorithms.
Figure 12 shows the recorded runtimes in minutes. We observed short runtimes for REJ-ABC and SMC-ABC, as these do not require NN training or MCMC. The sequential versions of all three NN-based algorithms yielded longer runtimes than the non-sequential versions, because they involve 10 rounds of NN training. Among the sequential algorithms, SNPE showed the longest runtimes. Runtimes with MAFs instead of NSFs tended to be shorter; e.g., the difference between MAFs and NSFs using SNPE on SLCP at 100k simulations was about 50 minutes on average. We also emphasize that the speed of (S)NLE reported here was only obtained after vectorizing MCMC sampling. Without vectorization, runtime on Gaussian Linear for SNLE was more than 36 hours instead of less than 2 hours (see Appendix H).
Appendix T Tasks
T.1 Gaussian Linear
Inference of the mean of a 10-d Gaussian model, in which the covariance is fixed. The (conjugate) prior is Gaussian:
Prior: $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \Sigma_0)$
Simulator: $\mathbf{x} \mid \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, \Sigma)$, with fixed covariance $\Sigma$
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{10}$, $\mathbf{x} \in \mathbb{R}^{10}$
T.2 Gaussian Linear Uniform
Inference of the mean of a 10-d Gaussian model, in which the covariance is fixed. The prior is uniform:
Prior: $\boldsymbol{\theta} \sim \mathcal{U}(\mathbf{a}, \mathbf{b})$ (uniform box)
Simulator: $\mathbf{x} \mid \boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{\theta}, \Sigma)$, with fixed covariance $\Sigma$
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{10}$, $\mathbf{x} \in \mathbb{R}^{10}$
T.3 SLCP
A challenging inference task designed to have a simple likelihood and a complex posterior. The prior is uniform over five parameters and the data are a set of four two-dimensional points sampled from a Gaussian likelihood whose mean and variance are nonlinear functions of $\boldsymbol{\theta}$:
Prior: $\boldsymbol{\theta} \sim \mathcal{U}(\mathbf{a}, \mathbf{b})$ (uniform box over $\mathbb{R}^5$)
Simulator: $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_4)$, $\mathbf{x}_j \mid \boldsymbol{\theta} \sim \mathcal{N}(\mathbf{m}_{\boldsymbol{\theta}}, \mathbf{S}_{\boldsymbol{\theta}})$, with mean and covariance given by nonlinear functions of $\boldsymbol{\theta}$
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{5}$, $\mathbf{x} \in \mathbb{R}^{8}$
T.4 SLCP with Distractors
This task is similar to T.3, with the difference that we add uninformative dimensions (distractors) to the observation:
Prior: as in T.3
Simulator: as in T.3, with additional uninformative (distractor) dimensions appended to $\mathbf{x}$
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{5}$
References: greenberg2019
T.5 Bernoulli GLM
Inference of a 10-parameter generalized linear model (GLM) with Bernoulli observations and a Gaussian prior whose covariance matrix encourages smoothness by penalizing second-order differences in the vector of parameters (DeNicolao1997). The observations are the sufficient statistics for this GLM:
Prior: $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, \Sigma)$, with $\Sigma$ penalizing second-order differences in $\boldsymbol{\theta}$
Simulator: Bernoulli GLM; the observation consists of the sufficient statistics
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{10}$
Fixed parameters: duration of the task
References: lueckmann2017; gonccalves2019training
T.6 Bernoulli GLM Raw
This task is similar to T.5, the sole difference being that the observations are not the sufficient statistics for the Bernoulli GLM process but the raw observations:
Prior: as in T.5
Simulator: as in T.5, returning the raw Bernoulli observations
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{10}$
Fixed parameters: duration of the task
T.7 Gaussian Mixture
This task is common in the ABC literature. It consists of inferring the common mean of a mixture of two two-dimensional Gaussian distributions, one with much broader covariance than the other:
Prior: $\boldsymbol{\theta} \sim \mathcal{U}(\mathbf{a}, \mathbf{b})$ (uniform box)
Simulator: $\mathbf{x} \mid \boldsymbol{\theta} \sim \tfrac{1}{2}\,\mathcal{N}(\boldsymbol{\theta}, \Sigma_1) + \tfrac{1}{2}\,\mathcal{N}(\boldsymbol{\theta}, \Sigma_2)$, with $\Sigma_1$ much broader than $\Sigma_2$
Dimensionality: $\boldsymbol{\theta} \in \mathbb{R}^{2}$, $\mathbf{x} \in \mathbb{R}^{2}$
References: sisson2007; beaumont2009; toni2009; simola2020
T.8 Two Moons
A two-dimensional task with a posterior that exhibits both global (bimodality) and local (crescent shape) structure to illustrate how algorithms deal with multimodality:
Prior: $\boldsymbol{\theta} \sim \mathcal{U}(\mathbf{a}, \mathbf{b})$ (uniform box)
Simulator 
