ABC-CDE: Towards Approximate Bayesian Computation with Complex High-Dimensional Data and Limited Simulations
Abstract
Approximate Bayesian Computation (ABC) is typically used when the likelihood is either unavailable or intractable but where data can be simulated under different parameter settings using a forward model. Despite the recent interest in ABC, high-dimensional data and costly simulations still remain a bottleneck in some applications. There is also no consensus as to how to best assess the performance of such methods without knowing the true posterior. We show how a nonparametric conditional density estimation (CDE) framework, which we refer to as ABC-CDE, helps address three nontrivial challenges in ABC: (i) how to efficiently estimate the posterior distribution with limited simulations and different types of data, (ii) how to tune and compare the performance of ABC and related methods in estimating the posterior itself, rather than just certain properties of the density, and (iii) how to efficiently choose among a large set of summary statistics based on a CDE surrogate loss. We provide theoretical and empirical evidence that justify ABC-CDE procedures that directly estimate and assess the posterior based on an initial ABC sample, and we describe settings where standard ABC and regression-based approaches are inadequate.
Key Words: nonparametric methods, conditional density estimation, approximate Bayesian computation, likelihood-free inference
1 Introduction
For many statistical inference problems in the sciences, the relationship between the parameters of interest and observable data is complicated, but it is possible to simulate realistic data according to some model; see Beaumont (2010); Estoup et al. (2012) for examples in genetics, and Cameron and Pettitt (2012); Weyant et al. (2013) for examples in astronomy. In such situations, the complexity of the data generation process often prevents the derivation of a sufficiently accurate analytical form for the likelihood function. One cannot use standard Bayesian tools as no analytical form for the posterior distribution is available. Nevertheless, one can estimate f(θ | x_obs), the posterior distribution of the parameters θ given the observed data x_obs, by taking advantage of the fact that it is possible to forward simulate data under different settings of the parameters θ. Problems of this type have motivated recent interest in methods of likelihood-free inference, which include methods of Approximate Bayesian Computation (ABC; Marin et al. 2012).
Despite the recent surge of approximate Bayesian methods, several challenges still remain. In this work, we present a conditional density estimation (CDE) framework and a surrogate loss function for CDE that address the following three problems:

how to efficiently estimate the posterior density f(θ | x_obs), where x_obs is the observed sample; in particular, in settings with complex, high-dimensional data and costly simulations,

how to choose tuning parameters and compare the performance of ABC and related methods based on simulations and observed data only; that is, without knowing the true posterior distribution, and

how to best choose summary statistics for ABC and related methods when given a very large number of candidate summary statistics.
Existing Methodology. There is an extensive literature on ABC methods; we refer the reader to Marin et al. (2012); Prangle et al. (2014) and references therein for a review. The connection between ABC and CDE has been noted by others; in fact, ABC itself can be viewed as a hybrid between nearest-neighbors and kernel density estimators (Blum, 2010; Biau et al., 2015). As Biau et al. point out, the fundamental problem from a practical perspective is how to select the parameters in ABC methods in the absence of a priori information regarding the posterior. Nearest-neighbors and kernel density estimators are also known to perform poorly in settings with a large number of summary statistics (Blum, 2010), and they are difficult to adapt to different data types (e.g., mixed discrete-continuous statistics and functional data). Few works attempt to use other CDE methods to estimate posterior distributions. At the time of submission of this paper, the only works in this direction are Papamakarios and Murray (2016) and Lueckmann et al. (2017), which are based on conditional neural density estimation, Fan et al. (2013) and Li et al. (2015), which use a mixture of Gaussian copulas to estimate the likelihood function, and Raynal et al. (2017), which suggests random forests for quantile estimation.
Although the above-mentioned methods utilize specific CDE models to estimate posterior distributions, they do not fully explore other advantages of a CDE framework, such as in methods assessment, in variable selection, and in tuning the final estimates with CDE as a goal (see Sections 2.2, 2.3 and 4).
Summary statistics selection is indeed a nontrivial challenge in likelihood-free inference: ABC methods depend on the choice of statistics and distance function of observables when comparing observed and simulated data, and using the "wrong" summary statistics can dramatically affect their performance. For a general review of dimension reduction methods for ABC, we refer the reader to Blum et al. (2013), who classify current approaches into three classes: (a) best subset selection approaches, (b) projection techniques, and (c) regularization techniques. Many of these approaches either face significant computational issues or attempt to find good summary statistics for certain characteristics of the posterior rather than the entire posterior itself.
For instance, Creel and Kristensen (2016) propose a best subset selection of summary statistics based on improving the estimate of the posterior mean. There are, however, no guarantees that statistics that lead to good estimates of the posterior mean will be sufficient for, or even yield reasonable estimates of, the full posterior f(θ | x_obs).
Moreover, the current literature on likelihood-free inference lacks methods that allow one to directly compare the performance of different posterior distribution estimators. Given a collection of estimates (obtained by, e.g., ABC methods with different tolerance levels, sampling techniques, and so on), an open problem is how to select the estimate that is closest to the true posterior density for the observed data x_obs. Some goodness-of-fit techniques have been proposed (for example, Prangle et al. 2014 compute the goodness of fit based on coverage properties), but although diagnostic tests are useful, they do not capture all aspects of the density estimates. Some density estimates that are not close to the true density can pass all tests (Breiman, 2001; Bickel et al., 2006), and the situation is even worse in conditional density estimation. Figure 1 shows a toy example where both probability-probability (P-P) and coverage plots wrongly indicate an excellent fit.
In this paper, we show how one can improve ABC methods with a novel CDE surrogate loss function (Eq. 3) that measures how well one estimates the entire posterior distribution f(θ | x_obs); see Section 2.2 for a discussion of its theoretical properties. Our proposed method, ABC-CDE, starts with a rough approximation from an ABC sampler and then directly estimates the conditional density exactly at the point x = x_obs using a nonparametric conditional density estimator. Unlike other ABC post-adjustment techniques in the literature (e.g., Beaumont et al. (2002) and Blum and François (2010)), our method is optimized for estimating posteriors, and corrects for changes in the ABC posterior sample beyond the posterior mean and variance. We also present a general framework (based on CDE) that can handle different types of data (including functional data, mixed variables, structured data, and so on) as well as a larger number of summary statistics. With, for example, FlexCode (Izbicki and Lee, 2017) one can convert any existing regression estimator to a conditional density estimator. Recent neural mixture density networks that directly estimate posteriors for complex data (Papamakarios and Murray, 2016; Lueckmann et al., 2017) also fit into this ABC-CDE framework, and are especially promising for image data. Hence, with ABC-CDE, we take a different approach to the curse of dimensionality than traditional likelihood-free inference methods. In standard ABC, it is essential to choose (a smaller set of) informative summary statistics to properly measure user-specified distances between observed and simulated data. The main dimension reduction in our framework is implicit in the conditional density estimation and our CDE loss function.
Depending on the choice of estimator, we can adapt to different types of sparse structure in the data and, just as in high-dimensional regression, handle a large number of covariates, even without relying on a prior dimension reduction of the data or a user-specified distance function of observables.
Finally, we note that ABC summary statistic selection and goodness-of-fit techniques are typically designed to estimate posterior distributions accurately for every sample x. In reality, we often only care about estimates for the particular sample x_obs that is observed, and even if a method produces poor estimates for some x, it can still produce good estimates for x_obs. The methods we introduce in this paper take this into consideration, and directly aim at constructing, evaluating and tuning estimators of the posterior density at the observed value x = x_obs.
The organization of the paper is as follows: Section 2 describes and presents theoretical results for how a CDE framework and a surrogate loss function address issues (i)–(iii). Section 3 includes simulated experiments that demonstrate that our proposed methods work in practice. In Section 4, we revisit CDE in the context of ABC and demonstrate how direct estimation of posteriors with CDE and the surrogate loss can replace further iterations with standard ABC. We then end by providing links to general-purpose CDE software that can be used in likelihood-free inference in different settings. We refer the reader to the Appendix for proofs, a comparison to post-processing regression adjustment methods, and two applications in astronomy.
2 Methods
In this section we propose a CDE framework for (i) estimating the posterior density f(θ | x_obs) (Section 2.1), (ii) comparing the performance of ABC and related methods (Section 2.2), and (iii) choosing optimal summary statistics (Section 2.3).
2.1 Estimating the Posterior Density via CDE
Given a prior distribution π(θ) and a likelihood function f(x | θ), our goal is to compute the posterior distribution f(θ | x_obs), where x_obs is the observed sample. We assume we know how to sample from f(x | θ) for a fixed value of θ.
A naive way of estimating f(θ | x_obs) via CDE methods is to first generate an i.i.d. sample of pairs (θ, x) by sampling θ from the prior and then x from the forward model for each pair. One applies the CDE method of choice to this sample, and then simply evaluates the estimated density at x = x_obs. The naive approach, however, may lead to poor results because some of the simulated x are far from the observed data x_obs. To put it differently, standard conditional density estimators are designed to estimate f(θ | x) well for every x, but in ABC applications we are only interested in x = x_obs.
To solve this issue, one can instead estimate f(θ | x_obs) using a training set that only consists of sample points close to x_obs. This training set is created by a simple ABC rejection sampling algorithm. More precisely, for a fixed distance function d (which could be based on summary statistics) and tolerance level ε, we construct a training sample according to Algorithm 1.
To this new training set, we then apply our conditional density estimator, and finally evaluate the estimate at x = x_obs. This procedure can be regarded as an ABC post-processing technique (Marin et al., 2012): the first (ABC) approximation to the posterior is obtained via the accepted sample, which can be seen as a sample from the joint distribution of (θ, x) restricted to the neighborhood d(x, x_obs) ≤ ε. That is, the standard ABC rejection sampler is implicitly performing conditional density estimation using an i.i.d. sample from the joint distribution of the data and the parameter. We take the results of the ABC sampler and estimate the conditional density exactly at the point x = x_obs using other forms of conditional density estimation. If done correctly, the idea is that we can improve upon the original ABC approximation even without, as is currently the norm, simulating new data or decreasing the tolerance level ε.
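To make the rejection step concrete, the following is a minimal sketch of the sampler behind Algorithm 1; the toy Gaussian-mean model, the prior scale, and all function names (`abc_rejection_sample`, and so on) are illustrative choices of ours, not part of the paper.

```python
import numpy as np

def abc_rejection_sample(prior_sample, forward_sim, distance, x_obs, eps, n_keep, rng):
    """Algorithm 1 sketch: draw theta from the prior, simulate data from the
    forward model, and keep the pair only if the simulated data fall within
    distance eps of the observed data."""
    kept = []
    while len(kept) < n_keep:
        theta = prior_sample(rng)
        x = forward_sim(theta, rng)
        if distance(x, x_obs) <= eps:
            kept.append((theta, x))
    return kept

# Toy usage: posterior of a Gaussian mean with known variance.
rng = np.random.default_rng(0)
x_obs = rng.normal(0.0, 1.0, size=20)
pairs = abc_rejection_sample(
    prior_sample=lambda r: r.normal(0.0, 2.0),             # theta ~ N(0, 4)
    forward_sim=lambda th, r: r.normal(th, 1.0, size=20),  # x | theta ~ N(theta, 1)
    distance=lambda x, y: abs(x.mean() - y.mean()),        # distance on a summary statistic
    x_obs=x_obs, eps=0.3, n_keep=200, rng=rng,
)
```

The accepted `pairs` then play the role of the training set to which a conditional density estimator is applied.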
Remark 1.
For simplicity, we focus on standard ABC rejection sampling, but one can use other ABC methods, such as sequential ABC (Sisson et al., 2007) or population Monte Carlo ABC (Beaumont et al., 2009), to construct the training sample. The data x can either be the original data vector, or a vector of summary statistics. We revisit the issue of summary statistic selection in Section 2.3.
Next, we review FlexCode (Izbicki and Lee, 2017), which we currently use as a general-purpose methodology for estimating f(θ | x_obs). However, many aspects of the paper (such as the novel approach to method selection without knowledge of the true posterior) hold for other CDE and ABC methods as well. In Section 4, for example, we use our surrogate loss to choose the tuning parameters of a nearest-neighbors kernel density estimator (Equation 14), which includes ABC as a special case.
FlexCode as a "Plug-In" CDE Method. For simplicity, assume that we are interested in estimating the posterior distribution of a single parameter θ, even if there are several parameters in the problem.
The key idea of FlexCode is to notice that, if ∫ f(θ | x)² dθ < ∞ for every x, then it is possible to expand f(θ | x) = Σ_{j ≥ 1} β_j(x) φ_j(θ) in an orthonormal basis (φ_j)_j, where the expansion coefficients are given by

β_j(x) = E[ φ_j(θ) | x ] = ∫ φ_j(θ) f(θ | x) dθ.    (1)
That is, each β_j(x) is a regression function. The FlexCode estimator is defined as f̂(θ | x) = Σ_{j=1}^{J} β̂_j(x) φ_j(θ), where the β̂_j are regression estimates. The cutoff J in the series expansion is a tuning parameter that controls the bias/variance tradeoff in the final density estimate, and which we choose via data splitting (Section 2.2).
With FlexCode, the problem of high-dimensional conditional density estimation boils down to choosing appropriate methods for estimating the regression functions β_j(x). The key advantage of FlexCode is that it offers more flexible CDE methods: By taking advantage of existing regression methods, which can be "plugged into" the CDE estimator, we can adapt to the intrinsic structure of high-dimensional data (e.g., manifolds, irrelevant covariates, and different relationships between the covariates and the response), as well as handle different data types (e.g., mixed data and functional data) and massive data sets (by using, e.g., xgboost (Chen and Guestrin, 2016)). See Izbicki and Lee (2017) and the upcoming LSST-DESC photo-z DC1 paper for examples. An implementation of FlexCode that allows for wavelet bases can be found at https://github.com/rizbicki/FlexCoDE (R; R Core Team 2013) and https://github.com/tpospisi/flexcode (Python).
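As a rough illustration of the plug-in idea, the sketch below estimates the expansion coefficients of Equation 1 with a plain k-nearest-neighbor regression and sums the truncated cosine series; the basis choice, the truncation level, and the toy data are our assumptions, and a real analysis would use the FlexCoDE packages linked above.

```python
import numpy as np

def cosine_basis(theta, n_basis):
    """Orthonormal cosine basis on [0, 1]: phi_1 = 1, phi_j = sqrt(2) cos(pi (j-1) theta)."""
    cols = [np.ones_like(theta)]
    for j in range(1, n_basis):
        cols.append(np.sqrt(2.0) * np.cos(np.pi * j * theta))
    return np.stack(cols, axis=1)

def flexcode_nn(x_train, theta_train, x_eval, theta_grid, n_basis=15, k=100):
    """FlexCode sketch: each coefficient beta_j(x) = E[phi_j(theta) | x] is a
    regression function; estimate all of them at x_eval with a k-nearest-neighbor
    average, then evaluate the truncated series on a grid of theta values.
    Assumes theta has been rescaled to [0, 1]."""
    nn = np.argsort(np.linalg.norm(x_train - x_eval, axis=1))[:k]
    beta_hat = cosine_basis(theta_train[nn], n_basis).mean(axis=0)
    f_hat = cosine_basis(theta_grid, n_basis) @ beta_hat
    return np.clip(f_hat, 0.0, None)  # crude fix-up: truncate negative values

# Toy usage: theta ~ Uniform(0, 1), x = theta + noise; posterior at x_obs = 0.5.
rng = np.random.default_rng(1)
theta = rng.uniform(0.0, 1.0, size=5000)
x = (theta + rng.normal(0.0, 0.05, size=5000)).reshape(-1, 1)
grid = np.linspace(0.0, 1.0, 200)
f_hat = flexcode_nn(x, theta, np.array([0.5]), grid)
```

Swapping the nearest-neighbor average for any other regression method (random forests, sparse additive models, boosting) is what makes the estimator "plug-in".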
2.2 Method Selection: Comparing Different Estimators of the Posterior
Definition of a Surrogate Loss. Ultimately, we need to be able to decide which approach is best for approximating f(θ | x_obs) without knowledge of the true posterior. Ideally we would like to find an estimator f̂ such that the integrated squared-error (ISE) loss

L(f̂, f) = ∫ ( f̂(θ | x_obs) − f(θ | x_obs) )² dθ    (2)

is small. Unfortunately, one cannot compute L(f̂, f) without knowing the true f(θ | x_obs), which is why method selection is so hard in practice. To overcome this issue, we propose the surrogate loss function

L^ε(f̂, f) = ∫∫ ( f̂(θ | x) − f(θ | x) )² dθ [ 1( d(x, x_obs) ≤ ε ) f(x) / ∫ 1( d(x′, x_obs) ≤ ε ) f(x′) dx′ ] dx,    (3)

which enforces a close fit in an ε-neighborhood of x_obs. Here, the denominator is simply a constant that makes the weight a proper density in x.
The advantage of the above definition is that we can directly estimate L^ε from the ABC posterior sample. Indeed, L^ε(f̂, f) can be written as

L^ε(f̂, f) = E[ ∫ f̂(θ | X^ε)² dθ ] − 2 E[ f̂(θ^ε | X^ε) ] + C,    (4)

where (θ^ε, X^ε) is a random vector with the distribution induced by a sample generated according to the ABC rejection procedure in Algorithm 1, and C is a constant that does not depend on the estimator f̂. It follows that, given an independent validation or test sample of size n′ from the ABC algorithm, (θ′_1, x′_1), …, (θ′_{n′}, x′_{n′}), we can estimate L^ε (up to the constant C) via

L̂^ε(f̂) = (1/n′) Σ_{i=1}^{n′} ∫ f̂(θ | x′_i)² dθ − (2/n′) Σ_{i=1}^{n′} f̂(θ′_i | x′_i).    (5)
When given a set of estimators f̂_1, …, f̂_m, we select the method with the smallest estimated surrogate loss.
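A small sketch of the estimated surrogate loss of Equation 5, with the integral replaced by a Riemann sum on a grid; the toy comparison (two fixed Gaussian candidate densities, one matching the sampling distribution of the held-out parameter values) is our own illustration, not an experiment from the paper.

```python
import numpy as np

def estimated_surrogate_loss(f_hat, theta_val, x_val, theta_grid):
    """Eq. 5 sketch: (1/n') sum_i  integral of f_hat(theta | x_i)^2 dtheta
                   - (2/n') sum_i  f_hat(theta_i | x_i),
    computed on a held-out ABC sample; the integral uses a Riemann sum.
    f_hat(theta_values, x) must return density values at theta_values."""
    dt = theta_grid[1] - theta_grid[0]
    sq = np.mean([dt * np.sum(f_hat(theta_grid, x) ** 2) for x in x_val])
    fit = np.mean([f_hat(np.array([t]), x)[0] for t, x in zip(theta_val, x_val)])
    return sq - 2.0 * fit

# Toy comparison: held-out theta's drawn from N(0, 1); x is unused by these
# stand-in estimators, which ignore the conditioning variable.
rng = np.random.default_rng(2)
theta_val = rng.normal(0.0, 1.0, size=500)
x_val = [None] * 500
grid = np.linspace(-6.0, 6.0, 600)
gauss = lambda m: (lambda th, x: np.exp(-0.5 * (th - m) ** 2) / np.sqrt(2 * np.pi))
loss_good = estimated_surrogate_loss(gauss(0.0), theta_val, x_val, grid)
loss_bad = estimated_surrogate_loss(gauss(2.0), theta_val, x_val, grid)
```

The candidate matching the distribution of the held-out sample attains the smaller estimated loss, which is exactly the comparison used for method selection.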
Example 1 (Model selection based on CDE surrogate loss versus regression MSE loss).
Suppose we wish to estimate the posterior distribution of the mean of a Gaussian distribution with variance one. The left plot of Figure 2 shows the performance of a nearest-neighbors kernel density estimator (Equation 14) with the kernel bandwidth and the number of nearest neighbors chosen via (i) the estimated surrogate loss of Equation 5 versus (ii) a standard regression mean-squared-error loss.
Properties of the Surrogate Loss. Next we investigate the conditions under which the estimated surrogate loss is close to the true loss; the proofs can be found in the Appendix. The following theorem states that, if f(θ | x) is a smooth function of x, then the (exact) surrogate loss L^ε is close to the true loss L for small values of ε.
Theorem 1.
Assume that, for every θ, f(θ | x) satisfies a Hölder condition of order β in x. Then, for every estimator f̂, L^ε(f̂, f) = L(f̂, f) + O(ε^β).
The next theorem shows that the estimator in Equation 5 does indeed converge to the loss L^ε.
Under some additional conditions, it is also possible to guarantee not only that the estimated surrogate loss is close to the true loss, but that the result holds uniformly over a finite class of estimators of the posterior distribution. This is formally stated in the following theorem.
Theorem 3.
Let F = {f̂_1, …, f̂_m} be a finite set of estimators of f(θ | x_obs).
Assume that there exists a constant M < ∞ that uniformly bounds every f̂ ∈ F as well as the true density f. Then, with high probability, the estimated surrogate loss is uniformly close to the true loss over F.
The next corollary shows that the procedure we propose in this section, with high probability, picks an estimate of the posterior density whose true loss is close to the true loss of the best method in the class.
Corollary 1.
Let f̂* be the best estimator in the class according to the estimated surrogate loss, and let f̂** be the best estimator according to the true loss. Then, under the assumptions from Theorem 6, with probability at least 1 − α, the true loss of f̂* exceeds that of f̂** by only a vanishing margin.
2.3 Summary Statistics Selection
In a typical ABC setting, there are a large number of available summary statistics. Standard ABC fails if all of them are used simultaneously, especially if some statistics carry little information about the parameters of the model (Blum, 2010).
One can use ABC-CDE as a way of either (i) directly estimating f(θ | x_obs) when there are a large number of summary statistics, or (ii) selecting a smaller set of relevant summary statistics to be used by other ABC and likelihood-free methods. There are two versions of ABC-CDE that are particularly useful for variable selection: FlexCode-RF, which estimates the regression functions in Equation 1 with random forests, and FlexCode-SAM, which estimates them with sparse additive models.
More generally, one can use FlexCode to compute an importance measure for summary statistics (to be used in procedures other than FlexCode). It turns out that one can infer the relevance of the k-th summary statistic in posterior estimation from its relevance in estimating the first J regression functions in FlexCode, even if we do not use FlexCode for estimating the posterior. More precisely, assume that x = (x_1, …, x_d) is a vector of summary statistics; one can define both the relevance of variable x_k to the posterior distribution and its relevance to the regressions in Equation 1.
Under some smoothness assumptions with respect to , the two metrics are related.
Assumption 1 (Smoothness in the x_k direction).
For every θ, we assume that, as a function of x_k, f(θ | x) belongs to a Sobolev space of a given order and radius.
Proposition 1.
Under Assumption 2,
Now let u_{j,k} denote a measure of importance of the k-th summary statistic in estimating the regression β_j (Equation 1). For instance, for FlexCode-RF, u_{j,k} may represent the mean decrease in the mean squared error (Hastie et al., 2001); for FlexCode-SAM, u_{j,k} may be the value of the indicator of whether the k-th summary statistic is selected when estimating β_j. Motivated by Proposition 2, we define an importance measure for the k-th summary statistic in posterior estimation according to
u_k = Σ_{j=1}^{J} u_{j,k}.    (6)
We can use these values to select variables for estimating f(θ | x_obs) via other ABC methods. For example, one approach is to choose all summary statistics x_k such that u_k ≥ γ, where the threshold value γ is defined by the user. We will further explore this approach in Section 3.
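The aggregation and thresholding step can be sketched in a few lines; summing the per-regression importances over the coefficients is our reading of Equation 6 (the paper's exact weighting may differ), and the toy numbers are invented for illustration.

```python
import numpy as np

def posterior_importance(u, threshold):
    """Aggregate per-regression importances into one score per summary statistic,
    in the spirit of Eq. 6. u[j, k] is the importance of statistic k for the
    j-th regression in Eq. 1; statistics scoring above the user-chosen
    threshold are kept."""
    scores = u.sum(axis=0)
    keep = np.flatnonzero(scores > threshold)
    return scores, keep

# Toy example: three statistics, four regressions; statistic 2 is pure noise.
u = np.array([[0.5, 0.3, 0.0],
              [0.4, 0.1, 0.0],
              [0.2, 0.6, 0.0],
              [0.1, 0.2, 0.0]])
scores, kept = posterior_importance(u, threshold=0.05)
# scores -> [1.2, 1.2, 0.0]; the noise statistic is dropped
```

The surviving statistics can then be handed to any ABC method, not only to FlexCode.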
In summary, our procedure has two main advantages compared to current state-of-the-art approaches for selecting summary statistics in ABC: (i) it chooses statistics that lead to good estimates of the entire posterior distribution rather than surrogates such as the regression or posterior mean (Aeschbacher et al., 2012; Creel and Kristensen, 2016; Faisal et al., 2016), and (ii) it is typically faster than most other approaches; in particular, it is significantly faster than best subset selection, whose cost grows exponentially with the number of candidate statistics, whereas FlexCode-RF and FlexCode-SAM scale far more gracefully.
3 Experiments
3.1 Examples with Known Posteriors
We start by analyzing examples with well-known and analytically computable posterior distributions:

1. Mean of a Gaussian with known variance. X_i | θ ~ N(θ, 1), with prior θ ~ N(0, τ²). We repeat the experiments for the prior variance τ² in an equally spaced grid with ten values between 0.5 and 100.

2. Precision of a Gaussian with unknown precision. X_i | θ ~ N(μ, 1/θ), with prior θ ~ Gamma(α, β). We fix μ, and repeat the experiments choosing α and β such that the prior mean of θ is held constant and the prior variance is in an equally spaced grid with ten values between 0.1 and 5.
In the Appendix we also investigate a third setting, “Mean of a Gaussian with unknown precision”, with results similar to those shown here in the main manuscript.
In all examples here, observed data x_obs are drawn from the model described above. We run each experiment 200 times, that is, with 200 different values of x_obs. The training set, which is used to build conditional density estimators, is constructed according to Algorithm 1 with a tolerance level that corresponds to an acceptance rate of 1%. For the distance function d, we choose the Euclidean distance between minimal sufficient statistics normalized to have mean zero and variance 1; these statistics are the sample mean for Scenario 1 and the sample variance for Scenario 2. We use a Fourier basis for all FlexCode experiments in the paper, but wavelets lead to similar results.
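As an illustration of this setup, the sketch below builds a Setting 1 training set with the tolerance chosen as the 1% quantile of the distances on the normalized sufficient statistic; the prior scale and the simulation sizes are illustrative values of ours, not the ones used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(4)
n_sim, n_obs = 10_000, 20
x_obs = rng.normal(0.0, 1.0, size=n_obs)

# Setting 1 sketch: theta ~ N(0, tau^2) prior on the mean, x | theta ~ N(theta, 1).
tau = 5.0
theta = rng.normal(0.0, tau, size=n_sim)
x = rng.normal(theta[:, None], 1.0, size=(n_sim, n_obs))

# Distance on the normalized minimal sufficient statistic (the sample mean).
s = x.mean(axis=1)
s_obs = (x_obs.mean() - s.mean()) / s.std()
d = np.abs((s - s.mean()) / s.std() - s_obs)

# Tolerance chosen so that the acceptance rate is 1%.
eps = np.quantile(d, 0.01)
accepted = d <= eps
train_theta, train_x = theta[accepted], x[accepted]
```

The pairs `(train_theta, train_x)` then serve as the input to the conditional density estimators compared below.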
We compare the following methods (see Section 4 and the appendix for a comparison between ABC, regression adjustment methods and ABCCDE with a standard kernel density estimator):

ABC: rejection ABC method with the minimal sufficient statistics (that is, we apply a kernel density estimator to the θ-coordinate of the accepted ABC sample, with bandwidth chosen via cross-validation),

FlexCode_RawNN: FlexCode estimator with Nearest Neighbors regression,

FlexCode_RawSeries: FlexCode estimator with Spectral Series regression (Lee and Izbicki, 2016), and

FlexCode_RawRF: FlexCode estimator with Random Forest regression.
The three FlexCode estimators (denoted by "Raw") are directly applied to the sorted values of the original covariates x. That is, we do not use minimal sufficient statistics or other summary statistics. To assess the performance of each method, we compute the true loss (Equation 2) for each experiment. In addition, we estimate the surrogate loss according to Equation 5 using a new validation sample from Algorithm 1.
CDE and Method Selection
In this section, we investigate whether various ABCCDE methods improve upon standard ABC for the settings described above. We also evaluate the method selection approach in Section 2.2 by comparing decisions based on estimated surrogate losses to those made if one knew the true ISE losses.
Figure 3, left, shows how well the methods actually estimate the posterior density for Settings 1-2. Panels (a) and (e) list the proportion of times each method returns the best results (according to the true loss from Equation 2). Generally speaking, the larger the prior variance, the better ABC-CDE methods perform compared to ABC. In particular, while for small variances ABC tends to be better, for large prior variances, FlexCode with Nearest Neighbors regression tends to give the best results. FlexCode with expansion coefficients estimated via Spectral Series regression is also very competitive. Panels (c) and (g) confirm these results; here we see the average true loss of each method along with standard errors.
Figure 3, right, summarizes the performance of our method selection algorithm. Panels (b) and (f) list the proportion of times the method chosen by the true loss (Equation 2) matches the method chosen via the estimated loss (Equation 5) in all pairwise comparisons; that is, the plot tells us how often the method selection procedure proposed in Section 2.2 actually works. We present two variations of the algorithm: in the first version (see triangles), we include all the data; in the second version (see circles), we remove cases where the confidence interval for the loss difference contains zero (i.e., cases where we cannot tell which of the two methods performs better). The baseline shows what one would expect if the method selection algorithm were totally random. The plots indicate that we, in all settings, roughly arrive at the same conclusions with the estimated surrogate loss as we would if we knew the true loss.
For the sake of illustration, we have also added panels (d) and (h), which show scatterplots of the differences between true losses versus the differences between estimated losses for ABC and FlexCode_RawNN in the two settings, respectively. The fact that most samples are either in the first or third quadrant further confirms that the estimated surrogate loss is in agreement with the true loss in terms of which method best estimates the posterior density.
Summary Statistic Selection
In this section we investigate the performance of FlexCode-RF for summary statistics selection (Section 2.3). For this purpose, the following summary statistics were used:

Mean: average of the data points;

Median: median of the data points;

Mean 1: average of the first half of the data points;

Mean 2: average of the second half of the data points;

SD: standard deviation of the data points;

IQR: interquartile range of the data points;

Quartile 1: first quartile of the data points;

Noise: independent random variables that carry no information about the parameters, that is, pure random noise.
Figure 4 summarizes the results of fitting FlexCode-RF and ABC to these summary statistics for the different scenarios. Panels (a) and (c) show the true loss as we increase the number of statistics. More precisely, the values at 1 represent the true loss of ABC (left) and FlexCode-RF (right) when using only the mean (i.e., the first statistic); the points at 2 indicate the true loss of the estimates using only the mean and the median (i.e., the first and second statistics), and so on. We note that FlexCode-RF is robust to irrelevant summary statistics: the method virtually behaves as if they were not present. This is in sharp contrast with standard ABC, whose performance deteriorates quickly with added noise or nuisance statistics.
Furthermore, panels (b) and (d) show the average importance of each statistic, defined according to Equation 6, where the regression importance is the mean decrease in the Gini index. These plots reveal that FlexCode-RF typically assigns a high score to sufficient summary statistics or to statistics that are highly correlated with sufficient statistics. For instance, in panel (b) (estimation of the mean of the distribution), measures of location are assigned a higher importance score, whereas measures of dispersion are assigned a higher score in panel (d) (estimation of the precision of the distribution). In all examples, FlexCode-RF assigns zero importance to random noise statistics. We conclude that our method for summary statistic selection indeed identifies relevant statistics for estimating the posterior well.
4 Approximate Bayesian Computation and Beyond
In this section, we show how one can use our surrogate loss to choose the tuning parameters in standard ABC with a nearest-neighbors kernel smoother.
4.1 ABC with Fewer Simulations
As noted by Blum (2010) and Biau et al. (2015), ABC is equivalent to a kernel CDE. More specifically, it can be seen as a "nearest-neighbors" kernel CDE (NN-KCDE) defined by
f̂(θ | x_obs) = (1/k) Σ_{i=1}^{k} K_h( θ − θ_{σ_i(x_obs)} ),    (7)
where σ_i(x_obs) represents the index of the i-th nearest neighbor to the target point x_obs in covariate space; that is, we compute the conditional density of θ at x_obs by applying a kernel smoother K with bandwidth h to the k points closest to x_obs.
For a given set of generated data, the above is equivalent to selecting the ABC threshold ε as the distance to the k-th nearest neighbor among the observed distances. This is commonly used in practice as it is more convenient than determining ε a priori. However, as pointed out by Biau et al. (2015) (Section 4, Remark 1), there is currently no good methodology to select both k and h in an ABC k-nearest-neighbor estimate.
Given the connection between ABC and NN-KCDE, we propose to use our surrogate loss to tune the estimator, selecting k and h such that they minimize the estimated surrogate loss in Equation 5. In this sense, we are selecting the "optimal" ABC parameters after generating some of the data: having generated 10,000 points, it may turn out that we would have preferred a smaller tolerance level and that using only the closest 1,000 points would better approximate the posterior.
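The tuning loop can be sketched as follows: fit NN-KCDE (Eq. 7) for each candidate pair (k, h) and keep the pair that minimizes the estimated surrogate loss (Eq. 5) on a held-out sample. The Gaussian kernel, the candidate grids, and the toy joint sample standing in for actual ABC output are our assumptions.

```python
import numpy as np

def nn_kcde(theta_train, x_train, x_eval, theta_grid, k, h):
    """NN-KCDE sketch (Eq. 7): Gaussian kernel smoother with bandwidth h applied
    to the theta-values of the k training points closest to x_eval."""
    nn = np.argsort(np.linalg.norm(x_train - x_eval, axis=1))[:k]
    z = (theta_grid[:, None] - theta_train[nn][None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (k * h * np.sqrt(2.0 * np.pi))

def tune_nn_kcde(theta_tr, x_tr, theta_val, x_val, theta_grid, ks, hs):
    """Pick (k, h) minimizing the estimated surrogate loss (Eq. 5) on a
    held-out sample; the candidate grids are user choices."""
    dt = theta_grid[1] - theta_grid[0]
    best, best_loss = None, np.inf
    for k in ks:
        for h in hs:
            loss = 0.0
            for t, x in zip(theta_val, x_val):
                f = nn_kcde(theta_tr, x_tr, x, theta_grid, k, h)
                loss += dt * np.sum(f ** 2) - 2.0 * np.interp(t, theta_grid, f)
            loss /= len(theta_val)
            if loss < best_loss:
                best, best_loss = (k, h), loss
    return best, best_loss

# Toy joint sample standing in for ABC output: theta ~ N(0, 1), x = theta + noise.
rng = np.random.default_rng(3)
theta = rng.normal(0.0, 1.0, size=1100)
x = (theta + rng.normal(0.0, 0.2, size=1100)).reshape(-1, 1)
grid = np.linspace(-4.0, 4.0, 150)
(k_best, h_best), loss_best = tune_nn_kcde(theta[:1000], x[:1000],
                                           theta[1000:], x[1000:],
                                           grid, ks=[50, 200], hs=[0.1, 0.3])
```

Any other CDE estimator can be tuned the same way, since Equation 5 only requires density evaluations on a held-out sample.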
Example with Normal Posterior. To demonstrate the effectiveness of our surrogate loss in reducing the number of simulations, we draw data from a normal model. We examine the role of ABC thresholds by fitting the model for several values of the threshold given the observed data. (A similar example with a two-dimensional normal distribution can be found in the Appendix.) For each threshold, we perform rejection sampling until we retain a fixed number of ABC points. We select ABC thresholds that fix the acceptance rate of the rejection sampling; those acceptance rates are then used in place of the actual tolerance levels for easier comparison.
The left panel of Figure 5 shows examples of posterior densities for varying acceptance rates. For the highest acceptance rate of 1 (that is, every simulation is accepted), the ABC posterior (top left) is simply the prior distribution and thus a poor estimate. In contrast, the two ABC-CDE methods (FlexCode-NN and NN-KCDE) have a decent performance even at an acceptance rate of 1; more generally, they perform well at a higher acceptance rate than standard ABC.
To corroborate this qualitative look, we examine the loss for each method. The right panel of Figure 5 plots the true and surrogate losses against the acceptance rate for the given methods. As seen in Section 3.1, the surrogate loss leads to the same conclusion as the (unavailable in practice) true loss. As the acceptance rate decreases, the ABC realizations more closely approximate the true posterior and the ABC estimate of the posterior improves. The main result is that NN-KCDE and FlexCode-NN have roughly constant performance over all values of the acceptance rate. As such, we could generate only 1,000 realizations of the ABC sample at an acceptance rate of 1 and achieve similar results as standard ABC generating 100,000 values at an acceptance rate of 0.01.
There are two different sources of improvement: the first, exhibited by NN-KCDE, amounts to selecting the "optimal" ABC parameters using the surrogate loss. However, as FlexCode-NN performs slightly better than NN-KCDE on the same sample, there is an additional improvement from using CDE methods other than kernel smoothers; this difference becomes more pronounced for high-dimensional and complex data (see Izbicki and Lee 2017 for examples of when traditional kernel smoothers fail).
5 Conclusions
In this work, we have demonstrated three ways in which our conditional density estimation framework can improve upon approximate Bayesian computational methods for next-generation complex data and simulations.
First, realistic simulation models are often such that the computational cost of generating a single sample is large, making low acceptance ratios unrealistic. Our CDE framework can produce accurate posterior estimates at much higher acceptance rates, and hence with far fewer simulations, than standard ABC.
Secondly, our ABCCDE framework allows one to compare ABC and related methods in a principled way, making it possible to pick the best method for a given data set without knowing the true posterior. Our approach is based on a surrogate loss function and data splitting. We note that a related crossvalidation procedure to choose the tolerance level in ABC has been proposed by Csillery et al. (2012), albeit using a loss function that is appropriate for point estimation only.
Finally, when dealing with complex models, it is often difficult to know exactly which summary statistics would be appropriate for ABC. Nevertheless, the practitioner can usually compile a large but redundant list of candidate statistics, including statistics generated with automatic methods. As our results show, FlexCode-RF (unlike ABC) is robust to irrelevant statistics. Moreover, FlexCode, in combination with random forests for regression, offers a way of evaluating the importance of each summary statistic in estimating the full posterior distribution; hence, these importance scores could be used to choose relevant summary statistics for ABC and any other method used to estimate posteriors.
In brief, there are really two estimation problems in ABC-CDE. The first is that of estimating f(θ | x_obs): ABC-CDE starts with a rough approximation from an ABC sampler and then directly estimates the conditional density exactly at the point x = x_obs using a nonparametric conditional density estimator. The second is that of estimating the integrated squared-error loss (Eq. 2). Here we propose a surrogate loss that weights all points in the ABC posterior sample equally, but a weighted surrogate loss could potentially return more accurate estimates of the ISE. For example, Figures 9 (left) and 10 in the Appendix show that NN-KCDE performs better than ABC post-processing techniques. The current estimated loss, however, cannot identify a difference in ISE loss between NN-KCDE and "Blum" because of the rapidly shifting posterior in the vicinity of x_obs.
Acknowledgments. We are grateful to Rafael Stern for his insightful comments on the manuscript. We would also like to thank Terrence Liu, Brendan McVeigh, and Matt Walker for the NFW simulation code and Michael Vespe for help with the weak lensing simulations in the Appendix. This work was partially supported by NSF DMS-1520786, Fundação de Amparo à Pesquisa do Estado de São Paulo (2017/03363-8) and CNPq (306943/2017-4).
Links to Nonparametric Software Optimized for CDE:

NNKCDE: https://github.com/tpospisi/NNKCDE (see Appendix D)

RFCDE: https://github.com/tpospisi/rfcde (Pospisil and Lee, 2018)
Appendix A Proofs
a.1 Results on the surrogate loss
Theorem 4.
Assume that, for every θ, f(θ | x) satisfies a Hölder condition of order β in x.
Proof.
First, notice that
It follows that
∎
(8) 
Proof.
Using the triangle inequality,
where the last inequality follows from Theorem 4 and the fact that the estimator is an average of i.i.d. random variables. ∎
Lemma 1.
Assume there exists such that for every and . Then
Proof.
Notice that
where , with iid. The conclusion follows from Hoeffding’s inequality and the fact that ∎
Lemma 2.
Under the assumptions of Lemma 1 and if satisfies the Hölder condition of order with constants such that ,
Proof.
Theorem 6.
Let be a set of estimators of .
Assume there exists such that for every , , and .
Proof.
The theorem follows from Lemma 2 and the union bound. ∎
Corollary 2.
Let be the best estimator in according to the estimated surrogate loss, and let be the best estimator in according to the true loss. Then, under the assumptions from Theorem 6, with probability at least ,