ABC-CDE: Towards Approximate Bayesian Computation with Complex High-Dimensional Data and Limited Simulations

# ABC-CDE: Towards Approximate Bayesian Computation with Complex High-Dimensional Data and Limited Simulations

## Abstract

Approximate Bayesian Computation (ABC) is typically used when the likelihood is either unavailable or intractable but where data can be simulated under different parameter settings using a forward model. Despite the recent interest in ABC, high-dimensional data and costly simulations still remain a bottleneck in some applications. There is also no consensus as to how to best assess the performance of such methods without knowing the true posterior. We show how a nonparametric conditional density estimation (CDE) framework, which we refer to as ABC-CDE, help address three nontrivial challenges in ABC: (i) how to efficiently estimate the posterior distribution with limited simulations and different types of data, (ii) how to tune and compare the performance of ABC and related methods in estimating the posterior itself, rather than just certain properties of the density, and (iii) how to efficiently choose among a large set of summary statistics based on a CDE surrogate loss. We provide theoretical and empirical evidence that justify ABC-CDE procedures that directly estimate and assess the posterior based on an initial ABC sample, and we describe settings where standard ABC and regression-based approaches are inadequate.

Key Words: nonparametric methods, conditional density estimation, approximate Bayesian computation, likelihood-free inference

\xpatchcmd

## 1 Introduction

For many statistical inference problems in the sciences the relationship between the parameters of interest and observable data is complicated, but it is possible to simulate realistic data according to some model; see Beaumont (2010); Estoup et al. (2012) for examples in genetics, and Cameron and Pettitt (2012); Weyant et al. (2013) for examples in astronomy. In such situations, the complexity of the data generation process often prevents the derivation of a sufficiently accurate analytical form for the likelihood function. One cannot use standard Bayesian tools as no analytical form for the posterior distribution is available. Nevertheless one can estimate , the posterior distribution of the parameters given data , by taking advantage of the fact that it is possible to forward simulate data under different settings of the parameters . Problems of this type have motivated recent interest in methods of likelihood-free inference, which includes methods of Approximate Bayesian Computation (ABC; Marin et al. 2012)

Despite the recent surge of approximate Bayesian methods, several challenges still remain. In this work, we present a conditional density estimation (CDE) framework and a surrogate loss function for CDE that address the following three problems:

1. how to efficiently estimate the posterior density , where is the observed sample; in particular, in settings with complex, high-dimensional data and costly simulations,

2. how to choose tuning parameters and compare the performance of ABC and related methods based on simulations and observed data only; that is, without knowing the true posterior distribution, and

3. how to best choose summary statistics for ABC and related methods when given a very large number of candidate summary statistics.

Existing Methodology. There is an extensive literature on ABC methods; we refer the reader to Marin et al. (2012); Prangle et al. (2014) and references therein for a review. The connection between ABC and CDE has been noted by others; in fact, ABC itself can be viewed as a hybrid between nearest neighbors and kernel density estimators (Blum, 2010; Biau et al., 2015). As Biau et al. point out, the fundamental problem from a practical perspective is how to select the parameters in ABC methods in the absence of a priori information regarding the posterior . Nearest neighbors and kernel density estimators are also known to perform poorly in settings with a large amount of summary statistics (Blum, 2010), and they are difficult to adapt to different data types (e.g., mixed discrete-continuous statistics and functional data). Few works attempt to use other CDE methods to estimate posterior distributions. At the time of submission of this paper, the only works in this direction are Papamakarios and Murray (2016) and Lueckmann et al. (2017), which are based on conditional neural density estimation, Fan et al. (2013) and Li et al. (2015), which use a mixture of Gaussian copulas to estimate the likelihood function, and Raynal et al. (2017), which suggests random forests for quantile estimation.

Although the above mentioned methods utilize specific CDE models to estimate posterior distributions, they do not fully explore other advantages of a CDE framework; such as, in methods assessment, in variable selection, and in tuning the final estimates with CDE as a goal (see Sections 2.2, 2.3 and 4). Summary statistics selection is indeed a nontrivial challenge in likelihood-free inference: ABC methods depend on the choice of statistics and distance function of observables when comparing observed and simulated data, and using the “wrong” summary statistics can dramatically affect their performance. For a general review of dimension reduction methods for ABC, we refer the reader to Blum et al. (2013), who classify current approaches in three classes: (a) best subset selection approaches, (b) projection techniques, and (c) regularization techniques. Many of these approaches still face either significant computational issues or attempt to find good summary statistics for certain characteristics of the posterior rather than the entire posterior itself. For instance, Creel and Kristensen (2016) propose a best subset selection of summary statistics based on improving the estimate of the posterior mean . There are no guarantees however that statistics that lead to good estimates of will be sufficient for or even yield reasonable estimates of .1 We will elaborate on this point in Section 2.2.

Moreover, the current literature on likelihood-free inference lacks methods that allow one to directly compare the performance of different posterior distribution estimators. Given a collection of estimates (obtained by, e.g., ABC methods with different tolerance levels, sampling techniques, and so on), an open problem is to how to select the estimate that is closest to the true posterior density for observed data . Some goodness-of-fit techniques have been proposed (for example, Prangle et al. 2014 compute the goodness of fit based on coverage properties), but although diagnostic tests are useful, they do not capture all aspects of the density estimates. Some density estimates which are not close to the true density can pass all tests (Breiman, 2001; Bickel et al., 2006), and the situation is even worse in conditional density estimation. Figure 1 shows a toy example where both probability-probability (PP) and coverage plots wrongly indicate an excellent fit,2 but the estimated posterior distributions are far from the true densities; here, and . Indeed, standard diagnostic tests will not detect an obvious flaw in conditional density estimates that, as in this example, are equal to the marginal distribution for all .

In this paper, we show how one can improve ABC methods with a novel CDE surrogate loss function (Eq. 3) that measures how well one estimates the entire posterior distribution ; see Section 2.2 for a discussion of its theoretical properties. Our proposed method, ABC-CDE, starts with a rough approximation from an ABC sampler and then directly estimates the conditional density exactly at the point using a nonparametric conditional density estimator. Unlike other ABC post-adjustment techniques in the literature (e.g. Beaumont et al. (2002) and Blum and François (2010)), our method is optimized for estimating posteriors, and corrects for changes in the ABC posterior sample beyond the posterior mean and variance. We also present a general framework (based on CDE) that can handle different types of data (including functional data, mixed variables, structured data, and so on) as well as a larger number of summary statistics. With, for example, FlexCode (Izbicki and Lee, 2017) one can convert any existing regression estimator to a conditional density estimator. Recent neural mixture density networks that directly estimate posteriors for complex data (Papamakarios and Murray, 2016; Lueckmann et al., 2017) also fit into this ABC-CDE framework, and are especially promising for image data. Hence, with ABC-CDE, we take a different approach to address the curse of dimensionality than in traditional likelihood-free inference methods. In standard ABC, it is essential to choose (a smaller set of) informative summary statistics to properly measure user-specified distances between observed and simulated data. The main dimension reduction in our framework is implicit in the conditional density estimation and our CDE loss function. Depending of the choice of estimator, we can adapt to different types of sparse structure in the data, and just as in high-dimensional regression, handle a large amount of covariates, even without relying on a prior dimension reduction of the data and a user-specified distance function of observables.

Finally, we note that ABC summary statistic selection and goodness-of-fit techniques are typically designed to estimate posterior distributions accurately for every sample . In reality, we often only care about estimates for the particular sample that is observed, and even if a method produces poor estimates for some it can still produce good estimates for . The methods we introduce in this paper take this into consideration, and directly aim at constructing, evaluating and tuning estimators for the posterior density at the observed value .

The organization of the paper is as follows: Section 2 describes and presents theoretical results for how a CDE framework and a surrogate loss function address issues (i)–(iii). Section 3 includes simulated experiments that demonstrate that our proposed methods work in practice. In Section 4, we revisit CDE in the context of ABC and demonstrate how direct estimation of posteriors with CDE and the surrogate loss can replace further iterations with standard ABC. We then end by providing links to general-purpose CDE software that can be used in likelihood-free inference in different settings. We refer the reader to the Appendix for proofs, a comparison to post-processing regression adjustment methods, and two applications in astronomy.

## 2 Methods

In this section we propose a CDE framework for (i) estimating the posterior density (Section 2.1), (ii) comparing the performance of ABC and related methods (Section 2.2), and (iii) choosing optimal summary statistics (Section 2.3).

### 2.1 Estimating the Posterior Density via CDE

Given a prior distribution and a likelihood function , our goal is to compute the posterior distribution , where is the observed sample. We assume we know how to sample from for a fixed value of .

A naive way of estimating via CDE methods is to first generate an i.i.d. sample by sampling and then for each pair. One applies the CDE method of choice to , and then simply evaluates the estimated density at . The naive approach however may lead to poor results because some are far from the observed data . To put it differently, standard conditional density estimators are designed to estimate for every , but in ABC applications we are only interested in .

To solve this issue, one can instead estimate using a training set that only consists of sample points close to . This training set is created by a simple ABC rejection sampling algorithm. More precisely: for a fixed distance function (that could be based on summary statistics) and tolerance level , we construct a sample according to Algorithm 1.

To this new training set , we then apply our conditional density estimator, and finally evaluate the estimate at . This procedure can be regarded as an ABC post-processing technique (Marin et al., 2012): the first (ABC) approximation to the posterior is obtained via the sample , which can be seen as a sample from . That is, the standard ABC rejection sampler is implicitly performing conditional density estimation using an i.i.d. sample from the joint distribution of the data and the parameter. We take the results of the ABC sampler and estimate the conditional density exactly at the point using other forms of conditional density estimation. If done correctly, the idea is that we can improve upon the original ABC approximation even without, as is currently the norm, simulating new data or decreasing the tolerance level .

###### Remark 1.

For simplicity, we focus on standard ABC rejection sampling, but one can use other ABC methods, such as sequential ABC (Sisson et al., 2007) or population Monte Carlo ABC (Beaumont et al., 2009), to construct . The data can either be the original data vector, or a vector of summary statistics. We revisit the issue of summary statistic selection in Section 2.3.

Next, we review FlexCode (Izbicki and Lee, 2017), which we currently use as a general-purpose methodology for estimating . However, many aspects of the paper (such as the novel approach to method selection without knowledge of the true posterior) hold for other CDE and ABC methods as well. In Section 4, for example, we use our surrogate loss to choose the tuning parameters of a nearest-neighbors kernel density estimator (Equation 14), which includes ABC as a special case.

FlexCode as a “Plug-In” CDE Method. For simplicity, assume that we are interested in estimating the posterior distribution of a single parameter , even if there are several parameters in the problem.3 Similar ideas can be used if one is interested in estimating the (joint) posterior distribution for more than one parameter (see Izbicki and Lee 2017 for more details on how FlexCode can be adapted to those settings). In the context of ABC, typically represents a set of statistics computed from the original data; recall Remark 1. We start by specifying an orthonormal basis in . This basis will be used to model the density as a function of . Note that there is a wide range of (orthogonal) bases one can choose from to capture any challenging shape of the density function of interest (Mallat, 1999). For instance, a natural choice for reasonably smooth functions is the Fourier basis:

 ϕ1(θ)=1;ϕ2i+1(θ)=√2sin(2πiθ), i∈N;ϕ2i(θ)=√2cos(2πiθ), i∈N

The key idea of FlexCode is to notice that, if for every , then it is possible to expand as where the expansion coefficients are given by

 βi(x)=E[ϕi(θ)|x]. (1)

That is, each is a regression function. The FlexCode estimator is defined as where are regression estimates. The cutoff in the series expansion is a tuning parameter that controls the bias/variance tradeoff in the final density estimate, and which we choose via data splitting (Section 2.2).

With FlexCode, the problem of high-dimensional conditional density estimation boils down to choosing appropriate methods for estimating the regression functions . The key advantage of FlexCode is that it offers more flexible CDE methods: By taking advantage of existing regression methods, which can be “plugged in” into the CDE estimator, we can adapt to the intrinsic structure of high-dimensional data (e.g., manifolds, irrelevant covariates, and different relationships between and the response ), as well as handle different data types (e.g., mixed data and functional data) and massive data sets (by using, e.g., xgboost (Chen and Guestrin, 2016)). See Izbicki and Lee (2017) and the upcoming LSST-DESC photo-z DC1 paper for examples. An implementation of FlexCode that allows for wavelet bases can be found at https://github.com/rizbicki/FlexCoDE (R; R Core Team 2013) and https://github.com/tpospisi/flexcode (Python).

### 2.2 Method Selection: Comparing Different Estimators of the Posterior

Definition of a Surrogate Loss. Ultimately, we need to be able to decide which approach is best for approximating without knowledge of the true posterior. Ideally we would like to find an estimator such that the integrated squared-error (ISE) loss

 Lxo(ˆf,f)=∫(ˆf(θ|xo)−f(θ|xo))2dθ (2)

is small. Unfortunately, one cannot compute without knowing the true , which is why method selection is so hard in practice. To overcome this issue, we propose the surrogate loss function

 Lϵxo(ˆf,f)=∫∫(ˆf(θ|x)−f(θ|x))2f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dθdx, (3)

which enforces a close fit in an -neighborhood of . Here, the denominator is simply a constant that makes a proper density in .

The advantage with the above definition is that we can directly estimate from the ABC posterior sample. Indeed, it holds that can be written as

 ∫∫ˆf2(θ|x)f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dθdx−2∫∫ˆf(θ|x)f(θ|x)f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dθdx+Kf =EX′[∫ˆf2(θ|X′)dθ]−2E(θ′,X′)[ˆf(θ′|X′)]+Kf, (4)

where is a random vector with distribution induced by a sample generated according to the ABC rejection procedure in Algorithm 1; and is a constant that does not depend on the estimator . It follows that, given an independent validation or test sample of size of the ABC algorithm, , we can estimate (up to the constant ) via

 ˆLϵxo(ˆf,f)=1B′B′∑k=1∫ˆf2(θ|x′k)dθ−21B′B′∑k=1ˆf(θ′k|x′k) (5)

When given a set of estimators , we select the method with the smallest estimated surrogate loss,

 ˆf∗:=argminˆf∈FˆLϵxo(ˆf,f)
###### Example 1 (Model selection based on CDE surrogate loss versus regression MSE loss).

Suppose we wish to estimate the posterior distribution of the mean of a Gaussian distribution with variance one. The left plot of Figure 2 shows the performance of a nearest-neighbors kernel density estimator (Equation 14) with the kernel bandwidth and the number of nearest neighbors chosen via (i) the estimated surrogate loss of Equation 5 versus (ii) a standard regression mean-squared-error loss.4 The proposed surrogate loss clearly leads to better estimates of the posterior with smaller true loss (Equation 2). Indeed, as the right plot shows, if one chooses tuning parameters via the standard regression mean-squared-error loss, the estimates end up being very far from the true distribution.

Properties of the Surrogate Loss. Next we investigate the conditions under which the estimated surrogate loss is close to the true loss; the proofs can be found in Appendix. The following theorem states that, if is a smooth function of , then the (exact) surrogate loss is close to for small values of .

###### Theorem 1.

Assume that, for every , satisfies the Hölder condition of order with a constant 5 such that . Then

The next theorem shows that the estimator in Equation 5 does indeed converge to the true loss .

###### Theorem 2.

Let be as in Equation A.1. Under the assumptions of Theorem 4,

Under some additional conditions, it is also possible to guarantee that not only the estimated surrogate loss is close to the true loss, but that the result holds uniformly for a finite class of estimators of the posterior distribution. This is formally stated in the following theorem.

###### Theorem 3.

Let be a set of estimators of . Assume that there exists such that for every , , and .6 Moreover, assume that for every , satisfies the Hölder condition of order with constants such that . Then, for every ,

 P(maxˆf∈F|ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)|≥Kϵϵβ+ν)≤2me−B′ν22(M2+2M)2.

The next corollary shows that the procedure we propose in this section, with high probability, picks an estimate of the posterior density that has a true loss that is close to the true loss of the best method in .

###### Corollary 1.

Let be the best estimator in according to the estimated surrogate loss, and let be the best estimator in according to the true loss. Then, under the assumptions from Theorem 6, with probability at least ,

### 2.3 Summary Statistics Selection

In a typical ABC setting, there are a large number of available summary statistics. Standard ABC fails if all of them are used simultaneously, especially if some statistics carry little information about the parameters of the model (Blum, 2010).

One can use ABC-CDE as a way of either (i) directly estimating when there are a large number of summary statistics,7 or (ii) assigning an importance measure to each summary statistic to guide variable selection in ABC and related procedures.

There are two versions of ABC-CDE that are particularly useful for variable selection: FlexCode-SAM8 and FlexCode-RF9. Izbicki and Lee (2017) show that both estimators automatically adapt to the number of relevant covariates, i.e., the number of covariates that influence the distribution of the response. In the context of ABC, this means that these methods are able to automatically detect which summary statistics are relevant in estimating the posterior distribution of . Corollary 1 from Izbicki and Lee (2017) implies that, if indeed only out of all summary statistics influence the distribution of , then the rate of convergence of these methods is instead of , where and are numbers associated to the smoothness of . The former rate implies a much faster convergence: if , it is essentially the rate one would obtain if one knew which were the relevant statistics. In such a setting, there is no need to explicitly perform summary statistic selection prior to estimating the posterior; FlexCode-SAM or FlexCode-RF automatically remove irrelevant covariates.

More generally, one can use FlexCode to compute an importance measure for summary statistics (to be used in other procedures than FlexCode). It turns out that one can infer the relevance of the :th summary statistic in posterior estimation from its relevance in estimating the first regression functions in FlexCode — even if we do not use FlexCode for estimating the posterior. More precisely, assume that is a vector of summary statistics, and let . Define the relevance of variable to the posterior distribution as

 rj:=∫∫∫(f(θ|x)−f(θ|x′j))2dxdx′jdθ,

and its relevance to the regression in Equation 1 as

 ri,j:=∫∫(βi(x)−βi(x′j))2dxdx′j.

Under some smoothness assumptions with respect to , the two metrics are related.

###### Assumption 1 (Smoothness in θ direction).

, we assume that the Sobolev space of order and radius ,10 where is viewed as a function of , and and are such that and .

###### Proposition 1.

Under Assumption 2,

Now let denote a measure of importance of the :th summary statistic in estimating regression (Equation 1). For instance, for FlexCode-RF, may represent the mean decrease in the Mean Squared Error (Hastie et al., 2001); for FlexCode-SAM, may be value of the indicator function for the :th summary statistic when estimating . Motivated by Proposition 2, we define an importance measure for the :th summary statistic in posterior estimation according to

 uj:=1II∑i=1ui,j. (6)

We can use these values to select variables for estimating via other ABC methods. For example, one approach is to choose all summary statistics such that , where the threshold value is defined by the user. We will further explore this approach in Section 3.

In summary, our procedure has two main advantages compared to current state-of-the-art approaches for selecting summary statistics in ABC: (i) it chooses statistics that lead to good estimates of the entire posterior distribution rather than surrogates, such as, the regression or posterior mean (Aeschbacher et al., 2012; Creel and Kristensen, 2016; Faisal et al., 2016), and (ii) it is typically faster than most other approaches; in particular, it is significantly faster than best subset selection which scales as , whereas, e.g., FlexCode-RF scales as , and FlexCode-SAM scales as .

## 3 Experiments

### 3.1 Examples with Known Posteriors

We start by analyzing examples with well-known and analytically computable posterior distributions:

• 1. Mean of a Gaussian with known variance. , . We repeat the experiments for in an equally spaced grid with ten values between 0.5 and 100.

• 2. Precision of a Gaussian with unknown precision. , . We set , , and repeat the experiments choosing and such that and is in an equally spaced grid with ten values between 0.1 and 5.

In the Appendix we also investigate a third setting, “Mean of a Gaussian with unknown precision”, with results similar to those shown here in the main manuscript.

In all examples here, observed data are drawn from a distribution. We run each experiment 200 times, that is, with 200 different values of . The training set , which is used to build conditional density estimators, is constructed according to Algorithm 1 with and a tolerance level that corresponds to an acceptance rate of 1%. For the distance function , we choose the Euclidean distance between minimal sufficient statistics normalized to have mean zero and variance 1; these statistics are for scenario 1 and for scenario 2. We use a Fourier basis for all FlexCode experiments in the paper, but wavelets lead to similar results.

We compare the following methods (see Section 4 and the appendix for a comparison between ABC, regression adjustment methods and ABC-CDE with a standard kernel density estimator):

• ABC: rejection ABC method with the minimal sufficient statistics (that is, apply a kernel density estimator to the coordinate of , with bandwidth chosen via cross-validation),

• FlexCode_Raw-NN: FlexCode estimator with Nearest Neighbors regression,

• FlexCode_Raw-Series: FlexCode estimator with Spectral Series regression (Lee and Izbicki, 2016), and

• FlexCode_Raw-RF: FlexCode estimator with Random Forest regression.

The three FlexCode estimators (denoted by “Raw”) are directly applied to the sorted values of the original covariates . That is, we do not use minimal sufficient statistics or other summary statistics. To assess the performance of each method, we compute the true loss (Equation 2) for each . In addition, we estimate the surrogate loss according to Equation 5 using a new sample of size from Algorithm 1.

#### CDE and Method Selection

In this section, we investigate whether various ABC-CDE methods improve upon standard ABC for the settings described above. We also evaluate the method selection approach in Section 2.2 by comparing decisions based on estimated surrogate losses to those made if one knew the true ISE losses.

Figure (h)h, left, shows how well the methods actually estimate the posterior density for Settings 1-2. Panel (a) and (e) list the proportion of times each method returns the best results (according to the true loss from Equation 2). Generally speaking, the larger the prior variance, the better ABC-CDE methods perform compared to ABC. In particular, while for small variances ABC tends to be better, for large prior variances, FlexCode with Nearest Neighbors regression tends to give the best results. FlexCode with expansion coefficients estimated via Spectral Series regression is also very competitive. Panels (c) and (g) confirms these results; here we see the average true loss of each method along with standard errors.

Figure (h)h, right, summarizes the performance of our method selection algorithm. Panels (b) and (f) list the proportion of times the method chosen by the true loss (Equation 2) matches the method chosen via the estimated loss (Equation 5) in all pairwise comparisons; that is, the plot tells us how often the method selection procedure proposed in Section 2.2 actually works. We present two variations of the algorithm: in the first version (see triangles), we include all the data; in the second version (see circles), we remove cases where the confidence interval for contains zero (i.e., cases where we cannot tell whether or performs better). The baseline shows what one would expect if the method selection algorithm was totally random. The plots indicate that we, in all settings, roughly arrive at the same conclusions with the estimated surrogate loss as we would if we knew the true loss.

For the sake of illustration, we have also added panels (d) and (h), which show a scatterplot of differences between true losses versus the differences between the estimated losses for ABC and FlexCode_Raw-NN for the setting with and , respectively. The fact that most samples are either in the first or third quadrant further confirms that the estimated surrogate loss is in agreement with the true loss in terms of which method best estimates the posterior density.

#### Summary Statistic Selection

In this Section we investigate the performance of FlexCode-RF for summary statistics selection (Sec. 2.3). For this purpose, the following summary statistics were used:

1. Mean: average of the data points;

2. Median: median of the data points;

3. Mean 1: average of the first half of the data points;

4. Mean 2: average of the second half of the data points;

5. SD: standard deviation of the data points;

6. IQR: interquartile range of the data points;

7. Quartile 1: first quantile of the data points;

8. Independent random variables , that is, random noise

Figure (d)d summarizes the results of fitting FlexCode-RF and ABC to these summary statistics for the different scenarios. Panel (a) and (c) show the true loss as we increase the number of statistics. More precisely: the values at represent the true loss of ABC (left) and FlexCode-RF (right) when using only the mean (i.e., the first statistic); the points at indicate the true loss of the estimates using only the mean and the median (i.e., the first and second statistics) and so on. We note that FlexCode-RF is robust to irrelevant summary statistics: the method virtually behaves as if they were not present. This is in sharp contrast with standard ABC, whose performance deteriorates quickly with added noise or nuisance statistics.

Furthermore, panels (b) and (d) show the average importance of each statistic, defined according to Equation 6, where is the mean decrease in the Gini index. These plots reveal that FlexCode-RF typically assigns a high score to sufficient summary statistics or to statistics that are highly correlated to sufficient statistics. For instance, in panel (b) (estimation of the mean of the distribution), measures of location are assigned a higher importance score, whereas measures of dispersion are assigned a higher score in panel (d) (estimation of the precision of the distribution). In all examples, FlexCode-RF assigns zero importance to random noise statistics. We conclude that our method for summary statistic selection indeed identifies relevant statistics for estimating the posterior well.

## 4 Approximate Bayesian Computation and Beyond

In this section, we show how one can use our surrogate loss to choose the tuning parameters in standard ABC with a nearest neighbors kernel smoother.

### 4.1 ABC with Fewer Simulations

As noted by (Blum, 2010; Biau et al., 2015), ABC is equivalent to a kernel-CDE. More specifically, it can be seen as a “nearest-neighbors” kernel-CDE (NN-KCDE) defined by

 ˆfnn(θ∣x)=1kk∑i=1Kh(ρ(θ,θsi(x))), (7)

where represents the index of the th nearest neighbor to the target point in covariate space, and we compute the conditional density of at by applying a kernel smoother with bandwidth to the points closest to .

For a given set of generated data, the above is equivalent to selecting the ABC threshold as the -th quantile of the observed distances. This is commonly used in practice as it is more convenient than determining a priori. However, as pointed out by Biau et al. (2015) (Section 4; remark 1), there is currently no good methodology to select both and in an ABC k-nearest neighbor estimate.

Given the connection between ABC and NN-KCDE, we propose to use our surrogate loss to tune the estimator; selecting and such that they minimize the estimated surrogate loss in Equation 5. In this sense, we are selecting the “optimal” ABC parameters after generating some of the data: having generated 10,000 points, it may turn out that we would have preferred a smaller tolerance level and that only using the closest 1,000 points would better approximate the posterior.

Example with Normal Posterior. To demonstrate the effectiveness of our surrogate loss in reducing the number of simulations, we draw data where . We examine the role of ABC thresholds by fitting the model for several values of the threshold with observed data . (A similar example with a two-dimensional normal distribution can be found in the Appendix.) For each threshold, we perform rejection sampling until we retain ABC points. We select ABC thresholds to fix the acceptance rate of the rejection sampling. Those acceptance rates are then used in place of the actual tolerance level for easier comparison.

The left panel of Figure 5 shows examples of posterior densities for varying acceptance rates. For the highest acceptance rate of 1 (corresponding to the ABC tolerance level ), the ABC posterior (top left) is the prior distribution and thus a poor estimate. In contrast, the two ABC-CDE methods (FlexCode-NN and NN-KCDE) have a decent performance even at an acceptance rate of 1; more generally, they perform well at a higher acceptance rate than standard ABC.

To corroborate this qualitative look, we examine the loss for each method. The right panel of Figure 5 plots the true and surrogate losses against the acceptance rate for the given methods. As seen in Section 3.1.1, the surrogate loss provides the same conclusion as the (unavailable in practice) true loss. As the acceptance rate decreases, the ABC realizations more closely approximate the true posterior and the ABC estimate of the posterior improves. The main result is that NN-KCDE and FlexCode-NN have roughly constant performance over all values of the acceptance rate. As such, we could generate only 1,000 realizations of the ABC sample at an acceptance rate of 1 and achieve similar result as standard ABC generating 100,000 values at an acceptance rate of 0.01.

There are two different sources of improvement: the first exhibited by NN-KCDE amounts to selecting the “optimal” ABC parameters and using surrogate loss. However, as FlexCode-NN performs slightly better than NN-KCDE for the same sample, there is an additional improvement in using CDE methods other than kernel smoothers; this difference becomes more pronounced for high-dimensional and complex data (see Izbicki and Lee 2017 for examples of when traditional kernel smoothers fail).

## 5 Conclusions

In this work, we have demonstrated three ways in which our conditional estimation framework can improve upon approximate Bayesian computational methods for next-generation complex data and simulations.

First, realistic simulation models are often such that the computational cost of generating a single sample is large, making lower acceptance ratios unrealistic.

Secondly, our ABC-CDE framework allows one to compare ABC and related methods in a principled way, making it possible to pick the best method for a given data set without knowing the true posterior. Our approach is based on a surrogate loss function and data splitting. We note that a related cross-validation procedure to choose the tolerance level in ABC has been proposed by Csillery et al. (2012), albeit using a loss function that is appropriate for point estimation only.

Finally, when dealing with complex models, it is often difficult to know exactly what summary statistics would be appropriate for ABC. Nevertheless, the practitioner can usually make up a list of a large but redundant number of candidate statistics, including statistics generated with automatic methods. As our results show, FlexCode-RF (unlike ABC) is robust to irrelevant statistics. Moreover, FlexCode, in combination with RF for regression, offers a way of evaluating the importance of each summary statistic in estimating the full posterior distribution; hence, these importance scores could be used to choose relevant summary statistics for ABC and any other method used to estimate posteriors.

In brief, there are really two estimation problems in ABC-CDE: The first is that of estimating . ABC-CDE starts with a rough approximation from an ABC sampler and then directly estimates the conditional density exactly at the point using a nonparametric conditional density estimator. The second is that of estimating the integrated squared error loss (Eq. 2). Here we propose a surrogate loss that weights all points in the ABC posterior sample equally, but a weighted surrogate loss could potentially return more accurate estimates of the ISE. For example, Figures 9 (left) and 10 in the Appendix show that NN-KCDE perform better than ABC post-processing techniques. The current estimated loss, however, cannot identify a difference in ISE loss between NN-KCDE and “Blum” because of the rapidly shifting posterior in the vicinity of .

Acknowledgments. We are grateful to Rafael Stern for his insightful comments on the manuscript. We would also like to thank Terrence Liu, Brendan McVeigh, and Matt Walker for the NFW simulation code and to Michael Vespe for help with the weak lensing simulations in the Appendix. This work was partially supported by NSF DMS-1520786, Fundação de Amparo à Pesquisa do Estado de São Paulo (2017/03363-8) and CNPq (306943/2017-4).

Links to Nonparametric Software Optimized for CDE:

## Appendix A Proofs

### a.1 Results on the surrogate loss

###### Theorem 4.

Assume that, for every , satisfies the Hölder condition of order with a constant 11 such that . Then

 |Lϵxo(ˆf,f)−Lxo(ˆf,f)|=KHϵβ=O(ϵβ)
###### Proof.

First, notice that

 Lxo(ˆf,f)=∫gθ(xo)dθ∫f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dx=∫∫gθ(xo)f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dxdθ.

It follows that

 |Lϵxo(ˆf,f)−Lxo(ˆf,f)| =∣∣ ∣∣∫(∫(gθ(x)−gθ(xo))f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dx)dθ∣∣ ∣∣ ≤∫(∫|gθ(x)−gθ(xo)|f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dx)dθ ≤∫(∫Kθd(x,xo)βf(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dx)dθ ≤∫Kθϵβ(∫f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dx)dθ =ϵβ∫Kθ1dθ=KHϵβ

 Lϵxo(ˆf,f) = ∫∫ˆf2(θ|x)f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dθdx−2∫∫ˆf(θ|x)f(θ|x)f(x)I(d(x,xo)<ϵ)P(d(X,xo)<ϵ)dθdx+Kf =EX′[∫ˆf2(θ|X)dθ]−2E(θ′,X′)[ˆf(θ|X)]+Kf, (8)
###### Theorem 5.

Let be as in Equation A.1. Under the assumptions of Theorem 4,

 |ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)|=O(ϵβ)+OP(1/√B′)
###### Proof.

Using the triangle inequality,

 |ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)| ≤|ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)|+|Lϵxo(ˆf,f)−Lxo(ˆf,f)| =O(ϵβ)+OP(1/√B′),

where the last inequality follows from Theorem 4 and the fact that is an average of iid random variables. ∎

###### Lemma 1.

Assume there exists such that for every and . Then

 P(|ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)|≥ν)≤2e−B′ν22(M2+2M)2
###### Proof.

Notice that

 ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)=1B′B′∑k=1Wk−E[W1],

where , with iid. The conclusion follows from Hoeffding’s inequality and the fact that

###### Lemma 2.

Under the assumptions of Lemma 1 and if satisfies the Hölder condition of order with constants such that ,

 P(|ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)|≥KHϵβ+ν)≤2e−B′ν22(M2+2M)2,
###### Proof.

Notice that

 | ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)|−KHϵβ =|ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)+Lϵxo(ˆf,f)−Lxo(ˆf,f)|−KHϵβ ≤|ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)|+|Lϵxo(ˆf,f)−Lxo(ˆf,f)|−KHϵβ ≤|ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)|,

where the last line follows from Theorem 4. It follows that

 |ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)|≥KHϵβ+ν⇒|ˆLϵxo(ˆf,f)+Kf−Lϵxo(ˆf,f)|≥ν.

The conclusion follows from Lemma 1. ∎

###### Theorem 6.

Let be a set of estimators of . Assume there exists such that for every , , and . 12 Moroever, assume that for every , satisfies the Hölder condition of order with constants such that . Then,

 P(maxˆf∈F|ˆLϵxo(ˆf,f)+Kf−Lxo(ˆf,f)|≥Kϵϵβ+ν)≤2me−B′ν22(M2+2M)2.
###### Proof.

The theorem follows from Lemma 2 and the union bound. ∎

###### Corollary 2.

Let be the best estimator in according to the estimated surrogate loss, and let be the best estimator in according to the true loss. Then, under the assumptions from Theorem 6, with probability at least ,

 Lxo(ˆf∗,f)≤Lxo(f∗,f)+2