Confidence Sets and Hypothesis Tests in a LikelihoodFree Inference Setting
Abstract
Parameter estimation, statistical tests and confidence sets are the cornerstones of classical statistics that allow scientists to make inferences about the underlying process that generated the observed data. A key question is whether one can still construct hypothesis tests and confidence sets with proper coverage and high power in a socalled likelihoodfree inference (LFI) setting; that is, a setting where the likelihood is not explicitly known but one can forwardsimulate observable data according to a stochastic model. In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI that first formulates the classical likelihood ratio test (LRT) as a parametrized classification problem, and then uses the equivalence of tests and confidence sets to build confidence regions for parameters of interest. We also present a goodnessoffit procedure for checking whether the constructed tests and confidence regions are valid. ACORE is based on the key observation that the LRT statistic, the rejection probability of the test, and the coverage of the confidence set are conditional distribution functions which often vary smoothly as a function of the parameters of interest. Hence, instead of relying solely on samples simulated at fixed parameter settings (as is the convention in standard Monte Carlo solutions), one can leverage machine learning tools and data simulated in the neighborhood of a parameter to improve estimates of quantities of interest. We demonstrate the efficacy of ACORE with both theoretical and empirical results. Our implementation is available on Github.
\commentParameter estimation, statistical tests and confidence sets are the cornerstones of classical statistics that allow scientists to make inferences about the underlying process that generated the observed data. A key question is whether one can still construct hypothesis tests and confidence sets with proper coverage and high power in a socalled likelihoodfree inference (LFI) setting, where the likelihood is not explicitly known but one can forwardsimulate observable data according to a stochastic model. In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI that first formulates the classical likelihood ratio test (LRT) as a parametrized classification problem, and then uses the equivalence of tests and confidence sets to build confidence regions for parameters of interest. We also present a goodnessoffit test for checking whether the constructed tests and confidence regions are valid. ACORE is based on the key observation that the LRT statistic, the rejection probability of the test, and the coverage of the confidence set are conditional distribution functions which often vary smoothly as a function of the the parameters of interest. Hence, instead of relying solely on samples simulated at fixed parameter settings (as is the convention in standard Monte Carlo solutions), one can leverage machine learning tools and data simulated in the neighborhood of a parameter to improve estimates of quantities of interest. We demonstrate the efficacy of ACORE with both theoretical and empirical results.
1 Introduction
Parameter estimation, statistical tests and confidence sets are the cornerstones of classical statistics that relate observed data to properties of the underlying statistical model. Most frequentist procedure with good statistical performance (e.g., high power) require explicit knowledge of a likelihood function. However, in many science and engineering applications, complex phenomena are modeled by forward simulators that implicitly define a likelihood function: For example, given input parameters , a statistical model of our environment, climate or universe may combine deterministic dynamics with random fluctuations to produce synthetic data . Simulationbased inference without an explicit likelihood is called likelihoodfree inference (LFI).
The literature on LFI is vast. Traditional LFI methods, such as Approximate Bayesian Computation (ABC; Beaumont et al. 2002; Marin et al. 2012; Sisson et al. 2018), estimate posteriors by using simulations sufficiently close to the observed data, hence bypassing the likelihood. More recently, several approaches that leverage machine learning algorithms have been proposed; these either directly estimate the posterior distribution (Marin et al., 2016; Chen & Gutmann, 2019; Izbicki et al., 2019; Greenberg et al., 2019) or the likelihood function (Izbicki et al., 2014; Thomas et al., 2016; Price et al., 2018; Ong et al., 2018; Lueckmann et al., 2019; Papamakarios et al., 2019). We refer the reader to Cranmer et al. (2019) for a recent review of the field.
A question that has not received much attention so far is whether one, in an LFI setting, can construct inference techniques with good frequentist properties. Frequentist procedures have nevertheless played an important role in many fields. In high energy physics for instance, classical statistical techniques (e.g., hypothesis testing for outlier detection) have resulted in discoveries of new physics and other successful applications (Feldman & Cousins, 1998; Cranmer, 2015; Cousins, 2018). Even though controlling type I error probabilities is important in these applications, most LFI methods do not have guarantees on validity or power. Ideally, a unified LFI approach should

be computationally efficient in terms of the number of required simulations,

handle highdimensional data from different sources (without, e.g., predefined summary statistics),

produce hypothesis tests and confidence sets that are valid; that is, have the nominal type I error or confidence level,

produce hypothesis tests with high power or, equivalently, confidence sets with a small expected size,

provide diagnostics for checking empirical coverage or for checking how well the (approximate likelihood) model fits simulated data.
In this paper, we present ACORE (Approximate Computation via Odds Ratio Estimation), a frequentist approach to LFI, which addresses the above mentioned concerns.
Figure 2 summarizes the ACORE work structure: ACORE first compares synthetic data from the simulator to a reference distribution by computing an “Odds Ratio”. The odds ratio can be learnt with a probabilistic classifier, such as a neural network, suitable for the data at hand. As we shall see, the estimated odds ratio is an approximation of the likelihood ratio statistic (Proposition 3.1). The ACORE test statistic (Equation 4), together with an estimate of the “Critical Value” (Algorithms 1 and 2), can be used for hypothesis testing or for finding a confidence set for . ACORE also includes “Diagnostics” (Section 3.3) for computing the empirical coverage of the constructed confidence set for .
At the heart of ACORE is the key observation that the likelihood ratio statistic, the critical value of the test, and the coverage of the confidence set are conditional distribution functions which often vary smoothly as a function of the (unknown) parameters of interest. Hence, instead of relying solely on samples simulated at fixed parameter settings (as is the convention in standard Monte Carlo solutions), one can leverage machine learning tools and data simulated in the neighborhood of a parameter to improve estimates of quantities of interest and decrease the total number of simulated data points. Our contribution is threefold:

a new procedure for estimating the likelihood ratio statistic, which uses machine learning (ML) methods and does not require repeated sampling at each or a separate interpolation or calibration step,

an efficient procedure for estimating the critical value that guarantees valid tests and confidence sets (again using ML methods and without repeated sampling at each ),

a new goodnessoffit technique for computing empirical coverage of constructed confidence sets as a function of the unknown parameters .
Finally, ACORE is simple and modular by construction. One can easily switch out, generalize or take pieces of the framework and apply it to any similar machinelearningbased LFI setting. The theoretical results of Section 3.1 hold for a general setting. In addition, given the vast arsenal of existing probabilistic classifiers developed in the literature, ACORE can be applied to many different types of complex data (e.g., images, time series and functional data). In Section 4, we show empirical results connecting the power of the constructed hypothesis tests to the performance of the classifier. Note also that Algorithms 1 and 2 for estimating parametrized critical values apply to any hypothesis test on of the form of Equation 2 for any test statistic . The goodnessoffit procedure in Section 3.3 for checking empirical coverage as a function of is also not tied to odds ratios.
1.1 Related Work
The problem of constructing confidence intervals with good frequentist properties has a long history in statistics (Neyman, 1937; Feldman & Cousins, 1998; Chuang & Lai, 2000). One of the earlier simulationbased approaches was developed in high energy physics (HEP) by Diggle & Gratton (1984); their proposed scheme of estimating the likelihood and likelihoood ratio statistic nonparametrically by histograms of photon counts would later become a key component in the discovery of the Higgs Boson (Aad et al., 2012). However, traditional approaches for building confidence regions and hypothesis tests in LFI rely on a series of Monte Carlo samples at each parameter value (Barlow & Beeston, 1993; Weinzierl, 2000; Schafer & Stark, 2009). Thus, these approaches quickly become inefficient with large or continuous parameter spaces. Traditional nonparametric approaches also have difficulties handling highdimensional data without losing key information.
LFI has recently benefited from using powerful machine learning tools like deep learning to estimate likelihood functions and likelihood ratios for complex data. Successful application areas include HEP Guest et al. (2018), astronomy Alsing et al. (2019) and neuroscience Gonçalves et al. (2019). ACORE has some similarities to the work of Cranmer et al. (2015) which also uses machine learning methods for frequentist inference in an LFI setting. Other elements of ACORE such as leveraging the ability of ML algorithms to smooth over parameter space turning a density ratio estimate into a supervised classification problem have also previously been used in LFI settings: Works that smooth over parameter space include, e.g., Gaussian processes (Leclercq, 2018; Frate et al., 2017) and neural networks Baldi et al. (2016). Works that turn a density ratio into a classification problem include applications to generative models (see Mohamed & Lakshminarayanan (2016) for a review), and Bayesian LFI (Thomas et al., 2016; Gutmann et al., 2018; Dinev & Gutmann, 2018; Hermans et al., 2019). Finally, like ACORE, Thornton et al. (2017) explores frequentist guarantees of confidence regions; however those regions are built under a Bayesian framework.
Novelty. What distinguishes the ACORE approach from other related work is that it uses an efficient procedure for estimating the likelihood ratio, critical values and coverage of confidence sets across the entire parameter space without the need for an extra interpolation or calibration step (as in traditional Monte Carlo solutions and more recent ML approaches). We provide theoretical guarantees on our procedures in terms of power and validity (Section 3.1; proofs in Supp. Mat. C). We also offer a scheme for how to choose ML algorithms and the number of simulations so as to have good power properties and valid inference in practice.
Notation. Let with density represent the stochastic forward model for a sample point at parameter . We denote i.i.d “observable” data from by , and the actually observed or measured data by . The likelihood function .
2 Statistical Inference in a Traditional Setting
We begin by reviewing elements of traditional statistical inference that play a key role in ACORE.
Equivalence of tests and confidence sets. A classical approach to constructing a confidence set for an unknown parameter is to invert a series of hypothesis tests (Neyman, 1937): Suppose that for each possible value , there is a level test of
(1) 
that is, a test where the type I error (the probability of erroneously rejecting a true null hypothesis ) is no larger than . For observed data , now define as the set of all parameter values for which the test does not reject . Then, by construction, the random set satisfies
for all . That is, defines a confidence set for . Similarly, we can define a test with a desired significance level from a confidence set with a certain coverage.
Likelihood ratio test. A general form of hypothesis tests that often leads to high power is the likelihood ratio test (LRT). Consider testing
(2) 
where . For the likelihood ratio (LR) statistic,
(3) 
the LRT of hypotheses (2) rejects when for some constant . Figure 1 illustrates the construction of confidence sets for from level likelihood ratio tests (1). The critical value for each such test is
3 Acore: Approximate Computation via Odds Ratio Estimation
In a likelihoodfree inference setting, we cannot directly evaluate the likelihood ratio statistic. Here we describe the details of how a simulationbased approach (ACORE, Figure 2) can lead to hypothesis tests and confidence sets with good frequentist properties.
3.1 Hypothesis Testing via Odds Ratios
We start by simulating a labeled sample for computing odds ratios. The estimated odds ratio then defines a new test statistic that we use in place of the unknown likelihood ratio statistic.
Simulating a labeled sample. Let be a distribution with larger support than for all . The distribution could for example be a dominating distribution which it is easy to sample from. We use and to simulate a labeled training sample for estimating odds ratios. The random sample is identically distributed as , where the parameters (a fixed proposal distribution over ), the “label” (a Bernoulli distribution with known with independent of ), and . That is, the label is the indicator that the sample point was generated from rather than . We call a “reference distribution” as we are comparing for different with this distribution. (Algorithm 3 in Supp. Mat. A summarizes our procedure.)
Odds ratios. For fixed , we define the odds at as
and the odds ratio at as
One way of interpreting the odds is to regard it as a measure of the chance that was generated from . That is, a large odds reflects the fact that it is plausible that was generated from (rather than ). Thus, measures the plausibility that was generated from rather than . When testing (2), we therefore reject if , for some constant . By Bayes rule, this is just the likelihood ratio test of (2).
Hypothesis testing in an LFI setting. In an LFI setting, we cannot directly evaluate the likelihood ratio statistic (3). The advantage of rewriting the LRT in terms of odds ratios is that we can forwardsimulate a labeled training sample , as described above, and then use a probabilistic classifier (suitable for the data at hand) to efficiently estimate the odds ratios for all : The probabilistic classifier compares data from the forward model with data from the reference distribution and returns a parametrized odds estimate , which is a function of . We can directly compute the odds ratio estimate at any two values from . There is no need for a separate training step.
We reject if the ACORE test statistic defined as
(4) 
is small enough for observed data . If the probabilities learned by the classifier are well estimated, is exactly the likelihood ratio statistic:
Proposition 3.1 (Fisher Consistency).
Estimating the critical value. A key question is how to efficiently estimate the critical value of a test. In this section we consider a single composite null hypothesis . (The setting for constructing confidence sets by testing (1) for all is discussed in Section 3.2.). Suppose that we reject the null hypothesis if the test statistic (4) is smaller than some constant . To achieve a test with a desired level of significance , we need (for maximum power) the largest that satisfies
(5) 
However, we cannot explicitly compute the critical value or the rejection probability as we do not know the distribution of the test statistic .
Simulationbased approaches are often used to compute rejection probabilities and critical values in lieu of largesample theory approximations. Typically, such simulations compute a separate Monte Carlo simulation at each fixed on, e.g., a fine enough grid on . That is, the convention is to rely solely on sample points generated at fixed to estimate the rejection probabilities . Here we propose to estimate the critical values for all and significance levels simultaneously. At the heart of our approach is the key observation that the rejection probability is a conditional cumulative distribution function, which for many models varies smoothly as a function of and . Thus, similar to how we estimate odds for the ACORE statistic, one can use data generated in the neighborhood of to improve estimates of our quantities of interest at any . This is what a quantile regression implicitly does to estimate .
Algorithm 1 outlines the details of the procedure for estimating . In brief, we use a training sample (independent of ) to estimate the conditional quantile defined by Let be the estimate of from a quantile regression of on . By (5), our estimate of the critical value is As we shall see, even if the odds are not well estimated, tests and confidence regions based on estimated odds are still valid as long as the thresholds are well estimated. Next we show that the sample size in Algorithm 1 controls the type I error (Theorem 3.3), whereas the training sample size for estimating odds is related to the power of the test (Theorem 3.4).
Theoretical guarantees. We start by showing that our procedure leads to valid hypothesis tests (that is, tests that control the type I error probability) as long as in Algorithm 1 is large enough. In order to do so, we assume that the quantile regression estimator used in Algorithm 1 to estimate the critical values is consistent in the following sense:
Assumption 3.2.
Let be the estimated cumulative distribution function of the test statistic conditional on based on a sample size , and let be true conditional distribution. For every , assume that the quantile regression estimator is such that
Under some conditions, Assumption 3.2 holds for instance for quantile regression forests (Meinshausen, 2006).
Next we show that, for every fixed training sample size in Algorithm 3, Algorithm 1 yields a valid hypothesis test as . The result holds even if the likelihood ratio statistic is not well estimated.
Theorem 3.3.
Finally we show that as long as the probabilistic classifier is consistent and the critical values are well estimated (which holds for large according to Theorem 3.3), the power of the ACORE test converges to the power of the LRT as grows.
Theorem 3.4.
Let be the test based on
the statistic for a labeled sample size with critical value .
where , and is such that , then, for every ,
3.2 Confidence Sets
To construct a confidence set for , we use the equivalence of tests and confidence sets (Section 2): Suppose that we for every can find the critical value of a test of (1) with type I error no larger than . The random set
then defines a confidence region for .
However, rather than repeatedly running Algorithm 1 for each null hypothesis separately, we estimate all critical values (for different ) simultaneously. Algorithm 2 outlines our procedure. Again, we use quantile regression to learn a parametrized function . The whole procedure for computing confidence sets via ACORE is summarized in Algorithm 4 in Supp. Mat. B. Theorem 3.3 implies that the constructed confidence set has the nominal confidence level as . The size of the confidence set depends on the training sample size and the classifier.
3.3 Evaluating Empirical Coverage for All Possible Values of
After the parametrized ACORE statistic and the critical values have been estimated, it is important to check whether the resulting confidence sets indeed are valid or, equivalently, if the resulting hypothesis tests have the nominal significance level. We also want to identify regions in parameter space where we clearly overcover. That is, the two main questions are: (i) do the constructed confidence sets satisfy
for every , and (ii) how close is the actual coverage to the nominal confidence level ? To answer these questions, we propose a goodnessoffit procedure where we draw new samples from the simulator given , construct a confidence set for each sample, and then check which computed regions include the “true” . More specifically: we generate a set , where and is a sample of size of i.i.d. observable data from . We then define
where is the confidence set for for data . If has the correct coverage, then
We can estimate the probability using any probabilistic classifier; some methods also provide confidence bands that assess the uncertainty in estimating this quantity (Eubank & Speckman, 1993; Claeskens et al., 2003; Krivobokova et al., 2010). By comparing the estimated probability to , we have a diagnostic tool for checking how close we are to the nominal confidence level over the entire parameter space . See Figure 3 for an example.
Finally note that our procedure parametrizes the coverage of the confidence set as a function of the true parameter value. This is in contrast to other goodnessoffit techniques (e.g., Cook et al. 2006; Bordoloi et al. 2010; Talts et al. 2018; Schmidt et al. 2020) that only check for marginal coverage, i.e., .
4 Toy Examples
We consider two examples where the true likelihood is known. In the first example, the forward model is a model similar to the signalbackground model in Section 5. In the second example, we consider a Gaussian mixture model (GMM) with two unitvariance Gaussians centered at and , respectively. In both examples, , the proposal distribution is a uniform distribution, and the reference distribution is a normal distribution. Table 1 summarizes the setup.
Poisson Model  GMM  

True 
Poisson Model  

Classifier  Cross  Average  Size of  
Entropy Loss  Power  Confidence Set [%]  
100  MLP  0.87 0.27  0.24  75.9 19.3 
NN  0.76 0.15  0.29  71.6 19.7  
QDA  0.66 0.02  0.41  60.0 15.6  
500  MLP  0.69 0.01  0.35  65.9 20.4 
NN  0.67 0.01  0.38  62.9 15.8  
QDA  0.64 0.01  0.47  54.2 9.4  
1000  MLP  0.69 0.01  0.37  63.3 19.8 
NN  0.66 0.01  0.44  56.9 15.9  
QDA  0.64 0.01  0.50  51.3 7.7  
  Exact  0.64 0.01  0.54  45.0 4.9 
GMM  

Classifier  Cross  Average  Size of  
Entropy Loss  Power  Confidence Set [%]  
100  MLP  0.39 0.03  0.88  14.1 4.7 
NN  0.81 0.31  0.42  58.4 23.3  
QDA  0.64 0.02  0.15  85.3 21.1  
500  MLP  0.35 0.01  0.90  12.1 2.4 
NN  0.45 0.05  0.57  44.3 24.1  
QDA  0.62 0.01  0.15  84.9 19.9  
1000  MLP  0.35 0.01  0.90  12.1 2.5 
NN  0.41 0.02  0.77  24.9 15.9  
QDA  0.62 0.01  0.12  88.1 18.0  
  Exact  0.35 0.01  0.92  9.5 2.0 
First we investigate how the power of ACORE and the size of the derived confidence sets depend on the performance of the classifier used in the odds ratio estimation (Section 3.1). We consider three classifiers: multilayer perceptron (MLP), nearest neighbor (NN) and quadratic discriminant analysis (QDA). For different values of (sample size for estimating odds ratios), we compute the binary cross entropy (a measure of classifier performance), the power as a function of , and the size of the constructed confidence set. Table 2 summarizes results based on 100 repetitions. (To compute the critical values in Algorithm 2, we use quantile gradient boosted trees and a large enough sample size to guarantee confidence sets; see Supp. Mat. E.) The last row of the table shows the best attainable cross entropy loss (Supp. Mat. D), the confidence set size and power for the true likelihood function.
For each setting with fixed B, the best classifier according to cross entropy loss achieves the highest power and the smallest confidence set.
Next we illustrate our goodnessoffit procedure (Section 3.3) for checking the coverage of the constructed confidence sets across the parameter space . Figure 3 shows the estimated coverage with logistic regression for the Poisson example with (the training sample size for estimating odds via QDA) and three different values of (the training sample size for estimating the critical value via gradient boosted quantile regression). As expected (Theorem 3.3), the estimated coverage gets closer to the nominal confidence level as increases. We can use these diagnostic plots to choose . For instance, here is large enough for ACORE to achieve good coverage. (See Supp. Mat. E for a detailed analysis of this example.)
5 Signal Detection in High Energy Physics
In order to apply ACORE, we need to choose four key components: (i) a probabilistic classifier, (ii) a training sample size for learning odds ratios, (iii) a quantile regression algorithm, and (iv) a training sample size for estimating critical values. We propose the following practical strategy to choose such components:
We illustrate ACORE and this strategy on a model described in Rolke et al. (2005) and Sen et al. (2009) for a high energy physics (HEP) experiment. In this model, particle collision events are counted under the presence of a background process . The goal is to assess the intensity of a signal (i.e., an event which is not part of the background process). The observed data consist of realizations of , where is the number of events in the signal region, and is the number of events in the background (control) region. (We use a uniform proposal distribution and a Gaussian reference distribution .) This model is a simplified version of a real particle physics experiment where the true likelihood function is not known.
Figure 4 illustrates the role of , , and our goodnessoffit procedure when estimating confidence sets. (For details, see Supp. Mat. F.) In the left panel, we use the true LR statistic to show that, even if the LR is available, estimating the critical value well still matters. Our goodnessoffit diagnostic provides a principled way of choosing the best quantile regression (QR) method and the best sample size for estimating . In this example, random forest QR does not pass our goodnessoffit test; it also leads to a confidence region quite different from the exact one. Deep QR, which passes our test, gives a more accurate region estimate. In the center panel, we use ACORE to estimate both the odds ratio and the critical value (this is the LFI setting). If we choose by identifying when the cross entropy loss levels off, we would choose . Decreasing leads to a worse crossentropy loss and, as the figure shows, also a larger confidence region. Increasing beyond our selected sample size does not lead to substantial gains. The right panel illustrates how our goodnessoffit procedure can be used to identify regions in parameter space where a constructed confidence set is not valid. The heat map refers to an example which did not pass our goodnessoffit procedure. While the overall (marginal) coverage is at the right value, our diagnostic procedure (for estimating coverage as a function of and ) is able to identify undercoverage in lowsignal and highbackground regions. That is, for a valid confidence set, one needs to better estimate the critical value by, e.g., using a different quantile regression estimator or by increasing (either uniformly over the parameter space or by an active learning scheme which increases the number of simulations at parameter settings where one undercovers).
6 Conclusions
In this paper we introduce ACORE, a framework for carrying out frequentist inference in LFI settings. ACORE is well suited for settings with costly simulations, as it efficiently estimates test statistics and critical values across the entire parameter space. We provide a new goodnessoffit procedure for estimating coverage of constructed confidence sets for all possible parameter settings. Even if the likelihood ratio is not well estimated, ACORE provides valid inference as long as hypothesis tests and confidence sets pass our goodnessoffit procedure (albeit at the cost of having less power and larger sets). We provide practical guidance on how to choose the smallest number of simulations to guarantee powerful and valid procedures.
Future studies will investigate active learning strategies, and the role of and in the performance of ACORE. We will also explore other strategies for choosing the number of simulations . For instance, rather than relying on loss functions, such choice can be based on testing if the likelihood function is well estimated using e.g. Dalmasso et al. (2019). Finally, we will include a theoretical study of how the power of the ACORE test relates to classifier performance.
Supplementary Material: Confidence Sets and Hypothesis Testing in a LikelihoodFree Inference Setting
Appendix A Algorithm for Simulating Labeled Sample for Estimating Odds Ratios
Algorithm 3 provides details on how to create the training sample for estimating odds ratios. Out of the total training sample size , a proportion is generated by the stochastic forward model at different parameter values , while the remainder is sampled from a reference distribution .
Appendix B Algorithm for Constructing Confidence Set for
Algorithm 4 summarizes the ACORE procedure for constructing confidence sets. First, we estimate parametrized odds to compute the ACORE test statistic (Eq. 4). Then, we compute a parametrized estimate of the critical values as a conditional distribution function of . Finally, we compute a confidence set for by the Neyman inversion technique (Neyman, 1937).
Appendix C Proofs
Proof of Proposition 3.1.
If , then . By Bayes rule and construction (Algorithm 3),
Thus, the odds ratio at is given by
and therefore
∎
Proof of Theorem 3.3.
The union bound and Assumption 3.2 imply that
It follows that
The result follows from the fact that
and thus
∎
Lemma C.1.
If and , then
Proof.
For every , it follows directly from the properties of convergence in probability that
The conclusion of the lemma follows from the continuous mapping theorem. ∎
Appendix D The Cross Entropy Loss
In this work, we use the cross entropy loss to measure the accuracy of the probabilistic predictions of the classifier. That is, we calibrate the estimated odds function as follows: Consider a sample point generated according to Algorithm 3. Let be a distribution, and be a distribution. The cross entropy between and is given by
For every and , the expected cross entropy is minimized by . Thus we can measure the performance of an estimator of the odds by the risk
The cross entropy loss is not the only loss function that is minimized by the true odds function, but it is usually easy to compute in practice. It is also well known that minimizing the cross entropy loss between the estimated distribution and the true distribution during training is equivalent to minimizing the KullbackLeibler (KL) divergence between the two distributions, as
where is the cross entropy and is the entropy of the true distribution. By Gibbs’ inequality (MacKay, 2002), we have that ; hence the entropy of the true distribution lower bounds the cross entropy with the minimum achieved when . Hence, we can connect the cross entropy loss to the ACORE statistic.
Proposition D.1.
Proof of Prop d.1.
The proof follows from Proposition 3.1 and the expected cross entropy loss is minimized if and only if . ∎
In addition, we show that the convergence of the class posterior implies the convergence of the cross entropy to the entropy of the true distribution. This supports our decision to use the cross entropy loss when selecting the probabilistic classifier and sample size .
Lemma D.2.
If for every
then
Proof of Lemma d.2.
We can rewrite the cross entropy and entropy as
In addition, for any , it also holds that . The lemma follows by combining the dominated convergence theorem with the continuous mapping theorem for the logarithm. ∎
Appendix E Toy Examples
This section provides details on the toy examples of Section 4. We use the sklearn ecosystem Pedregosa et al. (2011) implementation of the following probabilistic classifiers:

multilayer perceptron (MLP) with default parameters, but no regularization ();

quadratic discriminant analysis (QDA) with default parameters;

nearest neighbors (NN) classifier, with number of neighbors equal to the rounded square root of the number of data points available (as per Duda et al. (2001)).
Table 3 reports the observed coverage for the settings of Tables 1 and 2. Critical values or are estimated with quantile gradient boosted trees ( trees with maximum depth equal to ), a training sample size , observed data of sample size , nominal coverage of , and averaging over repetitions. The table shows that we for all cases achieve results in line with the nominal
confidence level.
Classifier  Poisson Model  GMM  
Coverage  Coverage  
100  MLP  0.91  0.87 
NN  0.91  0.91  
QDA  0.90  0.88  
500  MLP  0.91  0.91 
NN  0.93  0.95  
QDA  0.94  0.92  
1000  MLP  0.91  0.92 
NN  0.89  0.88  
QDA  0.91  0.93 
Our goodnessoffit procedure shown in Figure 3 uses a set with size (as defined in Section 3.3); Figure 5 shows the goodnessoffit plot for the Gaussian mixture model, where the coverage is estimated via logistic regression and the critical values are estimated via quantile gradient boosted trees. For the Poisson model a training sample size of seems to be enough to achieve correct coverage, whereas the Gaussian mixture model requires .
Next we compare our goodnessoffit diagnostic with diagnostics obtained via standard Monte Carlo sampling. Figure 6 shows the MC coverage as a function of for the Poisson (left) and the Gaussian mixture model (right). In both cases 100 MC samples are drawn at 100 parameter values chosen uniformly. The empirical ACORE coverage is computed over the MC samples at each chosen . This MC procedure is expensive: it uses a total of simulations, which is times the number used in our goodnessoffit procedure. The observed coverage of the Poisson model (Figure 6, left) indicates that is sufficient to achieve the nominal coverage of . For the Gaussian mixture model (Figure 6, right), we detect undercoverage for very small values of . This discrepancy is due to the fact that, at , the mixture collapses into a single Gaussian, structurally different from the model at any other and closer to the reference distribution.
Our goodnessoffit procedure is able to identify that the actual coverage is far from the nominal coverage at small values of , when the training sample size for estimating is too small. More specifically, Figure 5 shows a noticeable tilt in the prediction bands for and . However, as increases, the estimation of critical values becomes more precise and the model passes our goodnessoffit diagnostic at, for example, . Future studies will provide a more detailed account on how such boundary effects depend on the method for estimating the coverage.
Appendix F Signal Detection in High Energy Physics
Here we consider the signal detection example in Section 5 We detail the construction of ACORE confidence sets and the strategy in Section 5 for choosing ACORE components and parameters. For learning the odds ratio, we compared the following classifiers:

logistic regression,

quadratic discimininant analysis (QDA) classifier,

nearest neighbor classifier,

gradient boosted trees using trees with maximum depth ,

Gaussian process classifiers
^{5} with radial basis functions kernels with variance , 
feedforward deep neural networks, with deep layers, number of neurons between and either ReLu or hyperbolic tangent activations.
For estimating the critical values, we considered the following quantile regression algorithms:

gradient boosted trees using trees with maximum depth ,

random forest quantile regression with trees,

deep quantile regression with deep layers, neurons and ReLu activations (using the PyTorch implementation Paszke et al. (2019)).
All computations were performed on 8Core Intel Xeon CPUs X5680 at 3.33GHz servers.
Figure 7 illustrates the two steps in identifying the four components of ACORE. We first use a validation set of simulations to determine which probabilistic classifier and training sample size minimize the cross entropy loss. Figure 7 (left) shows the cross entropy loss of the best four classifiers as function of . The minimum is achieved by a 5layer deep neural network (DNN) at with a cross entropy loss of , closely followed by QDA with at . We follow up on both classifiers. In Figure 7 (right), the “estimated correct coverage” represents the proportion of the parameter space that passes our diagnostic procedure. The lowest with correct coverage is achieved by the fivelayer DNN classifier (for estimating odds ratios) at , with critical values estimated via a twolayer deep quantile regression algorithm. None of the quantile regression algorithms pass a diagnostic test with a nominal coverage of at the one standard deviation level when using the QDA classifier. We therefore do not use QDA in Section 5.
Based on the analysis above, we choose the following ACORE components: (i) a fivelayer DNN for learning odds ratios, (ii) , (iii) a twolayer deep quantile regression for estimating critical values, and (iv) . Figure 4 shows the confidence sets computed with this choice.
Footnotes
 That is,