We propose a general method for constructing hypothesis tests and confidence sets that have finite sample guarantees without regularity conditions. We refer to such procedures as “universal.” The method is very simple and is based on a modified version of the usual likelihood ratio statistic, which we call “the split likelihood ratio test” (split LRT). The method is especially appealing for irregular statistical models. Canonical examples include mixture models and models that arise in shape-constrained inference. Constructing tests and confidence sets for such models is notoriously difficult. Typical inference methods, like the likelihood ratio test, are not useful in these cases because they have intractable limiting distributions. In contrast, the method we suggest works for any parametric model and also for some nonparametric models. The split LRT can also be used with profile likelihoods to deal with nuisance parameters, and it can be run sequentially to yield anytime-valid $p$-values and confidence sequences.
Universal Inference using the Split Likelihood Ratio Test
Larry Wasserman, Aaditya Ramdas, Sivaraman Balakrishnan
Department of Statistics and Data Science, Carnegie Mellon University
The foundations of statistics are built on a variety of generally applicable principles for parametric estimation and inference. In parametric statistical models, the likelihood ratio test and confidence intervals obtained from asymptotically Normal estimators are the workhorse tools for hypothesis testing and interval estimation. Often, the validity of these methods relies on large sample asymptotic theory and requires that the statistical model satisfy certain conditions, known as regularity conditions; see Section 3 for a precise definition. When these conditions do not hold, there is no general method for statistical inference. In this paper, we introduce a general method which yields tests and confidence sets for any statistical model and has finite sample guarantees.
We begin with some terminology. A parametric statistical model is a collection of distributions $\{P_\theta : \theta \in \Theta\}$ for an arbitrary set $\Theta$. When the aforementioned regularity conditions hold, there are many methods for inference. For example, suppose that $\Theta \subseteq \mathbb{R}^d$ and let $c_\alpha$ be the upper $\alpha$ quantile of a $\chi^2_d$ distribution. The set
$$C_n = \left\{\theta \in \Theta : 2 \log \frac{\mathcal{L}(\hat\theta)}{\mathcal{L}(\theta)} \le c_\alpha \right\} \qquad (2)$$
is the likelihood ratio confidence set and satisfies the coverage guarantee
$$P_{\theta^*}\left(\theta^* \in C_n\right) \to 1 - \alpha \qquad (3)$$
as $n \to \infty$, where $\theta^*$ denotes the unknown true parameter, $\mathcal{L}$ is the likelihood function and $\hat\theta$ is the maximum likelihood estimator (MLE).
Constructing tests and confidence intervals for irregular models — where the regularity conditions do not hold — is very difficult. An example is mixture models. In this case we observe $Y_1, \ldots, Y_n \sim P$ and we want to test
$$H_0 : P \in \mathcal{M}_k \quad \text{versus} \quad H_1 : P \notin \mathcal{M}_k, \qquad (1)$$
where $\mathcal{M}_k$ denotes the set of mixtures of $k$ Gaussians. Finding a test that provably controls the type I error at a given level $\alpha$ has been elusive. A natural candidate is the likelihood ratio statistic but this turns out to have an intractable limiting distribution (Dacunha-Castelle and Gassiat (1997)). As we discuss further in Section 4, developing practical, simple tests for this pair of hypotheses is an active area of research (see for instance (Chen and Li, 2009; McLachlan, 1987; Chakravarti et al., 2019) and references therein). In this note, we show that there is a remarkably simple test with guaranteed finite sample control of the type I error. Similarly, we construct a confidence set with guaranteed finite sample coverage. These tests and confidence sets can in fact be used for any model. But in well-behaved statistical models they will not be optimal and thus are most useful in irregular models where optimality may be less of a concern. Going beyond parametric models, we briefly show that our methods can also be used for some nonparametric models, and have a sequential analog.
2 Universal Inference
Let $Y_1, \ldots, Y_n$ be an iid sample from a distribution $P_{\theta^*}$ in a family $\{P_\theta : \theta \in \Theta\}$. Note that $\theta^*$ denotes the true value of the parameter. We assume that each distribution $P_\theta$ has a density $p_\theta$ with respect to Lebesgue measure. Split the data into two groups $\mathcal{D}_0$ and $\mathcal{D}_1$. For simplicity, we take each group to be of size $n/2$, but this is not necessary. Let $\hat\theta_1$ be an estimator constructed from $\mathcal{D}_1$. This can be any estimator: the MLE, a Bayes estimator that utilizes prior knowledge, a robust estimator, etc. Let $\mathcal{L}_0(\theta) = \prod_{i \in \mathcal{D}_0} p_\theta(Y_i)$ denote the likelihood function based on $\mathcal{D}_0$.
We define the split likelihood ratio statistic (split LRS) as
$$T_n(\theta) = \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta)}.$$
Then, the universal confidence set is
$$C_n = \left\{\theta \in \Theta : T_n(\theta) \le \frac{1}{\alpha}\right\}. \qquad (4)$$
$C_n$ is a finite-sample valid confidence set for $\theta^*$, meaning that
$$P_{\theta^*}\left(\theta^* \in C_n\right) \ge 1 - \alpha.$$
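To make the construction concrete, here is a minimal sketch for a toy model: a unit-variance Gaussian with unknown mean, with the universal set computed by a grid search. The model, sample size, seed and grid are our own illustrative choices, not part of the method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data from N(theta*, 1); theta* = 0 plays the role of the unknown truth.
n = 200
y = rng.normal(0.0, 1.0, size=n)

# Split into D0 and D1 (equal sizes, as in the text).
d0, d1 = y[: n // 2], y[n // 2:]

# Any estimator from D1 works; here we use the MLE (the sample mean).
theta_hat_1 = d1.mean()

def log_lik(theta, data):
    """Gaussian log-likelihood with known unit variance (up to a constant)."""
    return -0.5 * np.sum((data - theta) ** 2)

# Universal set: all theta with L0(theta_hat_1) / L0(theta) <= 1/alpha.
alpha = 0.1
grid = np.linspace(-5.0, 5.0, 4001)
log_T = log_lik(theta_hat_1, d0) - np.array([log_lik(t, d0) for t in grid])
conf_set = grid[log_T <= np.log(1.0 / alpha)]
```

No asymptotic approximation or $\chi^2$ quantile is used; since the split log likelihood ratio is a convex quadratic in $\theta$ here, the resulting set is an interval.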
If we did not split the data, then $T_n(\theta)$ — with $\hat\theta_1$ replaced by the MLE computed from all the data — would be the usual likelihood ratio statistic and we would typically approximate its distribution using an asymptotic argument. For example, as mentioned earlier, in regular models, 2 times the log likelihood ratio statistic has, asymptotically, a $\chi^2$ distribution. But, in irregular models this strategy can fail. Indeed, finding or approximating the distribution of the likelihood ratio statistic is highly nontrivial in irregular models. The split LRS avoids these complications.
Now we explain why $C_n$ has coverage at least $1 - \alpha$, as claimed by the above theorem. Consider any fixed value of $\hat\theta_1$ and let $\mathcal{Y}$ denote the support of $P_{\theta^*}$. Then, we have
$$\int_{\mathcal{Y}^{n/2}} \prod_{i \in \mathcal{D}_0} \frac{p_{\hat\theta_1}(y_i)}{p_{\theta^*}(y_i)} \prod_{i \in \mathcal{D}_0} p_{\theta^*}(y_i)\, dy = \int_{\mathcal{Y}^{n/2}} \prod_{i \in \mathcal{D}_0} p_{\hat\theta_1}(y_i)\, dy \le 1. \qquad (5)$$
Since $\hat\theta_1$ is fixed when we condition on $\mathcal{D}_1$, this implies that
$$\mathbb{E}_{\theta^*}\left[T_n(\theta^*) \mid \mathcal{D}_1\right] \le 1, \quad \text{and hence} \quad \mathbb{E}_{\theta^*}\left[T_n(\theta^*)\right] \le 1. \qquad (6)$$
Now, using Markov's inequality,
$$P_{\theta^*}\left(\theta^* \notin C_n\right) = P_{\theta^*}\left(T_n(\theta^*) > \frac{1}{\alpha}\right) \le \alpha\, \mathbb{E}_{\theta^*}\left[T_n(\theta^*)\right] \le \alpha.$$
This completes a simple proof of the theorem. Now we turn to hypothesis testing. Let $\Theta_0 \subset \Theta$ be a composite null set and consider testing
$$H_0 : \theta^* \in \Theta_0 \quad \text{versus} \quad H_1 : \theta^* \notin \Theta_0. \qquad (7)$$
The alternative above can be replaced by $H_1 : \theta^* \in \Theta_1$ for any $\Theta_1 \subseteq \Theta \setminus \Theta_0$, or by $H_1 : \theta^* \in \Theta$. One way to test this hypothesis is based on the universal confidence set in (4). We simply reject the null hypothesis if $C_n \cap \Theta_0 = \emptyset$. It is straightforward to see that if this test makes a type I error then the universal confidence set must fail to cover $\theta^*$, and so the type I error is at most $\alpha$.
We present an alternative method that is often computationally more attractive. As before, let $\hat\theta_1$ be any estimator constructed from $\mathcal{D}_1$, and let
$$\hat\theta_0 = \arg\max_{\theta \in \Theta_0} \mathcal{L}_0(\theta)$$
be the MLE under $H_0$ constructed from $\mathcal{D}_0$. Then the universal test, which we call the split likelihood ratio test (split LRT), is defined as:
$$\text{reject } H_0 \text{ if } U_n > \frac{1}{\alpha}, \quad \text{where} \quad U_n = \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\hat\theta_0)}. \qquad (8)$$
The split LRT controls the type I error at level $\alpha$, i.e., $\sup_{\theta^* \in \Theta_0} P_{\theta^*}(U_n > 1/\alpha) \le \alpha$.
The proof of the above theorem is straightforward. Suppose that $H_0$ is true and let $\theta^* \in \Theta_0$ denote the true parameter. By Markov's inequality, the type I error is
$$P_{\theta^*}\left(U_n > \frac{1}{\alpha}\right) \le \alpha\, \mathbb{E}_{\theta^*}\left[U_n\right] \overset{(i)}{\le} \alpha\, \mathbb{E}_{\theta^*}\!\left[\frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta^*)}\right] \overset{(ii)}{\le} \alpha.$$
Here, inequality (i) uses the fact that $\mathcal{L}_0(\hat\theta_0) \ge \mathcal{L}_0(\theta^*)$, which is true because $\hat\theta_0$ is the MLE over $\Theta_0$ and $\theta^* \in \Theta_0$, and inequality (ii) follows by conditioning on $\mathcal{D}_1$ as argued earlier in (6).
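To see the guarantee in action, the split LRT can be simulated under a simple null. The following sketch (our own toy setup: a unit-variance Gaussian with the simple null $\Theta_0 = \{0\}$, so the null MLE is trivial) estimates the type I error over repeated draws; by the argument above it must be at most $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha, n, reps = 0.1, 100, 2000

def log_lik(theta, data):
    """Gaussian log-likelihood with unit variance (up to a constant)."""
    return -0.5 * np.sum((data - theta) ** 2)

rejections = 0
for _ in range(reps):
    y = rng.normal(0.0, 1.0, size=n)    # data drawn under the null theta* = 0
    d0, d1 = y[: n // 2], y[n // 2:]
    theta_hat_1 = d1.mean()             # estimator from D1
    # Null MLE from D0 is trivially theta = 0 since Theta_0 = {0}.
    log_U = log_lik(theta_hat_1, d0) - log_lik(0.0, d0)
    rejections += (log_U > np.log(1.0 / alpha))

type1_rate = rejections / reps
```

The test is conservative: its type I error is guaranteed to be at most $\alpha$ but is usually much smaller in practice.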
We call these procedures universal to mean that they are valid in finite samples with no regularity conditions. Constructions like this are reminiscent of ideas used in sequential settings where an estimator is computed from past data and the likelihood is evaluated on current data; see, for example, Darling and Robbins (1967). We will expand on this connection later.
We note in passing that another universal method is the following. Define
$$T_n'(\theta) = \frac{\int \mathcal{L}(\phi)\, \pi(\phi)\, d\phi}{\mathcal{L}(\theta)},$$
where $\mathcal{L}$ is the full likelihood (from all the data) and $\pi$ is any prior. The set $\{\theta : T_n'(\theta) \le 1/\alpha\}$ is also a universal confidence set, but this method requires specifying a prior and doing an integral. In irregular models the integral will typically be intractable.
3 Sanity Check: Regular Models
Although universal methods are not needed for well-behaved models, it is worth checking their behavior in these cases. We would expect that the confidence set would not have optimal length but we would hope that it still shrinks at the optimal rate. We now confirm that this is true.
The following are standard definitions that can be found, for example, in Van der Vaart (2000). The model is differentiable in quadratic mean (DQM) at $\theta$ if there exists a function $s_\theta$ (the score function) such that
$$\int \left[\sqrt{p_{\theta+h}(y)} - \sqrt{p_\theta(y)} - \tfrac{1}{2} h^\top s_\theta(y) \sqrt{p_\theta(y)}\right]^2 d\mu(y) = o\left(\|h\|^2\right)$$
as $h \to 0$, where $\mu$ denotes the Lebesgue measure (more generally, the common measure with respect to which the densities $p_\theta$ are defined). Assume that $\Theta$ is an open set and that the model is DQM at each point. Then the score $s_\theta$ and the Fisher information $I_\theta = \mathbb{E}_\theta[s_\theta s_\theta^\top]$ exist. We then have (see for instance Theorem 7.2 of Van der Vaart (2000)) that, for any fixed $h$,
$$\sum_{i=1}^{n} \log \frac{p_{\theta + h/\sqrt{n}}(Y_i)}{p_\theta(Y_i)} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} h^\top s_\theta(Y_i) - \frac{1}{2} h^\top I_\theta h + o_P(1). \qquad (9)$$
Now assume that $\theta^*$ is the true parameter, and that $\hat\theta_1 = \theta^* + O_P(1/\sqrt{n})$, which is true for many estimators in regular models. We also assume that $p_\theta(y)$ is a smooth function of $\theta$ (bounded third derivatives) for all $y$. The optimal confidence set has diameter $O_P(1/\sqrt{n})$.
We show the same is true of the universal set. Consider any $\theta$ such that $\|\theta - \theta^*\| \ge a_n/\sqrt{n}$ for some sequence $a_n \to \infty$. We will show that, with high probability, $\theta \notin C_n$. Now
$$\log T_n(\theta) = \log \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0(\theta^*)} + \log \frac{\mathcal{L}_0(\theta^*)}{\mathcal{L}_0(\theta)}.$$
The first term is bounded in probability by (9). By a Taylor expansion, $K(\theta^*, \theta) \asymp \|\theta^* - \theta\|^2$, where $K$ is the Kullback-Leibler distance. We use the notation $\lesssim$, $\gtrsim$ and $\asymp$ to denote inequalities which are true up to some universal positive constant. Furthermore, we have that
$$\log \frac{\mathcal{L}_0(\theta^*)}{\mathcal{L}_0(\theta)} = \frac{n}{2}\, \widehat{K}(\theta^*, \theta),$$
where $\widehat{K}(\theta^*, \theta)$ is the sample Kullback-Leibler distance given by
$$\widehat{K}(\theta^*, \theta) = \frac{2}{n} \sum_{i \in \mathcal{D}_0} \log \frac{p_{\theta^*}(Y_i)}{p_\theta(Y_i)}.$$
Now, assuming that $\widehat{K}(\theta^*, \theta) \asymp K(\theta^*, \theta)$ in a neighborhood of $\theta^*$, we have $\log(\mathcal{L}_0(\theta^*)/\mathcal{L}_0(\theta)) \gtrsim n \|\theta^* - \theta\|^2 \ge a_n^2$. Therefore,
$$T_n(\theta) \to \infty$$
in probability, so that $\theta \notin C_n$ with probability tending to 1. Thus for regular models, the universal confidence set also has diameter $O_P(1/\sqrt{n})$.
Mixture Models. As a proof-of-concept, we do a small simulation to check the type I error and power for mixture models. Specifically, let $\mathcal{M}_k$ denote the set of mixtures of $k$ univariate Gaussians. We want to distinguish the hypotheses in (1). For this brief example, we take the null to be a single Gaussian ($k = 1$) and fit the alternative within $\mathcal{M}_2$.
Finding a test that provably controls the type I error at a given level has been elusive. A natural candidate is the likelihood ratio statistic but, as mentioned earlier, this has an intractable limiting distribution. To the best of our knowledge, the only practical test for the above hypothesis with a tractable limiting distribution is the EM test due to Chen and Li (2009). This very clever test is similar to the likelihood ratio test except that it includes some penalty terms and requires the maximization over some of the parameters to be restricted. However, the test requires choosing some tuning parameters and, more importantly, it is restricted to one-dimensional problems. There is no known confidence set for mixture problems with guaranteed coverage properties. Another approach is the bootstrap (McLachlan, 1987) but there is no proof of the validity of the bootstrap for mixture models.
Figure 1 shows the power of the test when $\hat\theta_1$ is the MLE under the full model $\mathcal{M}_2$. The true model is taken to be $\frac{1}{2}\phi(y; -\delta) + \frac{1}{2}\phi(y; \delta)$, where $\phi(\cdot\,; \mu)$ is a Normal density with mean $\mu$ and variance 1. The null corresponds to $\delta = 0$. The MLE is obtained by the EM algorithm, which we assume converges on this simple problem. Understanding the local and global convergence (and non-convergence) of the EM algorithm to the MLE is an active research area (see for instance (Balakrishnan et al., 2017; Xu et al., 2016; Jin et al., 2016) and references therein), but is beyond the scope of this paper. As expected, the test is conservative with type I error near 0. However, the test begins to show reasonable power once $\delta$ is large enough.
Figure 1 also shows the power of the bootstrap test (McLachlan (1987)). Here, the $p$-value is obtained by bootstrapping the likelihood ratio test statistic under the estimated null distribution. As expected, this has higher power than the universal test since it does not split the data, but unfortunately this test does not have any guarantee on the type I error, even asymptotically. The lower power of the universal test is the price paid for having a finite sample guarantee. It is worth noting that the bootstrap test requires running the EM algorithm for each bootstrap sample while the universal test only requires one EM run.
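As a sketch of how such a simulation can be set up, the following is our own minimal implementation of the split LRT for one Gaussian versus a two-component mixture. The equal mixing weights, unit variances, fixed number of EM iterations, and all parameter values are simplifying assumptions; this is not the exact code behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def em_two_means(y, iters=200):
    """EM for the mixture 0.5 N(mu1, 1) + 0.5 N(mu2, 1); returns (mu1, mu2)."""
    mu1, mu2 = y.mean() - y.std(), y.mean() + y.std()   # asymmetric init
    for _ in range(iters):
        a = np.exp(-0.5 * (y - mu1) ** 2)               # E-step: responsibilities
        b = np.exp(-0.5 * (y - mu2) ** 2)
        r = a / (a + b)
        mu1 = np.sum(r * y) / np.sum(r)                 # M-step: weighted means
        mu2 = np.sum((1 - r) * y) / np.sum(1 - r)
    return mu1, mu2

def mix_loglik(y, mu1, mu2):
    dens = 0.5 * np.exp(-0.5 * (y - mu1) ** 2) + 0.5 * np.exp(-0.5 * (y - mu2) ** 2)
    return np.sum(np.log(dens)) - 0.5 * len(y) * np.log(2 * np.pi)

# Data from a well-separated two-component mixture (delta = 2).
n, delta, alpha = 2000, 2.0, 0.05
y = np.where(rng.random(n) < 0.5, rng.normal(-delta, 1.0, n), rng.normal(delta, 1.0, n))
d0, d1 = y[: n // 2], y[n // 2:]

mu1, mu2 = em_two_means(d1)                 # theta_hat_1: EM fit on D1
theta0 = d0.mean()                          # null MLE on D0 (single Gaussian, variance 1)
null_loglik = -0.5 * np.sum((d0 - theta0) ** 2) - 0.5 * len(d0) * np.log(2 * np.pi)
log_U = mix_loglik(d0, mu1, mu2) - null_loglik
rejected = log_U > np.log(1.0 / alpha)
```

Only a single EM run on $\mathcal{D}_1$ is needed, and the type I error control does not require EM to find the global MLE there; the null MLE on $\mathcal{D}_0$ (here a sample mean) is exact.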
Nonparametric Example: Shape Constrained Inference. A density $p$ is log-concave if $p = e^g$ for some concave function $g$. Consider testing $H_0$: $p^*$ is log-concave versus $H_1$: $p^*$ is not log-concave. Let $\mathcal{P}_{LC}$ be the set of log-concave densities and let $\hat p_0$ denote the nonparametric maximum likelihood estimator over $\mathcal{P}_{LC}$ computed using $\mathcal{D}_0$ (Axelrod et al. (2019); Cule et al. (2010)), which can be computed in polynomial time. Let $\hat p_1$ be any nonparametric density estimator, such as the kernel density estimator (Silverman (2018)), fit on $\mathcal{D}_1$. In this case, the universal test is to reject when
$$\prod_{i \in \mathcal{D}_0} \frac{\hat p_1(Y_i)}{\hat p_0(Y_i)} > \frac{1}{\alpha}.$$
To the best of our knowledge this is the first test for this problem with a finite sample guarantee. Under the assumption that $p^*$ is log-concave, the universal confidence set is
$$C_n = \left\{p \in \mathcal{P}_{LC} : \prod_{i \in \mathcal{D}_0} \frac{\hat p_1(Y_i)}{p(Y_i)} \le \frac{1}{\alpha}\right\}.$$
While the aforementioned test can be efficiently performed, the set may be hard to explicitly represent, but its membership can be queried efficiently (since that amounts to doing a test). We also point out that these two examples can be combined: the universal test can be used for mixtures of log-concave densities to choose the number of components. However, as with the mixtures of Gaussians example, the convergence of the EM algorithm to the MLE for a mixture of log-concave densities needs to be established in order to conclude that the universal methods are valid.
Sieves. A sieve (Shen and Wong (1994)) is a sequence of nested models $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \cdots$. Sieves are a general approach to nonparametric inference. If we assume that the true density $p^*$ is in $\mathcal{P}_k$ for some finite (unknown) $k$, then universal testing can be used to choose the model. One possibility is to test the hypotheses $H_{0,j} : p^* \in \mathcal{P}_j$ one by one for $j = 1, 2, \ldots$. We reject $H_{0,j}$ if
$$\frac{\mathcal{L}_0(\hat p)}{\mathcal{L}_0(\hat p_j)} > \frac{1}{\alpha},$$
where $\hat p$ is any density estimator computed from $\mathcal{D}_1$ and $\hat p_j$ is the MLE in model $\mathcal{P}_j$ computed from $\mathcal{D}_0$. Then we take $\hat k$ to be the first $j$ such that $H_{0,j}$ is not rejected, and proclaim that $p^* \in \mathcal{P}_j$ only for $j \ge \hat k$. Even though we test multiple different hypotheses and stop at a random $\hat k$, this procedure still controls the type I error, meaning that
$$P\left(\hat k \le k^*\right) \ge 1 - \alpha, \quad \text{where } k^* := \min\{j : p^* \in \mathcal{P}_j\},$$
so that our proclamation is correct with high probability. The reason we do not need to correct for multiple testing is because a type I error can occur only once we have reached the first $j$ such that $p^* \in \mathcal{P}_j$.
Independence versus Conditional Independence. Guo and Richardson (2019) consider the following problem. The data are trivariate vectors of the form $(X, Y, Z)$, which are modelled as trivariate Normal. The goal is to test $H_0$: $X$ and $Y$ are independent versus $H_1$: $X$ and $Y$ are independent given $Z$. The motivation for this test is that this problem arises in the construction of causal graphs. It is surprisingly difficult to test these non-nested hypotheses. Indeed, Guo and Richardson (2019) carefully study the subtleties of the problem and they show that the limiting distribution of the likelihood ratio statistic is complicated and cannot be used for testing. They propose a new test based on a concept called envelope distributions. Despite the fact that the hypotheses are non-nested, the universal test is applicable and can be used quite easily for this problem. We leave it to future work to compare the power of the universal test and the envelope test.
Uniform Distribution. All methods have limitations of course. Now we give an example where the universal method gives confidence sets that have poor behavior. Suppose that $p_\theta$ is the uniform density on $[0, \theta]$. Let us take $\hat\theta_1$ to be the MLE from $\mathcal{D}_1$. Thus, $\hat\theta_1$ is the maximum of the data points in $\mathcal{D}_1$. Now $\mathcal{L}_0(\hat\theta_1) = 0$ whenever $\hat\theta_1 < M_0$, where $M_0$ is the maximum of the data points in $\mathcal{D}_0$. It follows that $T_n(\theta) = 0$ for every $\theta \ge M_0$ whenever $\hat\theta_1 < M_0$, which happens with probability 1/2. In that case $C_n$ contains every $\theta \ge M_0$: it has the required coverage but is too large to be useful. This happens because the densities $p_\theta$ have different supports for different $\theta$. One can partially avoid this behavior by choosing $\hat\theta_1$ to not be the MLE, but some other (more conservative but more robust) estimator. However, this is a case where we would not recommend the universal method since there are simple, efficient confidence intervals for this model.
The universal method involves randomly splitting the data and the final inferences will depend on the randomness of the split. We can get rid of the randomness, at the cost of more computation, by using many splits. The key property that we used in both the universal confidence set and the split LRT is that $\mathbb{E}_{\theta^*}[T_n] \le 1$, where $T_n = \mathcal{L}_0(\hat\theta_1)/\mathcal{L}_0(\theta^*)$. Imagine that we obtained $K$ such statistics $T_n^{(1)}, \ldots, T_n^{(K)}$, each with the same property: $\mathbb{E}_{\theta^*}[T_n^{(k)}] \le 1$. Let
$$\bar{T}_n = \frac{1}{K} \sum_{k=1}^{K} T_n^{(k)}.$$
Then we still have that $\mathbb{E}_{\theta^*}[\bar{T}_n] \le 1$, and so inference using our universal methods can proceed using the combined statistic $\bar{T}_n$. Note that this is true regardless of the dependence between the statistics.
Using the aforementioned idea we can immediately design natural variants of the universal method:
Cross-fitting. In the universal method, we split the data only once into two parts, but which part we called $\mathcal{D}_0$ versus $\mathcal{D}_1$ is arbitrary. So we can run the universal method once, then swap the roles of $\mathcal{D}_0$ and $\mathcal{D}_1$ and run the method again, and average the two statistics.
K-fold. There is no reason to restrict the cross-fitting idea to two folds. We may split the data once into $K$ folds. Then repeat the following $K$ times: use $K - 1$ folds to calculate $\hat\theta_1$, and evaluate the likelihood ratio on the last fold. Finally, average the $K$ statistics. Alternatively, we could also use one fold to calculate $\hat\theta_1$ and evaluate the likelihood on the other $K - 1$ folds.
Subsampling. We do not need to split the data just once into $K$ folds. We can repeat the previous procedure for repeated random splits of the data into $K$ folds. We might expect this averaging to reduce the variance that arises from the randomness associated with splitting.
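As a sketch, the $K$-fold variant can be implemented as below (our own toy example: a unit-variance Gaussian with the simple null $\theta = 0$). Each fold yields a statistic with null expectation at most one, so Markov's inequality applies to the average exactly as before.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, size=300)   # data drawn under the null theta* = 0
alpha, K = 0.05, 5
folds = np.array_split(y, K)

def log_lik(theta, data):
    """Gaussian log-likelihood with unit variance (up to a constant)."""
    return -0.5 * np.sum((data - theta) ** 2)

# Estimate theta on K-1 folds, evaluate the likelihood ratio on the held-out
# fold, and average the K resulting statistics.
stats = []
for k in range(K):
    held = folds[k]
    rest = np.concatenate([folds[j] for j in range(K) if j != k])
    theta_hat = rest.mean()
    stats.append(np.exp(log_lik(theta_hat, held) - log_lik(0.0, held)))

T_bar = np.mean(stats)               # E[T_bar] <= 1 under the null
reject = T_bar > 1.0 / alpha
```

The averaged statistic is valid for any dependence among the folds, since linearity of expectation needs no independence.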
We remark that all these variants allow a large amount of flexibility. For example, in cross-fitting, $\hat\theta_1$ need not be computed the same way in both splits: it could be the MLE on one split, but a Bayesian estimator on the other. This flexibility could be useful if the user does not know in advance which variant would lead to higher power and would like to hedge across multiple natural choices. Similarly, in the $K$-fold version, if a user is unsure whether to evaluate the likelihood ratio on one fold or on $K - 1$ folds, then they can do both and average the statistics.
Of course, with such flexibility comes the risk of an analyst cherry-picking the variant after looking at which form of averaging results in the highest likelihood ratio (this would correspond to taking the maximum instead of the average of multiple variants), but this is a broader issue. For this reason (and this reason alone), the cross-fitting variant may be a useful default in practice, since it is both conceptually simple and computationally the cheapest alternative.
We leave a detailed analysis of the power of these variants to future work.
Profile likelihood and nuisance parameters.
Suppose that we are interested in some function $\psi^* = g(\theta^*)$. Let $\Psi_n = \{g(\theta) : \theta \in C_n\}$. By construction, $\Psi_n$ is a $1 - \alpha$ confidence set for $\psi^*$. Further, we can define the profile likelihood function
$$\mathcal{L}_0^{\dagger}(\psi) = \sup_{\theta\,:\,g(\theta) = \psi} \mathcal{L}_0(\theta)$$
and rewrite $\Psi_n$ as
$$\Psi_n = \left\{\psi : \frac{\mathcal{L}_0(\hat\theta_1)}{\mathcal{L}_0^{\dagger}(\psi)} \le \frac{1}{\alpha}\right\}.$$
In other words, the same data splitting idea works for the profile likelihood too. As a particularly useful example, suppose $\theta = (\beta, \xi)$ is partitioned into a component of interest $\beta$ and a nuisance component $\xi$; then we can define $g(\theta) = \beta$ to obtain a universal confidence set for only the component we care about.
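As an illustration, in a Gaussian model with unknown mean and a nuisance variance, the profile likelihood has a closed form, so the universal confidence set for the mean alone is easy to compute by a grid search. The sketch below uses our own illustrative parameter values and grid.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(2.0, 3.0, size=400)   # mean of interest 2, nuisance sd 3
d0, d1 = y[::2], y[1::2]             # interleaved split into D0 and D1
alpha = 0.1

def loglik(mu, sigma, data):
    """Full Gaussian log-likelihood at (mu, sigma)."""
    return np.sum(-np.log(sigma) - 0.5 * np.log(2 * np.pi)
                  - 0.5 * (data - mu) ** 2 / sigma ** 2)

def profile_loglik(mu, data):
    """sup over sigma of the Gaussian log-likelihood at mean mu (closed form)."""
    s2 = np.mean((data - mu) ** 2)   # profiled-out variance
    return -0.5 * len(data) * (np.log(2 * np.pi * s2) + 1.0)

mu1, s1 = d1.mean(), d1.std()        # full MLE (mean, sd) from D1
num = loglik(mu1, s1, d0)            # L0(theta_hat_1)

# Universal set for the mean: profile likelihood in the denominator.
grid = np.linspace(-2.0, 6.0, 1601)
log_T = num - np.array([profile_loglik(m, d0) for m in grid])
ci = grid[log_T < np.log(1.0 / alpha)]
```

Since the profile log-likelihood is unimodal in the mean here, the resulting set is an interval.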
Robustness via power likelihoods.
It has been suggested by some authors — see, for example, Royall and Tsou (2003); Miller and Dunson (2019); Holmes and Walker (2017); Grünwald and Van Ommen (2017) — that inferences can be made robust by replacing the likelihood with the power likelihood $\prod_i p_\theta(Y_i)^{\gamma}$ for some $0 < \gamma < 1$. If we do this, then we see that
$$\mathbb{E}_{\theta^*}\!\left[\prod_{i \in \mathcal{D}_0} \left(\frac{p_{\hat\theta_1}(Y_i)}{p_{\theta^*}(Y_i)}\right)^{\!\gamma} \,\middle|\, \mathcal{D}_1\right] = \prod_{i \in \mathcal{D}_0} \int p_{\hat\theta_1}^{\gamma}(y)\, p_{\theta^*}^{1 - \gamma}(y)\, dy \le 1,$$
and hence all the aforementioned methods can be used with the robustified likelihood as well. (The last inequality follows because the $\gamma$-Rényi divergence is nonnegative.)
Conditional likelihood for non-i.i.d. data.
Our presentation so far has assumed that the data are drawn i.i.d. from some distribution. However, this is not really required, and was only assumed for simplicity of presentation. All that is really needed is that we can calculate the likelihood of $\mathcal{D}_0$ conditional on $\mathcal{D}_1$. One example where this could be both useful and tractable is in models involving sampling without replacement from an urn with $N$ balls. Here $\theta$ could represent the unknown numbers of balls of different colors. Such hypergeometric sampling schemes obviously result in non-i.i.d. sampling, but conditional on one subset of the data (for example, how many red, green and blue balls were sampled from the urn in that subset), one can still evaluate the conditional likelihood of the second half of the data (and maximize it), rendering it possible to apply our universal tests and confidence sets.
7 Sequential testing, anytime $p$-values and confidence sequences
Just like the sequential likelihood ratio test by Wald (1945), the split LRT has a simple sequential extension. Similarly, the confidence set can be extended to a “confidence sequence” (Darling and Robbins (1967)). We describe the former first because researchers are usually more familiar with sequential testing.
Group sequential split LRT. Suppose the original split LRT (8) failed to reject the null (7). Then, we are allowed to do the following. Collect some more data from the same distribution, split it into two new halves, and append these to $\mathcal{D}_0$ and $\mathcal{D}_1$ respectively. Re-run the split LRT at level $\alpha$ on the enlarged $\mathcal{D}_0$ and $\mathcal{D}_1$. If the test rejects then stop, and otherwise repeat the same procedure with more new data, and so on with no limit.
Note that the above test involves recalculating $\hat\theta_0$ and $\hat\theta_1$ multiple times, but does not involve any correction for repeated testing. This result follows immediately from the following non-grouped “running MLE” version of this test.
The running MLE sequential LRT. Consider the following, more standard, sequential testing (or estimation) setup. We observe a data sequence $Y_1, Y_2, \ldots$ drawn i.i.d. from $P_{\theta^*}$. We would like to test the hypothesis (7). Let $\{\hat\theta_t\}$ and $\{\hat\theta_{0,t}\}$ be two MLE sequences calculated under the alternative and the null respectively, each based on the first $t$ observations. At any time $t$, we reject the null and stop whenever
$$R_t := \frac{\prod_{i=1}^{t} p_{\hat\theta_{i-1}}(Y_i)}{\prod_{i=1}^{t} p_{\hat\theta_{0,t}}(Y_i)} > \frac{1}{\alpha}.$$
This test is computationally expensive: we must recalculate $\hat\theta_t$ and $\hat\theta_{0,t}$ at each step. In some cases, these may be quick to calculate by warm-starting from $\hat\theta_{t-1}$ and $\hat\theta_{0,t-1}$ (for example, the updates can be done in constant time for exponential families). However, even in these cases, the denominator takes $O(t)$ time to recompute at step $t$. The following result shows that with probability at least $1 - \alpha$, this test will never stop under the null. Formally, let $\tau$ be the stopping time of the above test when the data is drawn from $P_{\theta^*}$ with $\theta^* \in \Theta_0$, which is finite only if we stop and reject the null.
The running MLE LRT has type I error at most $\alpha$, meaning that $P_{\theta^*}(\tau < \infty) \le \alpha$.
The proof involves the simple observation that under the null, $R_t$ is upper bounded by a nonnegative martingale with initial value one. To see this, first define the process $\{M_t\}$ starting with $M_0 = 1$ and
$$M_t := \prod_{i=1}^{t} \frac{p_{\hat\theta_{i-1}}(Y_i)}{p_{\theta^*}(Y_i)}.$$
Note that under the null, we have $R_t \le M_t$ because $\hat\theta_{0,t}$ and $\theta^*$ both belong to $\Theta_0$, but the former maximizes the null likelihood (denominator). Further, it is easy to verify that $\{M_t\}$ is a martingale with respect to the natural filtration $\mathcal{F}_t = \sigma(Y_1, \ldots, Y_t)$. Indeed,
$$\mathbb{E}\left[M_t \mid \mathcal{F}_{t-1}\right] = M_{t-1}\, \mathbb{E}\!\left[\frac{p_{\hat\theta_{t-1}}(Y_t)}{p_{\theta^*}(Y_t)} \,\middle|\, \mathcal{F}_{t-1}\right] = M_{t-1} \int \frac{p_{\hat\theta_{t-1}}(y)}{p_{\theta^*}(y)}\, p_{\theta^*}(y)\, dy = M_{t-1},$$
where the last equality uses the same simple integration as earlier parts of the paper, such as in (5). To complete the proof, we note that the type I error of the running MLE LRT is simply bounded as
$$P_{\theta^*}(\tau < \infty) = P_{\theta^*}\!\left(\exists t \ge 1 : R_t > \frac{1}{\alpha}\right) \le P_{\theta^*}\!\left(\exists t \ge 1 : M_t > \frac{1}{\alpha}\right) \le \alpha,$$
where the last step follows by Ville's inequality (a time-uniform version of Markov's inequality) for nonnegative supermartingales.
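A minimal sketch of the running MLE LRT for a unit-variance Gaussian with simple null $\theta = 0$ (the true mean, seed, and horizon are our own toy choices) is given below. The numerator gains one new term per step, while the denominator is recomputed from scratch, mirroring the computational point made above.

```python
import numpy as np

rng = np.random.default_rng(7)
alpha = 0.05
y = rng.normal(1.0, 1.0, size=500)   # the alternative is true: theta* = 1, null is theta = 0

log_num = 0.0   # running sum of log p_{theta_hat_{i-1}}(Y_i), up to constants
csum = 0.0      # running sum of observations, for the running MLE
tau = None
for t, yt in enumerate(y, start=1):
    theta_prev = csum / (t - 1) if t > 1 else 0.0   # running MLE from the first t-1 points
    log_num += -0.5 * (yt - theta_prev) ** 2
    csum += yt
    log_den = -0.5 * np.sum(y[:t] ** 2)             # null likelihood at theta = 0, recomputed each step
    if log_num - log_den > np.log(1.0 / alpha):     # reject and stop
        tau = t
        break
```

Under the null, the same loop would run forever without stopping with probability at least $1 - \alpha$; here, under the alternative, the log ratio drifts upward and the test stops early.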
Instead of specifying a level $\alpha$ in advance, we can also get a $p$-value that is uniformly valid over time. Specifically, both $p_t := 1/R_t$ and its running minimum $\tilde{p}_t := \min_{s \le t} 1/R_s$ may serve as $p$-values.
For any random time $\tau$, not necessarily a stopping time, $P_{\theta^*}(p_\tau \le \alpha) \le \alpha$ for any $\theta^* \in \Theta_0$.
The aforementioned property is equivalent to the statement that under the null, $P_{\theta^*}(\exists t \ge 1 : R_t \ge 1/\alpha) \le \alpha$, and its proof follows by substitution immediately from the previous argument. Naturally $\tilde{p}_t \le p_t$, but from the perspective of designing a level $\alpha$ test they are equivalent, because the first time that $p_t$ falls below $\alpha$ is also the first time that $\tilde{p}_t$ falls below $\alpha$. The term “anytime-valid” is used because, unlike typical $p$-values, these are valid at (data-dependent) stopping times, or even random times chosen post-hoc. Hence, they are robust to “peeking” and optional stopping, and can be extended indefinitely. They can also be inverted to yield confidence sequences, as we describe below.
A confidence sequence is a sequence of confidence intervals that are valid uniformly over time. In the same setup as above, but without requiring a null set $\Theta_0$, define the running MLE likelihood ratio process
$$R_t(\theta) := \prod_{i=1}^{t} \frac{p_{\hat\theta_{i-1}}(Y_i)}{p_{\theta}(Y_i)}.$$
Then, a confidence sequence for $\theta^*$ is given by
$$C_t := \left\{\theta \in \Theta : R_t(\theta) < \frac{1}{\alpha}\right\}.$$
In fact, we can also take the running intersection, $\tilde{C}_t := \bigcap_{s \le t} C_s$.
$\{C_t\}_{t \ge 1}$ is a $1 - \alpha$ confidence sequence for $\theta^*$, meaning that $P_{\theta^*}(\exists t \ge 1 : \theta^* \notin C_t) \le \alpha$.
The aforementioned property is equivalent to the requirement that at any data-dependent time $\tau$, $\theta^* \in C_\tau$ with probability at least $1 - \alpha$. The proof is straightforward: $\theta^* \notin C_t$ for some $t$ if and only if $R_t(\theta^*) \ge 1/\alpha$ for some $t$. Hence, the probability of the error event is
$$P_{\theta^*}\!\left(\exists t \ge 1 : \theta^* \notin C_t\right) = P_{\theta^*}\!\left(\exists t \ge 1 : R_t(\theta^*) \ge \frac{1}{\alpha}\right) \le \alpha,$$
where the last step uses, as before, Ville's inequality for the martingale $\{R_t(\theta^*)\}$.
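A grid-based sketch of this confidence sequence for a unit-variance Gaussian mean (our own illustrative setup; the grid and parameter values are arbitrary) is below. The running intersection is a pointwise logical AND over time.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n, theta_star = 0.1, 300, 0.5
y = rng.normal(theta_star, 1.0, size=n)
grid = np.linspace(-2.0, 3.0, 501)

# Numerator: sum_{i<=t} log p_{theta_hat_{i-1}}(Y_i), with theta_hat_0 = 0.
csum = np.cumsum(y)
theta_prev = np.concatenate(([0.0], csum[:-1] / np.arange(1, n)))
log_num = np.cumsum(-0.5 * (y - theta_prev) ** 2)

# Denominator: log L_t(theta) = -0.5 * sum_{i<=t} (Y_i - theta)^2, for every grid theta.
sq = np.cumsum(y ** 2)
t_idx = np.arange(1, n + 1)
logL = -0.5 * (sq[:, None] - 2.0 * csum[:, None] * grid[None, :]
               + t_idx[:, None] * grid[None, :] ** 2)

# C_t on the grid, then the running intersection over time.
in_set = (log_num[:, None] - logL) < np.log(1.0 / alpha)
running = np.logical_and.accumulate(in_set, axis=0)
final_set = grid[running[-1]]
```

By construction the running intersection can only shrink over time, and with probability at least $1 - \alpha$ it contains $\theta^*$ at every time simultaneously.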
It is worth remarking that confidence sequences are dual to anytime $p$-values, just like confidence intervals are dual to standard $p$-values, in the sense that a confidence sequence can be formed by inverting a family of level $\alpha$ sequential tests (each testing a different point $\theta_0$ in the space), and a level $\alpha$ sequential test for a composite null set $\Theta_0$ can be obtained by checking whether the confidence sequence intersects $\Theta_0$.
In fact, our constructions of the running MLE LRT and the confidence sequence $C_t$ (without running minimum and intersection) obey the same property: the test rejects at time $t$ only if $C_t \cap \Theta_0 = \emptyset$, and the reverse implication follows if $\Theta_0$ is closed. To see the forward implication, assume for the purpose of contradiction that there exists some element $\theta_0 \in C_t \cap \Theta_0$. Since the test rejects at time $t$, we have $R_t > 1/\alpha$. Further, since $\theta_0 \in \Theta_0$, we must have $R_t(\theta_0) \ge R_t > 1/\alpha$. This last condition can be restated, by definition, as $\theta_0 \notin C_t$, which means that $\theta_0 \notin C_t \cap \Theta_0$, a contradiction.
Instead of a level $\alpha$ test, it is also possible to get an anytime $p$-value from a family of confidence sequences $\{C_t^{(\alpha)}\}$ constructed at different levels $\alpha$. Specifically, we can define the $p$-value at time $t$ as the smallest $\alpha$ for which the confidence set $C_t^{(\alpha)}$ no longer intersects the null set $\Theta_0$.
All the extensions from Section 6 extend immediately to the sequential setting. One can handle nuisance parameters using profile likelihoods; this for example leads to sequential $t$-tests (for the Gaussian family, with the variance as a nuisance parameter), which also yield confidence sequences for the Gaussian mean with unknown variance. Non-i.i.d. data, such as in sampling without replacement, can be handled using conditional likelihoods, and robustness can be increased with power likelihoods. In these situations, the corresponding underlying process will not be a martingale, but a supermartingale.
Such confidence sequences have been developed under very general nonparametric, multivariate, matrix and continuous time settings using generalizations of the aforementioned supermartingale technique; see Howard et al. (2018a,b); Howard and Ramdas (2019). The connections (on the testing front) between anytime-valid $p$-values, safe tests, peeking, and optional stopping and continuation have been explored recently by Johari et al. (2017); Grünwald et al. (2019); Shafer et al. (2011). However, in our opinion the confidence sequence approach is more elegant (when possible) since it avoids specification of a null set and can be used to test different composite nulls, and this entire line of research owes its origins to work done over 50 years ago by Robbins, Darling, Lai and Siegmund (Darling and Robbins, 1967; Robbins and Siegmund, 1972; Robbins, 1970; Robbins and Siegmund, 1974; Lai, 1976a,b).
The potential practical applications are numerous, and much work remains to be done to understand the efficiency of these universal sequential tests and confidence sequences.
To summarize, inference based on the split likelihood ratio statistic leads to simple tests and confidence sets with finite sample guarantees. As we mentioned earlier, the methods are probably most useful in problems where standard asymptotic methods are difficult or impossible to apply.
Going forward, it would be very useful to have detailed simulation studies in a variety of models to study the power of the test and the size of the confidence sets, accompanied by studying its efficiency in special cases. We do not expect the test to be rate optimal in all cases, but it might have analogous properties to the generalized LRT. It would also be interesting to extend these methods (like the profile likelihood variant) to semiparametric problems where there is a finite dimensional parameter of interest and an infinite dimensional nuisance parameter. One possibility here is to use sieves. However, there are many technical difficulties that would need to be addressed.
- Axelrod, B., Diakonikolas, I., Sidiropoulos, A., Stewart, A. and Valiant, G. (2019). A polynomial time algorithm for log-concave maximum likelihood via locally exponential families. In Advances in Neural Information Processing Systems 32, pp. 7721–7733.
- Balakrishnan, S., Wainwright, M. J. and Yu, B. (2017). Statistical guarantees for the EM algorithm: from population to sample-based analysis. The Annals of Statistics 45(1), pp. 77–120.
- Chakravarti, P., Balakrishnan, S. and Wasserman, L. (2019). Gaussian mixture clustering using relative tests of fit. arXiv preprint arXiv:1910.02566.
- Chen, J. and Li, P. (2009). Hypothesis test for normal mixture models: the EM approach. The Annals of Statistics 37(5A), pp. 2523–2542.
- Cule, M., Samworth, R. and Stewart, M. (2010). Maximum likelihood estimation of a multi-dimensional log-concave density. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72(5), pp. 545–607.
- Dacunha-Castelle, D. and Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. ESAIM: Probability and Statistics 1, pp. 285–317.
- Darling, D. A. and Robbins, H. (1967). Confidence sequences for mean, variance, and median. Proceedings of the National Academy of Sciences of the United States of America 58(1), pp. 66.
- Grünwald, P., de Heide, R. and Koolen, W. (2019). Safe testing. arXiv preprint arXiv:1906.07801.
- Grünwald, P. and van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12(4), pp. 1069–1103.
- Guo, F. R. and Richardson, T. S. (2019). On testing marginal versus conditional independence. arXiv preprint arXiv:1906.01850.
- Holmes, C. C. and Walker, S. G. (2017). Assigning a value to a power likelihood in a general Bayesian model. Biometrika 104(2), pp. 497–503.
- Howard, S. R. and Ramdas, A. (2019). Sequential estimation of quantiles with applications to A/B-testing and best-arm identification. arXiv preprint arXiv:1906.09712.
- Howard, S. R., Ramdas, A., McAuliffe, J. and Sekhon, J. (2018a). Exponential line-crossing inequalities. arXiv preprint arXiv:1808.03204.
- Howard, S. R., Ramdas, A., McAuliffe, J. and Sekhon, J. (2018b). Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint.
- Jin, C., Zhang, Y., Balakrishnan, S., Wainwright, M. J. and Jordan, M. I. (2016). Local maxima in the likelihood of Gaussian mixture models: structural results and algorithmic consequences. In Advances in Neural Information Processing Systems 29.
- Johari, R., Koomen, P., Pekelis, L. and Walsh, D. (2017). Peeking at A/B tests: why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1517–1525.
- Lai, T. L. (1976a). On confidence sequences. The Annals of Statistics 4(2), pp. 265–280.
- Lai, T. L. (1976b). Boundary crossing probabilities for sample sums and confidence sequences. The Annals of Probability 4(2), pp. 299–312.
- McLachlan, G. J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Journal of the Royal Statistical Society: Series C (Applied Statistics) 36(3), pp. 318–324.
- Miller, J. W. and Dunson, D. B. (2019). Robust Bayesian inference via coarsening. Journal of the American Statistical Association 114(527), pp. 1113–1125.
- Robbins, H. and Siegmund, D. (1972). A class of stopping rules for testing parametric hypotheses. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability.
- Robbins, H. and Siegmund, D. (1974). The expected sample size of some tests of power one. The Annals of Statistics 2(3), pp. 415–436.
- Robbins, H. (1970). Statistical methods related to the law of the iterated logarithm. The Annals of Mathematical Statistics 41(5), pp. 1397–1409.
- Royall, R. and Tsou, T.-S. (2003). Interpreting statistical evidence by using imperfect models: robust adjusted likelihood functions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), pp. 391–404.
- Shafer, G., Shen, A., Vereshchagin, N. and Vovk, V. (2011). Test martingales, Bayes factors and p-values. Statistical Science 26(1), pp. 84–101.
- Shen, X. and Wong, W. H. (1994). Convergence rate of sieve estimates. The Annals of Statistics, pp. 580–615.
- Silverman, B. W. (2018). Density Estimation for Statistics and Data Analysis. Routledge.
- van der Vaart, A. W. (2000). Asymptotic Statistics. Vol. 3, Cambridge University Press.
- Wald, A. (1945). Sequential tests of statistical hypotheses. Annals of Mathematical Statistics 16(2), pp. 117–186.
- Xu, J., Hsu, D. and Maleki, A. (2016). Global analysis of expectation maximization for mixtures of two Gaussians. In Advances in Neural Information Processing Systems 29.