
# Generalizations related to hypothesis testing with the Posterior distribution of the Likelihood Ratio.

I. Smith Laboratoire des Sciences du Climat et de l’Environnement ; IPSL-CNRS, France.
Université de Nice Sophia-Antipolis, CNRS, Observatoire de la Côte d’Azur, France.
and  A. Ferrari Université de Nice Sophia-Antipolis, CNRS, Observatoire de la Côte d’Azur, France
###### Abstract.

The Posterior distribution of the Likelihood Ratio (PLR) was proposed by Dempster in 1974 for significance testing in the simple vs composite hypotheses case. In this case, classical frequentist and Bayesian hypotheses tests are irreconcilable, as emphasized by Lindley’s paradox, by Berger & Sellke in 1987 and by many others. However, Dempster showed that the PLR (with inner threshold 1) is equal to the frequentist p-value in the simple Gaussian case. In 1997, Aitkin extended this result by adding a nuisance parameter and showing its asymptotic validity under more general distributions. Here we extend the reconciliation between the PLR and a frequentist p-value for a finite sample, through a framework analogous to that of Stein’s theorem, in which a credible (Bayesian) domain is equal to a confidence (frequentist) domain.

This general reconciliation result only concerns simple vs composite hypotheses testing. The measures proposed by Aitkin in 2010 and Evans in 1997 have interesting properties and extend Dempster’s PLR but only by adding a nuisance parameter. Here we propose two extensions of the PLR concept to the general composite vs composite hypotheses test. The first extension can be defined for improper priors as soon as the posterior is proper. The second extension appears from a new Bayesian-type Neyman-Pearson lemma and emphasizes, from a Bayesian perspective, the role of the LR as a discrepancy variable for hypothesis testing.

###### Key words and phrases:
hypothesis testing, PLR, p-value, likelihood ratio, frequentist and Bayesian reconciliation, Lindley’s paradox, invariance, Neyman-Pearson lemma
Part of this work was published in smith14

## 1. Introduction

### 1.1. Classical hypotheses test methodologies

Simple versus composite hypotheses testing is a general statistical issue in parametric modeling. It consists for a given observed dataset in choosing among the hypotheses

$$\mathrm{H}_0: \theta = \theta_0 \qquad \mathrm{H}_1: \theta \in \Theta_1 \tag{1}$$

where the distribution of $x$ is characterized by the underlying unknown parameter $\theta$. Under the alternative hypothesis $\mathrm{H}_1$, $\theta$ takes a value different from the point $\theta_0$, and the uncertainty of $\theta$ is described by a prior probability density function $\pi_1(\theta)$ which is positive only for $\theta \in \Theta_1$. We assume that the data model $p(x|\theta)$ has the same expression under $\mathrm{H}_0$ and $\mathrm{H}_1$.

To choose between $\mathrm{H}_0$ and $\mathrm{H}_1$, a test statistic $T(x)$ (such as the Generalized Likelihood Ratio) is generally compared to a threshold, and one decides to choose $\mathrm{H}_1$ if $T(x)$ is greater than this threshold. If $\mathrm{H}_1$ is chosen whereas the true underlying $\theta$ was equal to $\theta_0$, a type I error is made in the decision. Under the classical Neyman paradigm (see neyman33; neyman77), the threshold is chosen so that the probability of the type I error is at most some fixed level, typically a 5% error rate. Instead of inverting this function, a p-value can be defined to serve as a test statistic that is directly compared to the 5% level (lehmann05):

$$\mathrm{pval}(T(x_0)) = \Pr\big(T(x_0) < T(x) \mid \theta_0\big) \tag{2}$$

where $x_0$ is the observed dataset and $x$ the variable of integration. Note that with this notation, $\mathrm{H}_0$ is rejected when is greater than some threshold.

On the Bayesian side, the test statistic classically used (robert07) is the Bayes Factor (BF) defined by

$$\mathrm{BF}(x) = \frac{p(x|\theta_0)}{\int d\theta\; p(x|\theta)\,\pi_1(\theta)}$$

Making a binary decision consists of choosing $\mathrm{H}_0$ if $\mathrm{BF}(x)$ is greater than some threshold, and the threshold is in general chosen by a direct interpretation of the BF. Jeffreys’s scale, for example, states that if the observed BF is between 10 and 100 there is strong evidence in favor of $\mathrm{H}_0$. The mere posterior probability of a hypothesis may also be considered by itself.

A practical issue of the BF in the simple vs composite hypotheses test is that it is defined only up to a multiplicative constant if the prior $\pi_1$ is improper, even though the posterior distribution is proper. (The prior $\pi_1$ is called improper if its integral over $\Theta_1$ is infinite, which occurs for example if $\pi_1$ is constant over an unbounded domain.) Partial BFs address this issue by using part of the data to update the prior into a proper posterior, and then using this posterior as the prior for the rest of the data. The most simply defined Partial BF is the Fractional BF (FBF) proposed by ohagan95.

A related and more fundamental issue is Lindley’s paradox, initially studied by jeffreys61 and called a paradox by lindley57. It shows, among other things, that when testing a simple vs a composite hypothesis, the null hypothesis $\mathrm{H}_0$ is too strongly favoured against $\mathrm{H}_1$ for a natural diffuse prior under $\mathrm{H}_1$. More precisely, for example in the test of the mean of a Gaussian likelihood, the p-value defines the uniformly most powerful test, which is a very strong optimality property, even according to at least part of the Bayesian community. However, for a fixed prior and a dataset that adjusts so that the associated classical p-value remains fixed (so that the evidence for $\mathrm{H}_0$ should not change), the posterior probability of $\mathrm{H}_0$ tends to 1 as the sample size increases. This issue, intensively discussed and developed (see tsao06 for a fairly recent study), is regarded as a genuine problem by a large part of the community. Unlike the BF, other tests such as the FBF or the bernardo10 test do not suffer from this problem in Lindley’s frame. Other ideas have been developed which prevent Lindley’s frame from occurring, avoiding trouble for the BF. berger87, for example, argue that testing a simple hypothesis is an unreasonable question. Some other references will be given in Section 2.1.
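The behaviour described above can be illustrated numerically. The following is a minimal sketch under assumptions that are not taken from the paper: observations $x_i \sim N(\theta, 1)$, $\mathrm{H}_0: \theta = 0$, a $N(0, \tau^2)$ prior under $\mathrm{H}_1$, and prior probability 1/2 for each hypothesis. The function name `posterior_prob_h0` and its parameters are hypothetical. The sample mean is held at $z/\sqrt{n}$, so that the two-sided p-value stays fixed while $n$ grows.

```python
import math

def posterior_prob_h0(z, n, tau=1.0, prior_h0=0.5):
    """Pr(H0 | x) for x_i ~ N(theta, 1), H0: theta = 0, H1: theta ~ N(0, tau^2).

    The sample mean is fixed at z / sqrt(n), i.e. the classical z-score
    (and hence the two-sided p-value) does not change with n.
    All modelling choices here are illustrative assumptions.
    """
    r = n * tau ** 2
    # BF = p(xbar | theta0) / marginal density of xbar under H1,
    # where xbar ~ N(0, 1/n) under H0 and xbar ~ N(0, tau^2 + 1/n) under H1.
    bf = math.sqrt(1.0 + r) * math.exp(-0.5 * z ** 2 * r / (1.0 + r))
    post_odds = bf * prior_h0 / (1.0 - prior_h0)
    return post_odds / (1.0 + post_odds)

# z = 1.96 corresponds to a two-sided p-value of about 0.05 for every n.
for n in (10, 10**2, 10**4, 10**6, 10**8):
    print(n, round(posterior_prob_h0(1.96, n), 4))
```

With the p-value fixed near 5%, the printed posterior probability of $\mathrm{H}_0$ increases with $n$ and approaches 1, which is the tension discussed in the paragraph above.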

Among the many frequentist and Bayesian p-values (several are listed by robins00), the next most classical Bayesian-type hypotheses test statistic is the posterior predictive p-value, highlighted by meng94. Unlike the BF, which only integrates over the parameter space $\Theta$, the posterior predictive p-value integrates over the data space $\mathcal{X}$, like frequentist p-values. But unlike the frequentist p-value, which integrates under the frequentist likelihood $p(x|\theta_0)$, it integrates under the predictive likelihood conditioned on the observed dataset $x_0$. In a frequentist p-value, only a statistic (i.e. a function of $x$ only) can define the domain of integration. By contrast, in the posterior predictive p-value, a discrepancy variable (a function of both $x$ and $\theta$) can be used to define the domain of integration. Note that the choice of the discrepancy variable remains an open issue.

Although somewhat less classical, the approach of evans97 needs to be introduced because the tool and some of its properties are interesting and closely related to the ones derived in this paper. In the simple vs composite test case presented up to now, the tool proposed by evans97 and the ones studied in this paper are even mathematically equal. But the tool proposed by evans97 is defined to test more generally $\mathrm{H}_0: \Psi(\theta) = \psi_0$ for a parameter of interest $\Psi(\theta)$. The test statistic consists of measuring the Observed Relative Surprise (ORS) related to the hypotheses by computing:

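$$\mathrm{ORS}(x) = \Pr\left( \frac{\pi_\Psi(\Psi(\theta)|x)}{\pi_\Psi(\Psi(\theta))} \ge \frac{\pi_\Psi(\psi_0|x)}{\pi_\Psi(\psi_0)} \,\middle|\, x \right) \tag{3}$$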

The relative belief ratio of $\psi_0$, defined by $\pi_\Psi(\psi_0|x)/\pi_\Psi(\psi_0)$, measures the change in belief in $\psi_0$ being the true value of $\Psi(\theta)$ from a priori to a posteriori. So if this ratio is greater than 1, we have evidence in favor of $\mathrm{H}_0$. Relative belief ratios are discussed in baskurt13, where this ratio is presented as the evidence for or against $\mathrm{H}_0$ and (3) is presented as a measure of the reliability of this evidence. This leads to a possible resolution of Lindley’s paradox, as the relative belief ratio can be large and the ORS small without contradiction. See Example 4 of baskurt13 and note that evans97 shows that the ORS converges to the classical p-value as the prior becomes more diffuse in this example.

### 1.2. Posterior distribution of the Likelihood Ratio (PLR)

Let us focus again on the simple vs composite hypothesis test. Contrary to the posterior predictive p-value, the Posterior distribution of the Likelihood Ratio (PLR) does not integrate over unobserved data, but only over $\theta$. It still conditions upon the only observed variable, namely $x$, as the BF does, but on a domain defined from a discrepancy variable, like the posterior predictive p-value. This statistic, proposed by dempster74, is defined by

$$\mathrm{PLR}(x,\zeta) = \Pr\big( \mathrm{LR}(x,\theta) \le \zeta \mid x \big) \tag{4}$$

where $\mathrm{LR}(x,\theta)$ is the Likelihood Ratio

$$\mathrm{LR}(x,\theta) = \frac{p(x|\theta_0)}{p(x|\theta)} \qquad \theta \in \Theta_1$$

Since is random, the deterministic function evaluated at the random variable becomes naturally random with some posterior distribution characterized by its cumulative distribution, the PLR. As emphasized by birnbaum62, dempster74 and royall97, the threshold which compares the original likelihoods under and under is directly interpretable and can be chosen the same way an error level is chosen in the Neyman-Pearson paradigm. “” for example reads “The probability that the likelihood of is more than the likelihood of is 0.1.”.

The PLR can therefore be used for a binary decision, by fixing $\zeta$ and deciding to reject $\mathrm{H}_0$ if $\mathrm{PLR}(x,\zeta)$ is greater than, say, 0.9. One can check whether the binary decision is sensitive to the choice of both thresholds by running the test for several thresholds and checking whether the decision changes. In the extreme case, note that thanks to the simple definition of the PLR, one can display $\mathrm{PLR}(x,\zeta)$ as a function of $\zeta$ to get a broad view. The range of $\zeta$ over which $\mathrm{PLR}(x,\zeta)$ grows typically from 0.2 to 0.8 indicates whether or not the decision for $\mathrm{H}_0$ or $\mathrm{H}_1$ is clear. As soon as the posterior can be sampled, these computations and graphs are very easy to produce, as will be explained later.

The PLR was first proposed by dempster74; dempster97, then studied especially by aitkin97 and aitkin10, and also used and analyzed by aitkin05; aitkin09. As mentioned in the previous subsection, it turns out that the PLR is also closely related to the ORS proposed by evans97, which generalizes the PLR. The PLR is also closely related to the e-value associated with the Full Bayesian Significance Test (FBST) from pereira99, slightly revisited by borges07, which then in a sense generalizes the PLR by adding a reference distribution on $\Theta$, and by systematically dealing with the case where the null hypothesis domain has a dimension lower than that of $\Theta$ but is not necessarily restricted to the point $\theta_0$. We do not list the results found by these different analyses, apart from some specifically mentioned ones.

The PLR turns out to be a natural Bayesian measure of evidence of the studied hypotheses since it involves only the posterior distribution of (no integral over ) and the likelihood, claimed by birnbaum62, royall97 and others, to be the only tool that can measure evidence. Unlike the BF, the PLR is well defined for an improper prior as soon as the posterior is proper, and is not subject to Lindley’s paradox. It is also invariant under any isomorphic transformation of the space and any transformation of the space, as a consequence of being a mere function of the likelihood. These last properties were emphasized for example for the e-value associated to the FBST.

The PLR also constitutes a natural alternative to the BF in several respects. To start with, the PLR first compares (compares $p(x|\theta_0)$ and $p(x|\theta)$) and then integrates, whereas the BF first integrates and then compares (compares $p(x|\theta_0)$ and $\int d\theta\; p(x|\theta)\,\pi_1(\theta)$). Second, newton94 and many others show that if the prior under $\mathrm{H}_1$ is proper, the BF is simply the posterior mean of the LR, i.e. the mean of the distribution described by the PLR. (Alternatively, note that if we had defined the BF and LR with the alternative hypothesis in the numerator of these fractions, the BF would have been the prior mean of the LR.) However, a point estimate is in general not given alone but accompanied by an uncertainty indicator. smith10d show that the posterior mean of the LR raised to some power is equal to the FBF introduced previously; the mean of the PLR is given by the BF and its variance is easily related to the FBF. However, smith10e shows that the Generalized Likelihood Ratio bounds the support (values of for which ) of the PLR and that at this lower bound the PLR in general starts with an infinite derivative. In addition to this theoretical result, numerical examples also indicate that the posterior density function of the LR is in general highly asymmetric. Therefore, neither the BF (a point estimate of the LR) nor any standard centered credible interval appears to be a relevant inference about the LR seen as a random variable. Instead, in the same way as the BF is to be thresholded, the actual information about the LR which seems to be relevant, and invariant under the transformation , is its cumulative posterior distribution, which is precisely the PLR.

In practice, the PLR can be computed straightforwardly as soon as the posterior distribution is sampled. It suffices to obtain an almost i.i.d. chain from the posterior distribution with a Markov chain Monte Carlo (MCMC) algorithm and to compute $\mathrm{LR}(x,\theta)$ for each sample. The resulting histogram sketches the posterior density of the LR, and the plot of the empirical cumulative distribution of the LR chain sketches the PLR as a function of $\zeta$.
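The computation just described can be sketched as follows. This is only an illustration under assumptions that are not taken from the paper: a Gaussian model with known standard deviation `sigma`, a flat prior on the mean (so the posterior is $N(\bar{x}, \sigma^2/n)$), and direct i.i.d. posterior draws in place of an MCMC chain. The function names `gaussian_lr_samples` and `plr_from_samples` are hypothetical.

```python
import math
import random

def gaussian_lr_samples(xbar, n, sigma, theta0, n_draws, rng):
    """Draw theta from the flat-prior posterior N(xbar, sigma^2 / n) and
    return LR(x, theta) = p(x|theta0) / p(x|theta) for each draw."""
    post_sd = sigma / math.sqrt(n)
    lr = []
    for _ in range(n_draws):
        theta = rng.gauss(xbar, post_sd)
        # Log-likelihoods up to a common additive constant,
        # which cancels in the ratio.
        ll0 = -0.5 * n * (xbar - theta0) ** 2 / sigma ** 2
        ll = -0.5 * n * (xbar - theta) ** 2 / sigma ** 2
        lr.append(math.exp(ll0 - ll))
    return lr

def plr_from_samples(lr_samples, zeta):
    """Empirical PLR(x, zeta): fraction of posterior draws with LR <= zeta."""
    return sum(lr <= zeta for lr in lr_samples) / len(lr_samples)

# Illustrative numbers only (assumed, not from the paper).
rng = random.Random(0)
lr_chain = gaussian_lr_samples(xbar=0.5, n=16, sigma=1.0, theta0=0.0,
                               n_draws=20000, rng=rng)
for zeta in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(zeta, plr_from_samples(lr_chain, zeta))
```

Evaluating `plr_from_samples` on a grid of thresholds gives the empirical cumulative distribution of the LR chain. With an actual MCMC sampler, only the way `theta` is drawn would change.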

The PLR has been realistically and thoroughly applied by smith10e to the detection of extra-solar planets from images acquired with the dedicated instrument SPHERE mounted on the Very Large Telescope. At that time, only very finely simulated images were available. The PLR was applied to two simulated datasets, one in which no extra-solar planet is present (dataset simulated under $\mathrm{H}_0$) and the other in which an extra-solar planet is present ($\mathrm{H}_1$ dataset). Although the extra-solar planet is very dark ( times less bright than the star it surrounds), close to the star (angular distance in the sky of arcseconds i.e. degrees), and although only images were used, thanks to the quality of the optical instruments and of the statistical model the detection and non-detection were evident, with for the dataset under and for the dataset under . As studied by smith10e, the statistical model and resulting method are very satisfactory compared to classical methods.

Despite its potential interest, the PLR has not been extensively studied up to now. This paper aims to contribute some new results to this investigation.

In the simple vs composite hypotheses test case, it turns out that the PLR plays a strong role in understanding the possible reconciliation between frequentist and Bayesian hypothesis testing. The PLR with inner threshold $\zeta = 1$ is simply equal to some frequentist p-value for some “likelihood - prior - hypotheses” combinations. dempster74 and aitkin97 first noticed and highlighted this equivalence when testing the mean of a Gaussian likelihood with a uniform prior.

In Section 2, we extend the conditions of this equivalence result under a frame analogous to the one used to reconcile confidence and credible domains. Subsection 2.1 summarizes the long quest for reconciliation between frequentist and Bayesian hypotheses tests, Subsection 2.2 proves and discusses the reconciliation reached between the PLR and some frequentist p-value in such an invariant frame, Subsection 2.3 gives examples and perspectives, and Subsection 2.4 discusses the connection between this reconciliation result and the one obtained between (frequentist) confidence domains and (Bayesian) credible regions.

aitkin97 and aitkin10 extended the PLR definition to a hypotheses test frame identical to the one presented at the end of Subsection 1.1, namely $\mathrm{H}_0: \Psi(\theta) = \psi_0$, also considered by evans97 and others. However, the PLR has not yet been generalized to the general composite vs composite hypotheses test. The generalization is somewhat unnatural for a frequentist p-value because, for a simple hypothesis, the p-value is a frequentist probability conditioned on the fixed parameter $\theta_0$ (see equation 2), whereas a conditional probability cannot be defined on a composite set $\Theta_0$ if no probability distribution over $\Theta_0$ is used. By contrast, the PLR is a probability conditioned upon the observed dataset $x$, and $x$ naturally remains fixed under a composite hypothesis. Therefore, the transition from a simple to a composite null hypothesis does not raise immediate obstacles for the PLR. However, a joint measure on the parameter spaces of both hypotheses is still required.

Section 3 proposes and motivates two generalizations of the PLR. The mathematical expressions of the two extensions are given and rephrased in Subsection 3.1. The first extension in particular enables the use of improper priors as soon as the posterior is proper. It can therefore be used in Subsection 3.2 for the detection of precipitation change, where an almost improper prior is to be used but leads to a proper posterior. The second extension, made of two symmetrical probabilities, appears in a Bayesian version of the Neyman-Pearson lemma. As detailed in Subsection 3.3, the two joint measures, which are associated with no specific discrepancy variable, lead through the lemma to the LR as the discrepancy variable.

A concluding discussion is given in Section 4. The appendices essentially present the proofs of the mathematical results.

## 2. Equivalence between the PLR and a frequentist p-value

### 2.1. Previous tentative reconciliations of frequentist and Bayesian tests

As introduced in Section 1.1, Lindley’s paradox presents a frame where the posterior probability of $\mathrm{H}_0$ (often thought of as the Bayesian measure of evidence) may be expected to be equal to the frequentist p-value, but happens not to be. Also, the BF is not satisfactory in the frame “point null hypothesis $\mathrm{H}_0$ and diffuse prior $\pi_1$”. This highlights the need for other Bayesian-type hypotheses tests, but also raises more generally the question of reconciliation between frequentist and Bayesian hypotheses tests.

The conditions under which frequentist (neyman77) and Bayesian (jeffreys61) answers agree are always of interest, in order to understand the interpretation of the procedures and the limits of the two paradigms, which are to some extent defined by what they are not.

A first approach to see when frequentist and Bayesian hypotheses tests could be unified consists of analyzing, for different hypotheses, likelihoods and priors, when the classical p-value and the posterior probability of $\mathrm{H}_0$ are equal. These two concepts are to be compared because they both seem to handle only $\mathrm{H}_0$ and $x$ in very simple ways, one from the frequentist and the other from the Bayesian perspective. (Note however that $\mathrm{H}_1$ is implicitly taken into account through the marginal distribution of $x$ in the posterior probability of $\mathrm{H}_0$.) It turns out that, unlike for a composite null hypothesis (e.g. casella87), for a point null hypothesis Lindley’s paradox always seems to hold. berger87b in particular show that among very broad classes of priors always holds for . Also see the extensive list of references included. oh99 follows this analysis by studying the effect of the choice of .

Another approach consists of modifying the standard frequentist procedure and/or the standard Bayesian hypotheses test procedure, while still relying on the p-value and the posterior probability of $\mathrm{H}_0$, to see if they can then be made equivalent. berger87, for example, study “precise” (concentrated) but not exactly “point” hypotheses; berger94 use frequentist p-values computed from a likelihood conditioned upon a set in which the observed dataset lies, not on the dataset itself, and define a non-decision domain in the BF test procedure. sellke01 advocate calibrating (rescaling) the frequentist p-value to relate this new statistic to other test statistics.

As already mentioned in Section 1.1, one can also try to relate the p-value to Bayesian-type statistics entirely different from the BF, to see when frequentist and Bayesian hypotheses tests can be made equivalent. In particular, when dempster74 proposed to use the PLR, he also mentioned that when testing the mean of a normal distribution, the PLR is equal to the classical frequentist p-value when computed for a uniform prior and with inner parameter $\zeta = 1$. This fundamental result was again emphasized by aitkin97 and dempster97.

aitkin97 asymptotically extended this result to any regular distribution, making use of the asymptotic convergence of a regular distribution towards a normal distribution. For any regular continuous distribution and a smooth prior, the PLR, with , tends asymptotically to the classical p-value. Also, with a nuisance parameter and still calling the tested parameter, he defines LR by , in which case under the same conditions as in the previous case the PLR is equal to a p-value. For a normal distribution, when testing the mean and considering the variance as a nuisance parameter, the result is also true for a finite sample.

### 2.2. New reconciliation result

The sets of conditions found by dempster74 and aitkin97 under which the PLR (with $\zeta = 1$) is equal to a p-value are directly related to the test of the mean of a normal distribution under a uniform prior. The next subsection generalizes this exact finite-sample result under the frame of statistical invariance. As will be discussed at the end of the section, although the technical conditions derived here may be relaxed, it may be difficult to find, at least within the current statistical frame, a fundamentally more general set of conditions under which an equality between the PLR and a p-value holds.

As presented in current classical textbooks in Bayesian statistics (berger85, robert07), invariance in statistics arises from the invariant Haar measure defined on some topological group. Throughout this subsection and the related appendices, we will use the notions and results synthesized by nachbin65 and eaton89. The tools necessary to understand the result are introduced in the appendix 1.

In this frame, the PLR (given by an integral over the parameter space $\Theta$) can be reexpressed as an integral over the sample space $\mathcal{X}$, equal to a p-value for $\zeta = 1$. In this subsection, $x$ and $\theta$ denote random variables or variables of integration according to the context.

First, for clarity, we give the equivalence between the PLR and a frequentist integral under the assumption that the sample space $\mathcal{X}$, the parameter space $\Theta$ and the transformation group are isomorphic.

###### Theorem 1.

Call a family of probability densities with respect to the Lebesgue measure on , and call a group acting on . Assume that is invariant under the action of the group on and note the induced action of the element on the element . Call and respectively a right and left Haar measures of and assume that

1. , and are isomorphic.

2. The prior measure is the measure induced by on .

3. The measure induced by on is absolutely continuous with respect to the Lebesgue measure. Call the corresponding density.

4. The marginal density of is finite, so that the posterior measure on , classically defined by the equation (23), defines the posterior probability .

Then, the PLR defined by equation (4) can be reexpressed for any $\zeta$ as the frequentist integral:

$$\mathrm{PLR}(x_0,\zeta) = \Pr\big( p(x_0|\theta_0)\,\pi_l(x_0) \le \zeta\, p(x|\theta_0)\,\pi_l(x) \mid \theta_0 \big) \tag{5}$$

where $x_0$ is the observed data and $\theta_0$ the parameter value under the null hypothesis.

A more general theorem (Theorem 2) derived in a frame which avoids the Lebesgue assumption and may involve more technical conditions is proved in Appendix 2. Theorem 1 is a consequence of Theorem 2 and its proof is given in Appendix 3.

The assumption that $\mathcal{X}$ and $\Theta$ are isomorphic is easily relaxed by replacing the sample space by the space of a sufficient statistic. Recall that if $x$ is a random variable whose probability distribution is parametrized by $\theta$, $S(x)$ is called a sufficient statistic of $x$ if the probability distribution of $x$ conditioned upon the random variable $S(x)$ does not depend on $\theta$. Note that according to the darmois35 theorem, among families of probability distributions whose domains do not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as the sample size increases.

The statement of Theorem 2 is simply extended by replacing $x$ by a sufficient statistic $S(x)$ in the assumptions and by replacing, in the frequentist integral, the probability density of $x$ by that of $S(x)$:

###### Corollary 1.

Call a family of probability densities with respect to any measure on . Call , for , a sufficient statistic of and the family of probability densities of with respect to the Lebesgue measure on . Call a group acting on ). Assume that is invariant under the action of the group on and note the induced action of the element on the element . Call and respectively any right and left Haar measures of . Assume that

1. , and are isomorphic.

2. The prior measure is the measure induced by on .

3. The measure induced by on is absolutely continuous with respect to the Lebesgue measure. Call the corresponding density.

4. The marginal density of is finite, so that the posterior measure on , classically defined by the equation (23), defines the posterior probability .

Then, the PLR defined by the equation (4) can be reexpressed, with , and , as the frequentist integral:

$$\mathrm{PLR}(x_0,\zeta) = \Pr\big( p_S(S(x_0)|\theta_0)\,\pi_l(S(x_0)) \le \zeta\, p_S(S(x)|\theta_0)\,\pi_l(S(x)) \mid \theta_0 \big) \tag{6}$$

where $x_0$ is the observed data and $\theta_0$ the parameter value under the null hypothesis.

The proof follows the proof of the theorem 1 in the Appendix 3.

By setting $\zeta = 1$ in this result, the PLR with $\zeta = 1$ is readily shown to be equal to a frequentist p-value, where the test statistic is a weighted marginal likelihood of the sufficient statistic $S(x)$.

###### Corollary 2.

Under the assumptions of Corollary 1, the PLR with inner threshold $\zeta = 1$ is equal to a p-value:

$$\mathrm{PLR}(x_0, 1) = \mathrm{pval}(T(x_0)) \tag{7}$$

with the test statistic

$$T(x) = p_S(S(x)|\theta_0)\,\pi_l(S(x)) \tag{8}$$

Corollary 2 can be reexpressed as the fact that, under the invariance assumptions, rejecting when is equivalent to rejecting when where the p-value is based on the idea of rejecting $\mathrm{H}_0$ when $T$ defined in equation (8) (the observed weighted likelihood under $\mathrm{H}_0$) is not large enough.

### 2.3. Examples and perspective

dempster74 has shown that the PLR is equal to the classical p-value associated to the test statistic when testing the mean of a normal family for with a uniform prior on . The corollary 2 extends this result since the normal family is one of the distributions invariant under translation when testing the location parameter, the uniform prior (i.e. Lebesgue measure) is the measure induced from the right Haar measure associated to translation, and the test statistic is a monotone function of since the translation (sum) is commutative, so that for all and so is constant.

The result proved here concerns all distributions invariant under some group transformation, under the assumptions that there exists a sufficient statistic and that the sets , and are isomorphic. Assume for example that the likelihood has the typical form . The likelihood is invariant under the scale transformation and the actions on and are identical. Note that with is a pivotal quantity, meaning that its distribution does not depend on . The induced prior measure is classically given by . Since the multiplication transformation is commutative, the modulus is uniformly equal to 1, so that the test statistic that appears in the p-value (corollary 2) is simply where is the value of the parameter under . For a more general insight into the relationship between Haar invariance and the Fisher pivotal theory, see eaton99.

The theorem 2 assumes that , and are isomorphic. This assumption is relaxed in the corollaries 1 and 2 where the sample is replaced by a sufficient statistic : , and are assumed to be isomorphic. This trick is one of the two classical dimensionality reduction techniques concerning Haar measures applied to statistical problems and somehow restricts the likelihood to belong to the exponential family from Darmois theorem. The second trick consists schematically in replacing by the orbit of associated to the observed dataset . However, the whole set of assumptions that would be involved is more technical, see for example the general assumptions made by zidek69 or eaton02, and not investigated here.

### 2.4. Connection to other Bayesian and frequentist reconciliations

The result, which concerns hypothesis testing, may be related to the different approaches used to reconcile frequentist and Bayesian inference, to some extent for point estimation and especially for confidence intervals.

Group invariance applied to invariant inference is the classical frame of such unifications. The Fisherian pivotal theory (fisher73) is an important contribution mainly to the “frequentist” side and the right Haar measure to the “Bayesian” side. The reconciliation of the two approaches started with fraser61 and has been deeply studied since then, by zidek69 for example. The most general stage of unification is reached by eaton99. They make explicit the central hypothesis of the Fisherian pivotal theory and show, under quite standard invariance assumptions, that this hypothesis leads to a procedure which is identical to the Bayesian invariant procedure when using the prior induced by the right Haar measure. Note that they also show (and eaton02 in a more general manner) that for a Bayesian invariant inference to be admissible (in the sense that there exists no invariant inference whose mean quadratic error is lower for all $\theta$) it has to be obtained from the right Haar prior.

More concretely, the question related to reconciled probability domains is: “Under what assumptions does the following equality hold?”

$$\Pr\big(\theta \in R(x) \mid x\big) = \Pr\big(\theta \in R(x) \mid \theta\big) = \int_{\{x \,\mid\, \theta \in R(x)\}} dx\; p(x|\theta) \tag{9}$$

For the equality to hold, each probability needs to be a constant. After the initial work of fraser61, stein65 sketched the first conditions of what would later be called Stein’s theorem for invariant domains. The part which is common to the different “Stein’s theorems” is the following:

If a domain satisfies with , then under [some invariance assumptions],

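$$\begin{aligned} \Pr\big(\theta \in R(x) \mid x\big) &= c \quad \forall x \in \mathcal{X} && \text{(Bayesian probability)} \\ \text{and} \quad \Pr\big(\theta \in R(x) \mid \theta\big) &= c \quad \forall \theta \in \Theta && \text{(frequentist probability)} \end{aligned}$$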

One of the simplest sets of assumptions found since stein65 is that of chang86. It is relatively close to the one used for our results, presented in Section 2.2.

Our result, mainly holding in the theorem 1, is not a consequence of Stein’s theorem because the domain is not invariant in our case. would be invariant only if was invariant under the transformations group , i.e. if for all (this is equivalent to assuming that is invariant under ). But in the theorem 2, expressed and proved in the appendix 2 and used in the appendix 3 to prove the theorem 1, is assumed to be one-to-one for all , which implies that is equivalent to (identity function). So the domain is not invariant in our case and Stein’s theorem does not imply the reconciliation result presented in the section 2.2.

Theorem 1 does not answer the previous question, but rather relaxes the form of the domain and accepts a procedure that varies according to the observed dataset $x_0$ and the value $\theta_0$ of the parameter under $\mathrm{H}_0$. It answers the question: “Under what assumptions and for what domains $R$ and $C$ does the following equality hold?”

$$\int_{R(x_0,\theta_0) \subset \Theta} d\theta\; \pi(\theta|x_0) = \int_{C(x_0,\theta_0) \subset \mathcal{X}} dx\; p(x|\theta_0) \tag{10}$$

The domains found take the form

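$$\begin{aligned} R(x_0,\theta_0) &= \{\theta \mid p(x_0|\theta_0) \le p(x_0|\theta)\} \\ C(x_0,\theta_0) &= \{x \mid p(x_0|\theta_0)\,f(x_0) \le p(x|\theta_0)\,f(x)\} \end{aligned}$$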

where $f$ is some weighting function, actually given by the inverse of the left prior induced by the underlying group.
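Equality (10) can be checked numerically in the simplest invariant setting. The sketch below rests on assumptions that are not taken from the paper: a single sufficient statistic $s$ (for instance a sample mean) with $s \mid \theta \sim N(\theta, \mathrm{sd}^2)$, the translation group, a flat prior, and a constant weighting function $f$. As in Corollary 1, the sample is replaced by the sufficient statistic. The names `bayes_side` and `freq_side` are hypothetical.

```python
import random

def bayes_side(s0, theta0, sd, n_draws, rng):
    """Posterior mass of R = {theta : p(s0|theta0) <= p(s0|theta)}
    under the flat-prior posterior theta | s0 ~ N(s0, sd^2)."""
    hits = 0
    for _ in range(n_draws):
        theta = rng.gauss(s0, sd)
        # For a Gaussian density, p(s0|theta0) <= p(s0|theta)
        # iff theta is at least as close to s0 as theta0 is.
        hits += abs(s0 - theta) <= abs(s0 - theta0)
    return hits / n_draws

def freq_side(s0, theta0, sd, n_draws, rng):
    """Sampling mass under theta0 of C = {s : p(s0|theta0) <= p(s|theta0)},
    with a constant weighting function (assumed for the translation case)."""
    hits = 0
    for _ in range(n_draws):
        s = rng.gauss(theta0, sd)
        # p(s0|theta0) <= p(s|theta0) iff s is at least as close
        # to theta0 as the observed s0 is.
        hits += abs(s - theta0) <= abs(s0 - theta0)
    return hits / n_draws

# Illustrative values only (assumed, not from the paper).
rng = random.Random(0)
print(bayes_side(0.5, 0.0, 0.25, 20000, rng))
print(freq_side(0.5, 0.0, 0.25, 20000, rng))
```

Up to Monte Carlo error, the two printed values should agree, although the first integrates over the parameter given the data and the second integrates over the data given $\theta_0$.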

## 3. PLR for composite vs composite hypotheses testing

Up to this section, the PLR has only been defined in the simple ($\mathrm{H}_0: \theta = \theta_0$) vs composite case, i.e. according to dempster74’s first definition.

For the more general hypothesis presented at the end of the section 1.1, Dempster’s approach has been generalized by aitkin97, with a modification presented by aitkin10. Namely, aitkin10 proposes to compute and details and illustrates some advantages of the method. In the case of , it corresponds to Dempster’s definition (see page 42 of aitkin10). The approach of evans97 also carries interesting properties. In particular, a variety of optimality properties for inferences based on relative belief ratios are established in evans06, evans08 and evans11, which include optimal testing properties based on establishing a kind of Bayesian version of the Neyman-Pearson lemma.

However, the hypotheses test case on which they rely is not broad enough for many applications. The purpose of this section is to extend the definition of the PLR to the classical composite vs composite hypotheses test.

Suppose the data models related to the two hypotheses belong to the same parametric family . This assumption can actually be realized for any hypotheses test of parametric models by merging the tested parametric families in a so-called super-model. A composite vs composite hypotheses test consists in choosing among

$$
\mathrm{H}_0:\theta\in\Theta_0 \qquad \mathrm{H}_1:\theta\in\Theta_1 \tag{11}
$$

for any domains $\Theta_0$ and $\Theta_1$. We denote by $\Pi_0$ and $\Pi_1$ the prior distributions over $\Theta_0$ and $\Theta_1$.

In this section we propose two extensions of Dempster’s approach for this test case. The first extension can be used when the prior under one hypothesis is improper but both posteriors are proper. The second extension, made of two symmetric probabilities, is the statistic suggested by a new Bayesian-type Neyman-Pearson lemma, which also indicates that the LR is a central discrepancy variable.

### 3.1. Extensions of the PLR

In the simple vs composite hypotheses test, the PLR was primarily defined as

$$
\mathrm{PLR}(x,\zeta)=\int_{\{\theta_1 \mid p(x|\theta_1)<\zeta\,p(x|\theta_0)\}}\Pi_1(d\theta_1|x)
$$
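As a sanity check of this definition, the following minimal sketch compares the PLR at $\zeta=1$ with the two-sided p-value in the simple Gaussian case recalled in the abstract (known variance, flat prior on the mean, so that the posterior of the mean is Gaussian around the sample mean). The function names, the numerical values and the Monte Carlo approach are illustrative assumptions and are not taken from the paper.

```python
import math
import numpy as np


def plr_gaussian_mc(xbar, theta0, sigma, n, zeta=1.0, draws=200_000, seed=0):
    """Monte Carlo estimate of PLR(x, zeta) for a Gaussian mean with a flat prior.

    Assumption: known sigma and flat prior, so theta | x ~ N(xbar, sigma^2 / n).
    """
    rng = np.random.default_rng(seed)
    se = sigma / math.sqrt(n)
    theta1 = rng.normal(xbar, se, size=draws)  # posterior draws of theta1

    def loglik(theta):
        # Log-likelihood of the sample mean, up to a theta-independent constant.
        return -0.5 * ((xbar - theta) / se) ** 2

    # Event of the definition: p(x|theta1) < zeta * p(x|theta0).
    return float(np.mean(loglik(theta1) < math.log(zeta) + loglik(theta0)))


def two_sided_p_value(xbar, theta0, sigma, n):
    """Classical two-sided p-value for H0: theta = theta0 with known sigma."""
    z = abs(xbar - theta0) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2.0))


plr = plr_gaussian_mc(xbar=0.4, theta0=0.0, sigma=1.0, n=25)
p_value = two_sided_p_value(xbar=0.4, theta0=0.0, sigma=1.0, n=25)
print(plr, p_value)  # the two values should agree up to Monte Carlo error
```

With these hypothetical values the standardized statistic equals 2, so both quantities should be close to the usual two-sided tail probability of a standard Gaussian beyond 2.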

In the composite vs composite hypotheses test, a first interesting extension of this concept consists in defining the following statistic:

$$
\mathrm{PLR}_{01}(x,\zeta)=\int_{\{(\theta_0,\theta_1) \mid p(x|\theta_0)<\zeta\,p(x|\theta_1)\}}\Pi_0(d\theta_0|x)\,\Pi_1(d\theta_1|x) \tag{12}
$$

It is well defined as soon as the posterior distributions are both proper. Since only is known, the event can be measured only by integrating over all and all . Here we decide to measure it according to the posterior distribution of times the posterior distribution of , which is perfectly allowed.

A second interesting extension of the simple PLR consists in defining the following two symmetric statistics:

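$$
\mathrm{PLR}_0(x,\zeta)=\int_{\{(\theta_0,\theta_1) \mid p(x|\theta_1)<\zeta\,p(x|\theta_0)\}}\Pi_0(d\theta_0|x)\,\Pi_1(d\theta_1) \tag{13}
$$

$$
\mathrm{PLR}_1(x,\zeta)=\int_{\{(\theta_0,\theta_1) \mid p(x|\theta_0)<\zeta\,p(x|\theta_1)\}}\Pi_1(d\theta_1|x)\,\Pi_0(d\theta_0) \tag{14}
$$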

In the simple vs composite test, note that only and are equal to the PLR as defined by dempster74 and can thus be considered as extensions of the PLR. However, given the symmetry of the two hypotheses in a composite vs composite test, the notation will be also necessary in the sequel.

Each quantity has its own definition, interpretation, properties and field of use. We do not investigate the interpretation in depth here, and rather focus on unquestionable properties and results.

$\mathrm{PLR}_{01}$ is the only one of the two extensions that allows the use of improper priors. It is illustrated in the next subsection on a practical test of precipitation change, which requires a prior that is too smooth for the other extension to be used.

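On the other hand, the statistic $\mathrm{PLR}_1$ is the expectation over the prior under $\mathrm{H}_0$ of the posterior probability under $\mathrm{H}_1$ that the likelihood of $\theta_0$ is less than $\zeta$ times the likelihood of $\theta_1$, and reciprocally for $\mathrm{PLR}_0$:

$$
\mathrm{PLR}_1(x,\zeta)=E_0\big[\mathrm{Pr}_1\big(p(x|\theta_0)<\zeta\,p(x|\theta_1)\,\big|\,x\big)\big]
$$

$\mathrm{PLR}_0$ and $\mathrm{PLR}_1$ will appear as statistics emerging from a more general framework through a Bayesian-type Neyman-Pearson lemma.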
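The nested structure of this expectation (an outer average over prior draws of $\theta_0$, an inner posterior probability over $\theta_1$) can be sketched by Monte Carlo as follows. This is only an illustration under stated assumptions: the function name `plr1_nested` is hypothetical, the prior under $\mathrm{H}_0$ is assumed proper so that it can be sampled, and the inputs are log-likelihood values already evaluated at the draws.

```python
import numpy as np


def plr1_nested(loglik_prior0, loglik_post1, zeta):
    """Monte Carlo sketch of E_0[ Pr_1( p(x|theta0) < zeta * p(x|theta1) | x ) ].

    loglik_prior0: log p(x|theta0) at draws of theta0 from the PRIOR under H0.
    loglik_post1:  log p(x|theta1) at draws of theta1 from the POSTERIOR under H1.
    """
    l0 = np.asarray(loglik_prior0, dtype=float)
    l1 = np.sort(np.asarray(loglik_post1, dtype=float))

    # The event is equivalent to log p(x|theta1) > log p(x|theta0) - log(zeta).
    thresholds = l0 - np.log(zeta)

    # Inner posterior probability for each theta0: share of theta1 draws above the threshold.
    n_above = l1.size - np.searchsorted(l1, thresholds, side="right")
    inner_prob = n_above / l1.size

    # Outer expectation over the prior draws of theta0.
    return float(inner_prob.mean())
```

Swapping the roles of the two hypotheses (posterior draws for $\theta_0$, prior draws for $\theta_1$, and the reversed inequality) gives the analogous sketch for $\mathrm{PLR}_0$.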

Extending the interpretation of the new PLRs in terms of joint probabilities requires the definition of a measure over $\Theta_0\times\Theta_1$ given $x$ and one of the two hypotheses. Such a measure seems to make sense in terms of both mathematics and interpretation, but the issue needs further investigation.

###### Remark 1.

If all subsets defined on the sets and are independent, then the joint measure defined over is equal to:

$$
\Pi_{01,0}(d\theta_0,d\theta_1|x)=\Pi_0(d\theta_0|x)\,\Pi_1(d\theta_1)
$$

for infinitesimal subsets around any . The same holds when replacing the roles of and , and leads to the measure :

$$
\Pi_{01,1}(d\theta_0,d\theta_1|x)=\Pi_0(d\theta_0)\,\Pi_1(d\theta_1|x)
$$

The proof of the remark is given in Appendix 4. So if we assume that the joint measures exist and that the priors and posteriors are all proper, then the composite PLRs defined in equations (13) and (14) are probability measures.

### 3.2. Example: detection of a change in precipitation in Switzerland

Let us illustrate the $\mathrm{PLR}_{01}$ defined in equation (12).

Although the change in temperature in the 20th century is evident at a world scale and in some areas, a potential change in precipitation remains under study. As a simple case, let’s consider a single weather station in Switzerland and test whether the statistical properties of the rain frequency have changed.

As recalled for example by aksoy00, daily precipitation amounts are well described by a gamma distribution, characterized by a shape parameter $a$ and a rate parameter $b$. Assume the daily rainfalls recorded during the first five autumns of the 20th century are i.i.d. with parameters $a_1$ and $b_1$, and likewise during the last five autumns with parameters $a_2$ and $b_2$. Detecting a statistical change consists in testing whether the two sets of parameters are equal or not:

$$
\mathrm{H}_0:(a_1,b_1)=(a_2,b_2) \qquad \mathrm{H}_1:(a_1,b_1)\neq(a_2,b_2) \tag{15}
$$

Note that the dimension of is less than the dimension of , so that for a regular prior under , . borges07 are particularly interested in the behavior of the e-value of the FBST in such cases. Here it simply means that there is one prior under and the product of two priors under , to be combined respectively with the likelihood under and the likelihood under .

To enable simple simulations of the posterior distributions under both hypotheses, the conjugate prior (see the compendium by fink97) of the gamma distribution developed by miller80 is used for , with hyperparameters that may vary without much affecting the final results. The impact of the prior on the PLR is easy to see from the PLR display, as will be explained shortly. In practice, the prior is almost improper so that only the $\mathrm{PLR}_{01}$ defined in equation (12) can be used.

First, simulations roughly corresponding to the observed rainfall are performed. One dataset is simulated under and another is simulated under some reasonably similar alternative . The two simulated datasets are characterized by their likelihoods, displayed in Figure 1.

The posterior distribution of each couple , and is separately sampled by an MCMC multivariate slice sampling algorithm (radford03) implemented in the R package “SamplerCompare” kindly written and provided by thompson12. The PLR is simply computed by ordering the LRs obtained for all possible combinations of parameters and counting the fraction which is less than some threshold chosen according to the level of evidence wanted in favor of or . In practice, the PLR is displayed as a function of by simply displaying the empirical cumulative distribution of the LRs. This leads to Figure 2. It can be read for example that for the dataset under , , which means that there is an almost null probability that the likelihood under is more than 10 times greater than the likelihood under , so that is (correctly) clearly accepted. Alternatively, for the dataset, , meaning that there is probability one that the likelihood under is more than 10 times greater than the likelihood under , so that is (correctly) clearly rejected.
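The counting step described above can be sketched as follows, assuming the posterior draws under each hypothesis are already available (for instance from an MCMC sampler) and working on the log scale for numerical stability. The function names `gamma_loglik` and `plr01`, and the layout of the inputs, are hypothetical choices made for this illustration; the paper's own computations rely on the R package cited above.

```python
import math
import numpy as np


def gamma_loglik(data, shape, rate):
    """Log-likelihood of i.i.d. gamma observations (shape/rate parametrization)."""
    data = np.asarray(data, dtype=float)
    n = data.size
    return (n * (shape * math.log(rate) - math.lgamma(shape))
            + (shape - 1.0) * np.log(data).sum()
            - rate * data.sum())


def plr01(loglik_h0, loglik_h1, zeta):
    """Fraction of all (theta0, theta1) pairs with p(x|theta0) < zeta * p(x|theta1).

    loglik_h0: log-likelihoods at posterior draws under H0 (one value per draw).
    loglik_h1: log-likelihoods at posterior draws under H1 (one value per draw).
    """
    l0 = np.asarray(loglik_h0, dtype=float)
    l1 = np.asarray(loglik_h1, dtype=float)
    # Log of the LR for every combination of an H0 draw with an H1 draw.
    log_lr = l0[:, None] - l1[None, :]
    return float(np.mean(log_lr < math.log(zeta)))
```

Under the equal-parameters hypothesis each value in `loglik_h0` would be `gamma_loglik` evaluated on the pooled data with common parameters, whereas under the alternative each value in `loglik_h1` would be the sum of the two period-wise log-likelihoods. Evaluating `plr01` on a grid of thresholds traces the empirical cumulative distribution of the LRs; since the pairwise matrix has one entry per combination of draws, thinning the chains may be needed for long runs.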

Note that since the GLR indicates the lower bound of the support of the PLR and since the slope of the PLR is infinite there if the likelihood function is smooth enough at its maximum (see section 1.2), the exact expression of the prior only affects the way increases as departs from . Here for example the choice of the hyperparameters (within a domain considered reasonable) does not change the conclusion that would be drawn from the PLR displayed in Figure 2.

Switching to the true dataset , the PLR is obtained following the same procedure as with the simulated datasets. is displayed in Figure 3. The graph is, by construction of the simulations, very similar to the one obtained with the data simulated under . Now, and can clearly not be rejected, so that no change in 20th-century precipitation in Switzerland is detected, which is not surprising to climatologists.

### 3.3. Bayesian type Neyman-Pearson lemma

In the choice of a hypothesis, instead of considering the subset

$$
R^*(x)=\{(\theta_0,\theta_1) \mid p(x|\theta_0)<\zeta\,p(x|\theta_1)\} \tag{16}
$$

one might consider any subset , that may depend on . Such a subset could involve a discrepancy variable like in the predictive p-value highlighted by meng94, and take the form “”. The discrepancy variable that appears in the PLR is .

$R^*(x)$, defined from the LR test, is interesting for hypothesis testing because this set is a somewhat classical hypothesis rejection set. It is not a fully classical rejection set because it is defined on the parameter space rather than on the observation space, but its characterization is optimal in the frequentist setting. $R^*(x)$ is the set, depending on the dataset $x$, of all fixed $(\theta_0,\theta_1)$ such that the likelihood of $\theta_0$ is less than $\zeta$ times the likelihood of $\theta_1$, which reasonably leads to rejecting $\mathrm{H}_0$ for this element $(\theta_0,\theta_1)$. In the same way, one can replace the LR test by any test, i.e. consider any subset $R(x)$ such that for $(\theta_0,\theta_1)\in R(x)$, $\mathrm{H}_0$ would be rejected.

With such a phrasing, it may appear natural that the frequentist Neyman-Pearson lemma can be derived in a reciprocal, somewhat Bayesian, framework. Note that the Neyman-Pearson lemma can be expressed, as will be the proposition here, symmetrically in the two hypotheses. The symmetry is only broken when adopting the Neyman paradigm, which fixes a level for the PFA and deduces the corresponding (see section 1.1).

To rederive a Neyman-Pearson lemma one would define the reciprocal notions of “Probability of False Alarm” and “Probability of good Detection”:

$$
\mathrm{PFA}_B(R,x)=\int_{R(x)}\Pi_0(d\theta_0|x)\,\Pi_1(d\theta_1) \tag{17}
$$

$$
\mathrm{PD}_B(R,x)=\int_{R(x)}\Pi_1(d\theta_1|x)\,\Pi_0(d\theta_0) \tag{18}
$$

These quantities would also define probability measures if the joint measures exist and if the priors and posteriors are all proper.

Note that these measures can be related to a joint measure with no conditioning over the hypothesis: for any set $R$ possibly depending on $x$,

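$$
\begin{aligned}
\Pr(R,x) &= \Pr(\mathrm{H}_0)\,\Pr(R|x,\mathrm{H}_0)+\Pr(\mathrm{H}_1)\,\Pr(R|x,\mathrm{H}_1)\\
\text{with}\quad \Pr(R|x,\mathrm{H}_i) &= \int_R \Pi_{01,i}(d\theta_0,d\theta_1|\mathrm{H}_i,x) = \int_R \Pi_i(d\theta_i|x)\,\Pi_j(d\theta_j)\\
\text{so}\quad \Pr(R|x) &= \Pr(\mathrm{H}_0)\int_R \Pi_0(d\theta_0|x)\,\Pi_1(d\theta_1)+\Pr(\mathrm{H}_1)\int_R \Pi_1(d\theta_1|x)\,\Pi_0(d\theta_0)\\
&= \Pr(\mathrm{H}_0)\,\mathrm{PFA}_0(R,x)+\Pr(\mathrm{H}_1)\,\mathrm{PD}_1(R,x)
\end{aligned}
$$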

The Bayesian type probabilities and add up the same way type I and type II error probabilities add up in a frequentist integral. Note also that where is the set complementary to in .

Following the underlying idea of the Neyman-Pearson approach, a possibility for choosing consists in maximizing over for a fixed .

###### Proposition 1.

The subset that maximizes for a fixed value of is equal to the LR subset defined in equation (16). In this case, the “Bayesian PFA and PD” are given by and .
Reciprocally, the subset that maximizes for a fixed value of is equal to , i.e. the set which accepts according to the LR test. In this case, and .

As postdata measures (i.e. depending on the observed data), contrary to the predata frequentist PFA and PD, it is therefore informative enough to give and for some value of interest. But this is only possible if the priors and posteriors under both hypotheses are proper.

The proof of the proposition follows the proof of the Neyman-Pearson lemma restricted to deterministic tests. It is given in Appendix 5.
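The proposition can be checked by brute force on a small discrete example, where the integrals in (17) and (18) become finite sums. The sketch below reads the proposition, by analogy with the classical lemma, as "no subset with a Bayesian PFA at most that of the LR subset has a larger Bayesian PD". All numerical values (likelihoods, priors, grid sizes) and function names are hypothetical and chosen only for illustration.

```python
import itertools
import numpy as np

# Hypothetical toy setup: discrete parameter grids, so integrals become sums.
lik0 = np.array([0.20, 0.05])          # p(x | theta0) on a 2-point grid for Theta0
lik1 = np.array([0.30, 0.10, 0.02])    # p(x | theta1) on a 3-point grid for Theta1
prior0 = np.array([0.5, 0.5])          # proper prior under H0
prior1 = np.array([0.2, 0.3, 0.5])     # proper prior under H1

post0 = prior0 * lik0 / np.sum(prior0 * lik0)
post1 = prior1 * lik1 / np.sum(prior1 * lik1)

# Weight of each pair (theta0, theta1) under the two product measures.
w_pfa = np.outer(post0, prior1)   # Pi0(dtheta0 | x) * Pi1(dtheta1), as in eq. (17)
w_pd = np.outer(prior0, post1)    # Pi0(dtheta0) * Pi1(dtheta1 | x), as in eq. (18)


def bayes_pfa_pd(mask):
    """Bayesian PFA and PD of the subset encoded by a boolean mask over the grid."""
    return float(w_pfa[mask].sum()), float(w_pd[mask].sum())


def lr_subset(zeta):
    """LR subset of eq. (16): pairs with p(x|theta0) < zeta * p(x|theta1)."""
    return lik0[:, None] < zeta * lik1[None, :]


def lr_subset_is_optimal(zeta, tol=1e-12):
    """True if no subset with PFA <= PFA(LR subset) has a strictly larger PD."""
    pfa_star, pd_star = bayes_pfa_pd(lr_subset(zeta))
    for bits in itertools.product([False, True], repeat=w_pfa.size):
        mask = np.array(bits).reshape(w_pfa.shape)
        pfa, pd = bayes_pfa_pd(mask)
        if pfa <= pfa_star + tol and pd > pd_star + tol:
            return False
    return True


print(bayes_pfa_pd(lr_subset(1.0)), lr_subset_is_optimal(1.0))
```

The enumeration covers all 64 subsets of the 2-by-3 grid, which is only feasible for such tiny examples; it illustrates the statement and does not replace the proof.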

## 4. Concluding general discussion about the PLR

The PLR introduced by dempster74 in the simple vs composite hypotheses test deserves much attention. It compares the original likelihoods and by computing the posterior probability that this usual LR test chooses or . The PLR is simple, nicely interpretable and coupled with some deep properties. Compared to the classical Bayesian hypotheses tests, first note that unlike the BF, the PLR can be defined even for improper priors, and unlike it does not require the delicate choice of some . This is crucial in practice as well as in fundamental issues like Lindley’s paradox.

The PLR also turns out to be a very natural alternative to the BF in many aspects. The PLR first compares (the original likelihoods) and then integrates, whereas the BF first integrates and then compares (the marginal likelihoods). In the simple vs composite hypotheses test, considering as a random variable for a fixed , the PLR is its posterior cumulative distribution (i.e. the probability of a one sided credible interval) whereas the BF is its posterior mean point estimate. This credible interval vs point estimate duality between the PLR and the BF also translates into decision theory: hwang92 stressed that does not measure evidence, since this is done only through the likelihood, but measures the accuracy of a test by estimating the indicator function . Also note that, being the measure of a credible interval, the PLR is also a natural hypothesis testing tool which connects postdata (i.e. conditioned upon ) hypotheses testing and credible interval inference. This formal equivalence was known to hold for predata inference (a rejection set is equivalent to a confidence interval) and “known” not to hold for postdata inference for usual Bayesian tools (see lehmann05 and goutis97). Tools like the PLR set up this connection.
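The posterior-mean reading of the BF can be made explicit in one line. The derivation below is a sketch under stated assumptions: the likelihood ratio is oriented as $\mathrm{LR}(\theta)=p(x|\theta_0)/p(x|\theta)$, the prior density $\pi$ under the composite hypothesis is proper, and $m(x)=\int p(x|\theta)\,\pi(\theta)\,d\theta$ denotes the corresponding marginal likelihood (a notation introduced here for illustration).

```latex
\begin{aligned}
E\big[\mathrm{LR}(\theta)\,\big|\,x\big]
  &= \int \frac{p(x|\theta_0)}{p(x|\theta)}\;\pi(\theta|x)\,d\theta
   = \int \frac{p(x|\theta_0)}{p(x|\theta)}\;\frac{p(x|\theta)\,\pi(\theta)}{m(x)}\,d\theta \\
  &= \frac{p(x|\theta_0)}{m(x)}\int \pi(\theta)\,d\theta
   = \frac{p(x|\theta_0)}{m(x)} ,
\end{aligned}
```

The right-hand side is the Bayes factor of the simple hypothesis against the composite one. The last step uses $\int\pi(\theta)\,d\theta=1$, which is where the properness of the prior enters.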

However, when generalizing the PLR in Section 3.1, most of these dual properties do not carry over to the composite vs composite hypotheses test. Instead, a reciprocity between the PLR and the BF exists through a Neyman-Pearson lemma perspective. The second extension of the PLR has been shown in Section 3.3 to be a somewhat optimal measure, in that it measures the set that maximizes for a fixed (Bayesian-type version of the frequentist Neyman-Pearson lemma). Reciprocally, the BF gives a somewhat optimal measure, although in the frequentist Neyman-Pearson sense, in that it maximizes the average over of for a fixed PFA (frequentist classical Neyman-Pearson lemma but for the marginal likelihood and not the original unknown one).

In the simple vs composite hypotheses test, the connection between the PLR (related to credible interval) and the BF (related to point estimate) has been underlined. Another important connection lies between frequentist and Bayesian type hypotheses tests, namely frequentist p-values and or PLR. This reconciliation quest has been the subject of many debates, including Lindley’s paradox in its simplest form (test of the mean of a Gaussian with a uniform prior), where reconciliation has so far been reached simply only with the PLR, by dempster74. In Section 2.2 we have generalized this reconciliation result to a quite general invariant framework, close to the one used in Stein’s theorem, i.e. a framework under which confidence and credible intervals are equivalent. Note that invariance is also a perspective adopted to develop and evaluate inferences, and in particular to develop new p-values as done recently by evans10 for example. For the PLR, standard simple invariance properties follow directly from the simple use of the likelihoods.

To conclude on the contribution of this paper, the equivalence between the PLR and a p-value has been proved in a general invariant frame, which nicely connects to the equivalence between confidence and credible domains. This result may contribute to a better understanding of deep and fundamental issues related to both hypotheses testing and parameter estimation, in both frequentist and Bayesian paradigms.

## Appendix 1: Introduction to invariance in statistics

For a locally compact Hausdorff group , denotes the class of all continuous real-valued functions on that have compact support. The left invariant Haar measure on is defined as a Radon measure such that for all and all ,

$$
\int_G f(g)\,H_l(dg)=\int_G f(g_0 g)\,H_l(dg)
$$

The right invariant Haar measure on is defined as but replacing by . For a given group, both Haar measures exist and are unique up to multiplicative constants.

The (right) modulus of is the real positive valued function such that if is a left invariant Haar measure, then for all and all ,

$$
\int f(g g_0^{-1})\,H_l(dg)=\Delta(g_0)\int f(g)\,H_l(dg) \tag{19}
$$

From the uniqueness of the Haar measure, does not depend on the choice of and is a continuous function such that for all , , which implies that . Note that for a group the set of all right Haar measures is equal to the set of the left Haar measures if and only if is identically equal to 1. This occurs for example when is compact or commutative.

Concerning the Haar measures on the group , the initial definitions and properties imply that if is a left invariant Haar measure on and the modulus of then for all

$$
\int f(g^{-1})\,H_l(dg)=\int f(g)\,\Delta(g)^{-1}\,H_l(dg) \tag{20}
$$

The modulus also makes it possible to relate right and left invariant Haar measures. From the last property, the measure defined by

$$
H_r(dg)=\Delta(g)^{-1}\,H_l(dg) \tag{21}
$$

is a right invariant Haar measure on . In the same way, if is a right invariant Haar measure on , then the measure defined by is a left invariant Haar measure.

The Haar measure is applied to statistics through the concept of invariance of a data model under a group of transformations. A parametric family of densities with respect to any measure on is said to be invariant under the transformations group if for each there exists a unique such that if the distribution of has the density then has the density . This property defines the action of on : may simply be denoted where defines a group.

A measure on is said to be relatively invariant with multiplier under the group if for all and

$$
\int f(x)\,\mu(dx)=\chi(g)\int f(gx)\,\mu(dx) \tag{22}
$$

If we assume that both the family of densities and the measure are respectively invariant and relatively invariant, schematically we get for all and . For more about the connection between such a multiplier and the Jacobian of the transformation that leads to from , see for example berger85 or eaton07. Note that the theorem 2 could be formulated differently, by defining the invariance of a probability model, but this phrasing is less common than the invariance of a family of probability densities and this would have entailed a longer presentation.

To shorten the preliminaries and without assuming any knowledge about group theory, we will not refer to group properties like transitivity, orbits… and will concretely simply assume that and are isomorphic. More precisely, we will assume that the transformation with is one-to-one whatever . The right Haar prior on is to be induced from the right Haar measure on and the action of on . From the frame chosen, the right Haar prior is simply defined by , with . As shown in villegas81, it turns out that the measure actually does not depend on . The induced prior is therefore unique for a fixed and noted . means that for any measurable subset , with . Note that a subset denotes an infinitesimal subset centered around , where is implicit. can be normalized into a probability measure if and only if the group is compact, and in this case we can go back to the usual notation where the measure is implicit in .

Finally, from the data model density and the prior , the posterior measure on is classically defined by

 (23) Πrx(B) =