Objective Priors: An Introduction for Frequentists
Bayesian methods are increasingly applied in the theory and practice of statistics. Any Bayesian inference depends on a likelihood and a prior. Ideally one would like to elicit a prior from related sources of information or past data. However, in the absence of such information, Bayesian methods need to rely on some “objective” or “default” priors, and the resulting posterior inference can still be quite valuable.
Not surprisingly, over the years, the catalog of objective priors has become prohibitively large, and one has to set some specific criteria for the selection of such priors. Our aim is to review some of these criteria, compare their performance, and illustrate them with some simple examples. While for very large sample sizes the choice of objective prior hardly matters, the selection of such a prior does influence inference for small or moderate samples. For regular models where asymptotic normality holds, Jeffreys’ general rule prior, the positive square root of the determinant of the Fisher information matrix, enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, however, there are many other priors which emerge as optimal depending on the criterion selected. One new feature in this article is that a prior different from Jeffreys’ is shown to be optimal under the chi-square divergence criterion even in the absence of nuisance parameters. This prior is also invariant under one-to-one reparameterization.
Volume 26, Issue 2 (2011), pages 187–202. DOI: 10.1214/10-STS338
Discussed in DOI 10.1214/11-STS338A and DOI 10.1214/11-STS338B; rejoinder at DOI 10.1214/11-STS338REJ.
Keywords: Asymptotic expansion, divergence criterion, first-order probability matching, Jeffreys’ prior, left Haar priors, location family, location–scale family, multiparameter, orthogonality, reference priors, right Haar priors, scale family, second-order probability matching, shrinkage argument.
Bayesian methods have been increasingly used in recent years in the theory and practice of statistics. Their implementation requires specification of both a likelihood and a prior. With enough historical data, it is possible to elicit a prior distribution fairly accurately. However, even in the absence of such data, Bayesian methods, if judiciously used, can produce meaningful inferences based on the so-called “objective” or “default” priors.
The main focus of this article is to introduce certain objective priors which could be potentially useful even for frequentist inference. One such example where frequentists are yet to reach a consensus about an “optimal” approach is the construction of confidence intervals for the ratio of two normal means, the celebrated Fieller–Creasy problem. It is shown in Section 4 of this paper how an “objective” prior produces a credible interval in this case which meets the target coverage probability of a frequentist confidence interval even for small or moderate sample sizes. Another situation, which has often become a real challenge for frequentists, is to find a suitable method for elimination of nuisance parameters when the dimension of the parameter grows in direct proportion to the sample size. This is what is usually referred to as the Neyman–Scott phenomenon. We will illustrate in Section 3, with an example, how an objective prior can sometimes overcome this problem.
Before getting into the main theme of this paper, we recount briefly the early history of objective priors. One of the earliest uses is usually attributed to Bayes (1763) and Laplace (1812) who recommended using a uniform prior for the binomial proportion $p$ in the absence of any other information. While intuitively quite appealing, this prior has often been criticized due to its lack of invariance under one-to-one reparameterization. For example, a uniform prior for $p$ in the binomial case does not result in a uniform prior for $p^2$. A more compelling example is that a uniform prior for $\sigma$, the population standard deviation, does not result in a uniform prior for $\sigma^2$, and the converse is also true. In a situation like this, it is not at all clear whether there can be any preference to assign a uniform prior to either $\sigma$ or $\sigma^2$.
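The lack of invariance of the uniform prior can be seen in a small simulation. The sketch below is illustrative only: it assumes, purely so that the uniform prior is proper, that the standard deviation is restricted to the interval (0, 1); the function name `fraction_below` and the cutoff 0.25 are hypothetical choices, not part of the original discussion. If the implied prior on the variance were also uniform, about 25% of the variance draws would fall below 0.25; instead about 50% do, because the variance is below 0.25 exactly when the standard deviation is below 0.5.

```python
import random


def fraction_below(values, cutoff):
    """Share of the draws that fall strictly below `cutoff`."""
    return sum(v < cutoff for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is reproducible

# Assumption for illustration: sigma ~ Uniform(0, 1), a proper uniform prior.
sigma_draws = [rng.random() for _ in range(200_000)]

# The prior on sigma^2 implied by the uniform prior on sigma.
variance_draws = [s * s for s in sigma_draws]

frac_sigma = fraction_below(sigma_draws, 0.25)
frac_variance = fraction_below(variance_draws, 0.25)

print(f"P(sigma   < 0.25) is roughly {frac_sigma:.3f}")
print(f"P(sigma^2 < 0.25) is roughly {frac_variance:.3f}")
```

The implied density of the variance on (0, 1) is $1/(2\sqrt{v})$, which piles mass near zero, so "no information about $\sigma$" and "no information about $\sigma^2$" are different statements under the uniform rule.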
In contrast, Jeffreys’ (1961) general rule prior, namely, the positive square root of the determinant of the Fisher information matrix, is invariant under one-to-one reparameterization of parameters. We will motivate this prior from several asymptotic considerations. In particular, for regular models where asymptotic normality holds, Jeffreys’ prior enjoys many optimality properties in the absence of nuisance parameters. In the presence of nuisance parameters, this prior suffers from many problems, such as the marginalization paradox and the Neyman–Scott problem, to name just a few. Indeed, for the location–scale models, Jeffreys himself recommended alternate priors.
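The invariance of Jeffreys’ prior can be checked numerically in a simple case. The following is a minimal sketch, assuming a single Bernoulli observation with success probability $p$ and the logit reparameterization $\phi=\log\{p/(1-p)\}$; the helper names (`loglik_p`, `loglik_phi`, `fisher_information`, `jeffreys_check`) are hypothetical. The Fisher information is computed in each parameterization as the expected squared score, with the score obtained by central differences, and the prior computed directly in $\phi$ is compared with the prior in $p$ multiplied by the Jacobian $dp/d\phi=p(1-p)$.

```python
import math


def loglik_p(x, p):
    """Bernoulli log-likelihood in the probability parameterization."""
    return x * math.log(p) + (1 - x) * math.log(1 - p)


def loglik_phi(x, phi):
    """The same log-likelihood written in terms of the logit phi."""
    p = 1.0 / (1.0 + math.exp(-phi))
    return loglik_p(x, p)


def fisher_information(loglik, value, prob_success, step=1e-5):
    """Expected squared score; the score is a central finite difference."""
    info = 0.0
    for x, weight in ((1, prob_success), (0, 1.0 - prob_success)):
        score = (loglik(x, value + step) - loglik(x, value - step)) / (2 * step)
        info += weight * score ** 2
    return info


def jeffreys_check(p):
    """Return (Jeffreys prior computed in phi, Jeffreys prior in p times dp/dphi)."""
    phi = math.log(p / (1 - p))
    jacobian = p * (1 - p)  # dp/dphi for the logit map
    direct = math.sqrt(fisher_information(loglik_phi, phi, p))
    transported = math.sqrt(fisher_information(loglik_p, p, p)) * jacobian
    return direct, transported


results = {p: jeffreys_check(p) for p in (0.1, 0.3, 0.5, 0.8)}
for p, (direct, transported) in results.items():
    print(f"p={p}: direct={direct:.6f}, transported={transported:.6f}")
```

The two columns agree up to finite-difference error, which is what invariance means here. A uniform prior on $p$ would instead be carried to the nonconstant density $p(1-p)$ on the $\phi$ scale.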
There are several criteria for the construction of objective priors. The present article primarily reviews two of these criteria in some detail, namely, “divergence priors” and “probability matching priors,” and finds optimal priors under these criteria. The class of divergence priors includes “reference priors” introduced by Bernardo (1979). The “probability matching priors” were introduced by Welch and Peers (1963). There have been many generalizations of these in the past two decades. The development of both these priors relies on asymptotic considerations. Somewhat more briefly, we also discuss a few other priors, including the “right” and “left” Haar priors.
The paper does not claim to match the extensive, thorough and comprehensive review of Kass and Wasserman (1996), nor does it aspire to the somewhat narrowly focused, but very comprehensive, reviews of probability matching priors given in Ghosh and Mukerjee (1998), Datta and Mukerjee (2004) and Datta and Sweeting (2005). A very comprehensive review of reference priors is now available in Bernardo (2005), and a unified approach is given in the recent article of Berger, Bernardo and Sun (2009).
While primarily a review, the present article has been able to unify as well as generalize some of the previously considered criteria, for example, viewing the reference priors as members of a bigger class of divergence priors. Interestingly, with some of these criteria as presented here, it is possible to construct some alternatives to Jeffreys’ prior even in the absence of nuisance parameters.
The outline of the remaining sections is as follows. In Section 2 we introduce two basic tools to be used repeatedly in the subsequent sections. One such tool involving asymptotic expansion of the posterior density is due to Johnson (1970), and Ghosh, Sinha and Joshi (1982), and is discussed quite extensively in Ghosh, Delampady and Samanta (2006) and Datta and Mukerjee (2004). The second tool involves a shrinkage argument suggested by Dawid and used extensively by J. K. Ghosh and his co-authors. It is shown in Section 3 that this shrinkage argument can also be used in deriving priors with the criterion of maximizing the distance between the prior and the posterior. The distance measure used includes, but is not limited to, the Kullback–Leibler (K–L) distance considered in Bernardo (1979) for constructing two-group “reference priors.” Also, in this section we consider a new prior, different from Jeffreys’ prior even in the one-parameter case, which is also invariant under one-to-one reparameterization. Section 4 addresses construction of priors under probability matching criteria. Certain other priors are introduced in Section 5, and it is pointed out that some of these priors can often provide exact and not just asymptotic matching. Some final remarks are made in Section 6.
Throughout this paper the results are presented more or less in a heuristic fashion, that is, without paying much attention to the regularity conditions needed to justify these results. More emphasis is placed on the application of these results in the construction of objective priors.
2 Two Basic Tools
An asymptotic expansion of the posterior density began with Johnson (1970), followed up later by Ghosh, Sinha and Joshi (1982), and many others. The result goes beyond the Bernstein–von Mises theorem, which provides asymptotic normality of the posterior density. Typically, such an expansion is centered around the MLE (and occasionally the posterior mode), and requires only derivatives of the log-likelihood with respect to the parameters, evaluated at the MLE. These expansions are available even for heavy-tailed densities such as Cauchy because finiteness of moments of the distribution is not needed. The result goes a long way in finding asymptotic expansions for the posterior moments of parameters of interest as well as in finding asymptotic posterior predictive distributions.
The asymptotic expansion of the posterior resembles an Edgeworth expansion, but, unlike the latter, this approach does not require cumulants of the distribution. Finding cumulants, though conceptually easy, can become quite formidable, especially in the presence of multiple parameters, demanding evaluation of mixed cumulants.
We have used this expansion as a first step in the derivation of objective priors under different criteria. Together with the shrinkage argument as mentioned earlier in the Introduction, and to be discussed later in this section, one can easily unify and extend many of the known results on prior selection. In particular, we will see later in this section how some of the reference priors of Bernardo (1979) can be found via application of these two tools. The approach also leads to a somewhat surprising result involving asymptotic expansion of the distribution function of the MLE in a fairly general setup, and is not restricted to any particular family of distributions, for example, the exponential family, or the location–scale family. A detailed exposition is available in Datta and Mukerjee (2004, pages 5–8).
For simplicity of exposition, we consider primarily the one-parameter case. Results needed for the multiparameter case will occasionally be mentioned, and, in most cases, these are straightforward, albeit often cumbersome, extensions of one-parameter results. Moreover, as stated in the Introduction, the results will be given without full rigor, that is, without any specific mention of the needed regularity conditions.
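We begin with $X_1,\ldots,X_n|\theta$ i.i.d. with common p.d.f. $f(X|\theta)$. Let $\hat\theta_n$ denote the MLE of $\theta$. The likelihood function is denoted by $L_n(\theta)=\prod_{i=1}^n f(X_i|\theta)$ and let $\ell_n(\theta)=\log L_n(\theta)$. Let $a_i=n^{-1}[d^i\ell_n(\theta)/d\theta^i]_{\theta=\hat\theta_n}$, $i=1,2,\ldots,$ and let $\hat I_n=-a_2$, the observed per unit Fisher information number. Consider a twice differentiable prior $\pi$. Let $T_n=\sqrt{n}(\theta-\hat\theta_n)\hat I_n^{1/2}$, and let $\pi_n^*(t)$ denote the posterior p.d.f. of $T_n$ given $X_1,\ldots,X_n$. Then, under certain regularity conditions, we have the following result.
Theorem 1.
\[
\pi_n^*(t)=\phi(t)\bigl[1+n^{-1/2}\gamma_1(t;X_1,\ldots,X_n)+n^{-1}\gamma_2(t;X_1,\ldots,X_n)\bigr]+O_p(n^{-3/2}),
\]
where $\phi(t)$ is the standard normal p.d.f.,
\[
\gamma_1(t;X_1,\ldots,X_n)=\frac{a_3t^3}{6\hat I_n^{3/2}}+\frac{t}{\hat I_n^{1/2}}\frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}
\]
and
\[
\gamma_2(t;X_1,\ldots,X_n)=\frac{a_4t^4}{24\hat I_n^{2}}+\frac{a_3^2t^6}{72\hat I_n^{3}}+\frac{t^2}{2\hat I_n}\frac{\pi''(\hat\theta_n)}{\pi(\hat\theta_n)}+\frac{a_3t^4}{6\hat I_n^{2}}\frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}-\frac{a_4}{8\hat I_n^{2}}-\frac{15a_3^2}{72\hat I_n^{3}}-\frac{1}{2\hat I_n}\frac{\pi''(\hat\theta_n)}{\pi(\hat\theta_n)}-\frac{a_3}{2\hat I_n^{2}}\frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}.
\]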
The proof is given in Ghosh, Delampady and Samanta (2006, pages 107–108). The statement there involves a few minor typos which can be corrected easily. We outline here only a few key steps needed in the proof.
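We begin with the posterior p.d.f.,
\[
\pi(\theta|X_1,\ldots,X_n)=\frac{\exp[\ell_n(\theta)]\pi(\theta)}{\int\exp[\ell_n(\theta)]\pi(\theta)\,d\theta}.
\]
Substituting $t=\sqrt{n}(\theta-\hat\theta_n)\hat I_n^{1/2}$, the posterior p.d.f. of $T_n$ is given by
\[
\pi_n^*(t)=C_n^{-1}\exp\bigl[\ell_n\{\hat\theta_n+t(n\hat I_n)^{-1/2}\}-\ell_n(\hat\theta_n)\bigr]\pi\{\hat\theta_n+t(n\hat I_n)^{-1/2}\},
\]
where $C_n=\int\exp[\ell_n\{\hat\theta_n+t(n\hat I_n)^{-1/2}\}-\ell_n(\hat\theta_n)]\pi\{\hat\theta_n+t(n\hat I_n)^{-1/2}\}\,dt$.
The rest of the proof involves a Taylor expansion of $\exp[\ell_n\{\hat\theta_n+t(n\hat I_n)^{-1/2}\}-\ell_n(\hat\theta_n)]$ and $\pi\{\hat\theta_n+t(n\hat I_n)^{-1/2}\}$ around $\hat\theta_n$ up to a desired order, and collecting the coefficients of $n^{-1/2}$, $n^{-1}$, etc. The other component is evaluation of $C_n$ via moments of the N(0, 1) distribution.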
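The above result is useful in finding certain expansions for the posterior moments as well. In particular, noting $\theta=\hat\theta_n+(n\hat I_n)^{-1/2}t$, it follows that the asymptotic expansion of the posterior mean of $\theta$ is given by
\[
E(\theta|X_1,\ldots,X_n)=\hat\theta_n+n^{-1}\biggl\{\frac{a_3}{2\hat I_n^{2}}+\frac{1}{\hat I_n}\frac{\pi'(\hat\theta_n)}{\pi(\hat\theta_n)}\biggr\}+O_p(n^{-3/2}).
\]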
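The posterior mean expansion can be checked against a case where the posterior is available in closed form. The sketch below makes several assumptions that are not in the text: the data are i.i.d. exponential with rate $\theta$, the prior is $\pi(\theta)=e^{-\theta}$, and the expansion is taken in the form $\hat\theta_n+n^{-1}\{a_3/(2\hat I_n^2)+\pi'(\hat\theta_n)/(\hat I_n\pi(\hat\theta_n))\}$, where $a_3$ is the third derivative of the log-likelihood per observation at the MLE and $\hat I_n$ is the observed per unit information. Under these assumptions the posterior is Gamma with shape $n+1$ and rate $\sum X_i+1$, so the exact mean is known. The function names are hypothetical.

```python
def exact_posterior_mean(n, total):
    """Mean of the Gamma(shape n + 1, rate total + 1) posterior."""
    return (n + 1) / (total + 1)


def expansion_posterior_mean(n, total):
    """First-order corrected posterior mean under the assumed expansion."""
    theta_hat = n / total             # MLE of the exponential rate
    a3 = 2.0 / theta_hat ** 3         # third log-likelihood derivative per observation
    info_hat = 1.0 / theta_hat ** 2   # observed per-unit Fisher information
    prior_log_derivative = -1.0       # pi'(theta) / pi(theta) for pi(theta) = exp(-theta)
    correction = a3 / (2 * info_hat ** 2) + prior_log_derivative / info_hat
    return theta_hat + correction / n


def errors(n, theta_hat=2.0):
    """Absolute errors of the MLE and of the expansion against the exact mean."""
    total = n / theta_hat             # data total that makes the MLE equal theta_hat
    exact = exact_posterior_mean(n, total)
    return abs(exact - theta_hat), abs(exact - expansion_posterior_mean(n, total))


mle_error_50, expansion_error_50 = errors(50)
mle_error_200, expansion_error_200 = errors(200)
print(mle_error_50, expansion_error_50)
print(mle_error_200, expansion_error_200)
```

In this example the correction term reduces to $(\hat\theta_n-\hat\theta_n^2)/n$, and the corrected value tracks the exact posterior mean far more closely than the MLE alone, with the gap shrinking quickly as $n$ grows.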
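A multiparameter extension of Theorem 1 is as follows. Suppose that $\theta=(\theta_1,\ldots,\theta_p)^T$ is the parameter vector and $\hat\theta_n$ is the MLE of $\theta$. Let
\[
a_{jr}=n^{-1}\biggl[\frac{\partial^2\ell_n(\theta)}{\partial\theta_j\,\partial\theta_r}\biggr]_{\theta=\hat\theta_n},\qquad
a_{jrs}=n^{-1}\biggl[\frac{\partial^3\ell_n(\theta)}{\partial\theta_j\,\partial\theta_r\,\partial\theta_s}\biggr]_{\theta=\hat\theta_n}
\]
and $\hat I_n=-((a_{jr}))$. Then retaining only up to the $O(n^{-1/2})$ term, the posterior of $W_n=\sqrt{n}(\theta-\hat\theta_n)$ is given by
\[
\pi_n^*(w)=(2\pi)^{-p/2}|\hat I_n|^{1/2}\exp\bigl[-\tfrac12 w^T\hat I_nw\bigr]\biggl[1+n^{-1/2}\biggl\{\sum_{j=1}^p w_j\biggl(\frac{\partial\log\pi}{\partial\theta_j}\biggr)_{\theta=\hat\theta_n}+\frac16\sum_{j,r,s}w_jw_rw_sa_{jrs}\biggr\}\biggr]+O_p(n^{-1}).
\]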
Next we present the basic shrinkage argument of J. K. Ghosh discussed in detail in Datta and Mukerjee (2004). The prime objective here is evaluation of $E[q(X,\theta)|\theta]=\lambda(\theta)$, say, where $X$ and $\theta$ can be real- or vector-valued. The idea is to find first $\int\lambda(\theta)\bar\pi_m(\theta)\,d\theta$ through a sequence of priors $\{\bar\pi_m(\theta)\}$ defined on a compact set, and then shrinking the prior to degeneracy at some interior point, say, $\theta$ of the compact set. The interesting point is that one never needs explicit specification of $\bar\pi_m(\theta)$ in carrying out this evaluation. We will see several illustrations of this in this article.
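First, we present the shrinkage argument in a nutshell. Consider a proper prior $\bar\pi(\cdot)$ with a compact rectangle as its support in the parameter space, such that $\bar\pi(\theta)$ vanishes on the boundary of the support, while remaining positive in the interior. The support of $\bar\pi$ is the closure of the set $\{\theta\colon\bar\pi(\theta)>0\}$. Consider the posterior of $\theta$ under $\bar\pi$ and, hence, obtain $E^{\bar\pi}[q(X,\theta)|X]$. Then find $E[E^{\bar\pi}\{q(X,\theta)|X\}|\theta]$ for $\theta$ in the interior of the support of $\bar\pi$. Finally, integrate $E[E^{\bar\pi}\{q(X,\theta)|X\}|\theta]$ with respect to $\bar\pi(\cdot)$, and then allow $\bar\pi(\cdot)$ to converge to the degenerate prior at the true value of $\theta$ at an interior point of the support of $\bar\pi$. This yields $E[q(X,\theta)|\theta]$. The calculation assumes integrability of $q(X,\theta)$ over the joint distribution of $X$ and $\theta$. Such integrability allows change in the order of integration.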
When executed up to the desired order of approximation, under suitable assumptions, these steps can lead to significant reduction in the algebra underlying higher order frequentist asymptotics. The simplification arises on two counts. First, although the Bayesian approach to frequentist asymptotics requires Edgeworth type assumptions, it avoids an explicit Edgeworth expansion involving calculation of approximate cumulants. Second, as we will see, it helps establish the results in an easily interpretable compact form. The following two sections will demonstrate multiple uses of these two basic tools.
3 Objective Priors Via Maximization of the Distance Between the Prior and the Posterior
3.1 Reference Priors
We begin with an alternate derivation of the reference prior of Bernardo. Following Lindley (1956), Bernardo (1979) suggested a Kullback–Leibler (K–L) divergence between the prior and the posterior, namely, $E[\log\{\pi(\theta|X)/\pi(\theta)\}]$, where expectation is taken over the joint distribution of $X$ and $\theta$. The target is to find a prior $\pi$ which maximizes the above distance. It is shown in Berger and Bernardo (1989) that if one does this maximization for a fixed $n$, this may lead to a discrete prior with finitely many jumps, a far cry from a diffuse prior. Hence, one needs an asymptotic maximization.
First write $E[\log\{\pi(\theta|X)/\pi(\theta)\}]$ as
\[
R(\pi)=\int\!\!\int\log\frac{\pi(\theta|X)}{\pi(\theta)}\,\pi(\theta|X)\,m^{\pi}(X)\,dX\,d\theta
=\int\!\!\int\log\frac{\pi(\theta|X)}{\pi(\theta)}\,L_n(\theta)\pi(\theta)\,dX\,d\theta,
\]
where $X=(X_1,\ldots,X_n)$, $L_n(\theta)=\prod_{i=1}^n f(X_i|\theta)$, the likelihood function, and $m^{\pi}(X)$ denotes the marginal of $X$ after integrating out $\theta$. The integrations are carried out with respect to a prior $\pi$ having a compact support, and subsequently passing on to the limit as and when necessary.
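Without any nuisance parameters, Bernardo (1979) showed somewhat heuristically that Jeffreys’ prior achieves the necessary maximization. A more rigorous proof was supplied later by Clarke and Barron (1990, 1994). We demonstrate heuristically how the shrinkage argument can also lead to the reference priors derived in Bernardo (1979). To this end, we first consider the one-parameter case for a regular family of distributions. We rewrite the expected K–L divergence $R(\pi)=E[\log\{\pi(\theta|X)/\pi(\theta)\}]$ as
\[
R(\pi)=\int\pi(\theta)E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\bigg|\theta\biggr]\,d\theta.
\]
Next we write
\[
E^{\bar\pi}\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\bigg|X\biggr]=E^{\bar\pi}[\log\pi(\theta|X)|X]-E^{\bar\pi}[\log\pi(\theta)|X].
\]
From the asymptotic expansion of the posterior, one gets
\[
\log\pi(\theta|X)=\tfrac12\log n-\tfrac12\log(2\pi)-\tfrac12 n(\theta-\hat\theta_n)^2\hat I_n+\tfrac12\log\hat I_n+O_p(n^{-1/2}).
\]
Since $n(\theta-\hat\theta_n)^2\hat I_n$ converges a posteriori to a $\chi^2_1$ distribution as $n\to\infty$, irrespective of a prior $\pi$, by the Bernstein–von Mises and Slutsky’s theorems, one gets
\[
E^{\bar\pi}[\log\pi(\theta|X)|X]=\tfrac12\log n-\tfrac12\log(2\pi e)+\tfrac12\log\hat I_n+O_p(n^{-1/2}).
\]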
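Combining the expansions above through the shrinkage argument gives $E[\log\{\pi(\theta|X)/\pi(\theta)\}]=\tfrac12\log\{n/(2\pi e)\}+\int\log\{I^{1/2}(\theta)/\pi(\theta)\}\pi(\theta)\,d\theta+O(n^{-1/2})$, where $I(\theta)$ is the per unit Fisher information number. Considering only the leading terms, one needs to find a prior $\pi$ which maximizes $\int\log\{I^{1/2}(\theta)/\pi(\theta)\}\pi(\theta)\,d\theta$. The integral being nonpositive due to the property of the Kullback–Leibler information number, its maximum value is zero, which is attained for $\pi(\theta)=I^{1/2}(\theta)$, leading once again to Jeffreys’ prior.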
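The maximization can be illustrated numerically. The sketch below is an assumption-laden toy check, not part of the original argument: it takes the Bernoulli model, where the per unit Fisher information is $I(p)=1/\{p(1-p)\}$, restricts attention to the symmetric Beta$(a,a)$ family of priors, and evaluates the leading-term functional $\int\pi(p)\log\{I^{1/2}(p)/\pi(p)\}\,dp$ by a midpoint rule after the substitution $p=\sin^2 u$. The function name `leading_term` and the grid of values for $a$ are hypothetical. Because $I^{1/2}$ is left unnormalized here, the largest attainable value is $\log\int_0^1 I^{1/2}(p)\,dp=\log\pi$ rather than zero, with $\pi$ in this last expression denoting the number 3.14159….

```python
import math


def leading_term(a, grid=50_000):
    """Midpoint-rule value of the integral of pi(p) * log(I(p)^(1/2) / pi(p))
    for the Beta(a, a) prior, using the substitution p = sin(u)^2."""
    log_beta = 2 * math.lgamma(a) - math.lgamma(2 * a)  # log B(a, a)
    width = (math.pi / 2) / grid
    total = 0.0
    for j in range(grid):
        u = (j + 0.5) * width
        sc = math.sin(u) * math.cos(u)                    # equals sqrt(p (1 - p))
        # Prior density times dp/du: 2 * sc^(2a - 1) / B(a, a).
        density = 2.0 * sc ** (2 * a - 1) / math.exp(log_beta)
        # log of I(p)^(1/2) / pi(p) = log B(a, a) - (2a - 1) log sc.
        log_ratio = log_beta - (2 * a - 1) * math.log(sc)
        total += density * log_ratio * width
    return total


values = {a: leading_term(a) for a in (0.35, 0.5, 1.0, 2.0)}
best = max(values, key=values.get)
for a, value in values.items():
    print(f"a = {a}: functional = {value:.4f}")
print("largest value at a =", best)
```

Within this family the largest value occurs at $a=1/2$, which is Jeffreys’ prior for the Bernoulli model; the uniform prior ($a=1$) gives the smaller value 1.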
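The multiparameter generalization of the above result without any nuisance parameters is based on the asymptotic expansion
\[
E\biggl[\log\frac{\pi(\theta|X)}{\pi(\theta)}\biggr]=\frac p2\log\frac{n}{2\pi e}+\int\pi(\theta)\log\frac{|I(\theta)|^{1/2}}{\pi(\theta)}\,d\theta+O(n^{-1/2}),
\]
and maximization of the leading term yields once again Jeffreys’ general rule prior $\pi(\theta)=|I(\theta)|^{1/2}$.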
In the presence of nuisance parameters, however, Jeffreys’ general rule prior is no longer the distance maximizer. We will demonstrate this in the case when the parameter vector is split into two groups, one group consisting of the parameters of interest, and the other involving the nuisance parameters. In particular, Bernardo’s (1979) two-group reference prior will be included as a special case.
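To this end, suppose $\theta=(\theta_1,\theta_2)$, where $\theta_1$ ($p_1\times1$) is the parameter of interest and $\theta_2$ ($p_2\times1$) is the nuisance parameter. We partition the Fisher information matrix as
\[
I(\theta)=\begin{pmatrix}I_{11}(\theta)&I_{12}(\theta)\\ I_{21}(\theta)&I_{22}(\theta)\end{pmatrix}.
\]
First begin with a general conditional prior $\pi(\theta_2|\theta_1)=\phi(\theta_1,\theta_2)$ (say). Bernardo (1979) considered $\phi(\theta_1,\theta_2)=|I_{22}(\theta)|^{1/2}$. The marginal prior $\pi(\theta_1)$ for $\theta_1$ is then obtained by maximizing the distance $E[\log\{\pi(\theta_1|X)/\pi(\theta_1)\}]$. We begin by writing
\[
\frac{\pi(\theta_1|X)}{\pi(\theta_1)}=\frac{\pi(\theta_1,\theta_2|X)/\pi(\theta_1,\theta_2)}{\pi(\theta_2|\theta_1,X)/\pi(\theta_2|\theta_1)}.
\]
Writing $\pi(\theta_1,\theta_2)=\pi(\theta_1)\phi(\theta_1,\theta_2)$ and $|I(\theta)|=|I_{22}(\theta)||I_{11.2}(\theta)|$, where $I_{11.2}(\theta)=I_{11}(\theta)-I_{12}(\theta)I_{22}^{-1}(\theta)I_{21}(\theta)$, the asymptotic expansion and the shrinkage argument together yield
\[
E\biggl[\log\frac{\pi(\theta_1|X)}{\pi(\theta_1)}\biggr]=\frac{p_1}{2}\log\frac{n}{2\pi e}+\int\pi(\theta_1)\biggl\{\int\phi(\theta_1,\theta_2)\log\frac{|I_{11.2}(\theta)|^{1/2}}{\pi(\theta_1)}\,d\theta_2\biggr\}\,d\theta_1+O(n^{-1/2}).
\]
Writing $\log\psi(\theta_1)=\int\phi(\theta_1,\theta_2)\log|I_{11.2}(\theta)|^{1/2}\,d\theta_2$, once again by property of the Kullback–Leibler information number, it follows that the maximizing prior $\pi(\theta_1)=\psi(\theta_1)$.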
We have purposely not set limits for these integrals. An important point to note [as pointed out in Berger and Bernardo (1989)] is that evaluation of all these integrals is carried out over an increasing sequence of compact sets $K_i$ whose union is the entire parameter space. This is because most often we are working with improper priors, and direct evaluation of these integrals over the entire parameter space will simply give $\infty$, which does not help in finding any prior. As an illustration, if the parameter space is $\mathbb{R}\times(0,\infty)$, as is typically the case for the location–scale family of distributions, then one can take the increasing sequence of compact sets as $K_i=[-i,i]\times[i^{-1},i]$, $i\ge2$. All the proofs are usually carried out by taking a sequence of priors $\pi_i$ with compact support $K_i$, and eventually making $i\to\infty$. This important point should be borne in mind in the actual derivation of reference priors. We will now illustrate this for the location–scale family of distributions when one of the two parameters is the parameter of interest, while the other one is the nuisance parameter.
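Example (Location–scale models). Suppose $X_1,\ldots,X_n$ are i.i.d. with common p.d.f. $\sigma^{-1}f((x-\mu)/\sigma)$, where $\mu\in(-\infty,\infty)$ and $\sigma\in(0,\infty)$. Consider the sequence of priors $\pi_i$ with support $[-i,i]\times[i^{-1},i]$, $i=2,3,\ldots.$ We may note that $I(\mu,\sigma)=\sigma^{-2}\bigl(\begin{smallmatrix}c_1&c_2\\ c_2&c_3\end{smallmatrix}\bigr)$, where the constants $c_1$, $c_2$ and $c_3$ are functions of $f$ and do not involve either $\mu$ or $\sigma$. So, if $\mu$ is the parameter of interest, and $\sigma$ is the nuisance parameter, following Bernardo’s (1979) prescription, one begins with the sequence of priors $\pi_{i2}(\sigma|\mu)=k_{i2}\sigma^{-1}$ where, solving $1=k_{i2}\int_{i^{-1}}^{i}\sigma^{-1}\,d\sigma$, one gets $k_{i2}=(2\log i)^{-1}$. Next one finds the prior $\pi_{i1}(\mu)=k_{i1}\exp[\int_{i^{-1}}^{i}k_{i2}\sigma^{-1}\log(\sigma^{-1})\,d\sigma]$, which is a constant not depending on either $\mu$ or $\sigma$. Hence, the resulting joint prior $\pi_i(\mu,\sigma)=\pi_{i1}(\mu)\pi_{i2}(\sigma|\mu)\propto\sigma^{-1}$, which is the desired reference prior. Incidentally, this is Jeffreys’ independence prior rather than Jeffreys’ general rule prior, the latter being proportional to $\sigma^{-2}$. Conversely, when $\sigma$ is the parameter of interest and $\mu$ is the nuisance parameter, one begins with $\pi_{i2}(\mu|\sigma)=(2i)^{-1}$ and then, following Bernardo (1979) again, one finds $\pi_{i1}(\sigma)\propto\exp[\int_{-i}^{i}(2i)^{-1}\log(\sigma^{-1})\,d\mu]=\sigma^{-1}$. Thus, once again one gets Jeffreys’ independence prior. We will see in Section 5 that Jeffreys’ independence prior is a right Haar prior, while Jeffreys’ general rule prior is a left Haar prior for the location–scale family of distributions.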
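Example (Noncentrality parameter). Let $X_1,\ldots,X_n$ be i.i.d. N($\mu,\sigma^2$), where $\mu$ real and $\sigma^2$ $(>0)$ are both unknown. Suppose the parameter of interest is $\theta=\mu/\sigma$, the noncentrality parameter. With the reparameterization from $(\mu,\sigma)$ to $(\theta,\sigma)$, the likelihood is rewritten as $L(\theta,\sigma)\propto\sigma^{-n}\exp[-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i-\theta\sigma)^2]$. Then the per observation Fisher information matrix is given by $I(\theta,\sigma)=\bigl(\begin{smallmatrix}1&\theta/\sigma\\ \theta/\sigma&(\theta^2+2)/\sigma^2\end{smallmatrix}\bigr)$. Consider once again the sequence of priors $\pi_i$ with support $[-i,i]\times[i^{-1},i]$, $i=2,3,\ldots.$ Again, following Bernardo, $\pi_{i2}(\sigma|\theta)=k_{i2}\sigma^{-1}$, where $k_{i2}=(2\log i)^{-1}$. Noting that $I_{11.2}(\theta,\sigma)=1-\theta^2/(\theta^2+2)=2/(\theta^2+2)$, one gets $\pi_{i1}(\theta)=k_{i1}\exp[\int_{i^{-1}}^{i}k_{i2}\sigma^{-1}\log\{\sqrt2/(\theta^2+2)^{1/2}\}\,d\sigma]\propto(\theta^2+2)^{-1/2}$. Hence, the reference prior in this example is given by $\pi_R(\theta,\sigma)\propto(\theta^2+2)^{-1/2}\sigma^{-1}$. Due to its invariance property (Datta and Ghosh, 1996), in the original $(\mu,\sigma)$ parameterization, the two-group reference prior turns out to be $\pi_R(\mu,\sigma)\propto\sigma^{-1}(\mu^2+2\sigma^2)^{-1/2}$.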
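Things simplify considerably if $\theta_1$ and $\theta_2$ are orthogonal in the Fisherian sense, namely, $I_{12}(\theta)=0$ (Huzurbazar, 1950; Cox and Reid, 1987). Then if $I_{11}(\theta)$ and $I_{22}(\theta)$ factor respectively as $h_{11}(\theta_1)h_{12}(\theta_2)$ and $h_{21}(\theta_1)h_{22}(\theta_2)$, as a special case of a more general result of Datta and Ghosh (1995c), it follows that the two-group reference prior is given by $h_{11}^{1/2}(\theta_1)h_{22}^{1/2}(\theta_2)$.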
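As an illustration of the above, consider the celebrated Neyman–Scott problem (Berger and Bernardo, 1992a, 1992b). Consider a fixed effects one-way balanced normal ANOVA model where the number of observations per cell is fixed, but the number of cells grows to infinity. In symbols, let $X_{i1},\ldots,X_{ik}|\theta_i$ be mutually independent N($\theta_i,\sigma^2$), $k\ge2$, $i=1,\ldots,n$, all parameters being assumed unknown. Let $S=\sum_{i=1}^n\sum_{j=1}^k(X_{ij}-\bar X_i)^2/\{n(k-1)\}$. Then the MLE of $\sigma^2$ is given by $(k-1)S/k$ which converges in probability [as $n\to\infty$] to $(k-1)\sigma^2/k$, and hence is inconsistent. Interestingly, Jeffreys’ prior in this case also produces an inconsistent estimator of $\sigma^2$, but the Berger–Bernardo reference prior does not.
To see this, we begin with the Fisher information matrix $I(\theta_1,\ldots,\theta_n,\sigma^2)=k\operatorname{Diag}(\sigma^{-2},\ldots,\sigma^{-2},\frac12n\sigma^{-4})$. Hence, Jeffreys’ prior $\pi_J(\theta_1,\ldots,\theta_n,\sigma^2)\propto(\sigma^2)^{-n/2-1}$ which leads to the marginal posterior $\pi_J(\sigma^2|X)\propto(\sigma^2)^{-nk/2-1}\exp[-n(k-1)S/(2\sigma^2)]$ of $\sigma^2$, $X$ denoting the entire data set. Then the posterior mean of $\sigma^2$ is given by $n(k-1)S/(nk-2)$, while the posterior mode is given by $n(k-1)S/(nk+2)$. Both are inconsistent estimators of $\sigma^2$, as these converge in probability to $(k-1)\sigma^2/k$ as $n\to\infty$.
In contrast, by the result of Datta and Ghosh (1995c), the two-group reference prior $\pi_R(\theta_1,\ldots,\theta_n,\sigma^2)\propto(\sigma^2)^{-1}$. This leads to the marginal posterior $\pi_R(\sigma^2|X)\propto(\sigma^2)^{-n(k-1)/2-1}\exp[-n(k-1)S/(2\sigma^2)]$ of $\sigma^2$. Now the posterior mean is given by $n(k-1)S/\{n(k-1)-2\}$, while the posterior mode is given by $n(k-1)S/\{n(k-1)+2\}$. Both are consistent estimators of $\sigma^2$.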
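The contrast between the two priors in the Neyman–Scott problem shows up clearly in a simulation. The sketch below is illustrative and rests on stated assumptions: $n$ cells with $k$ normal observations each, common variance $\sigma^2$, and posterior means taken from the inverse-gamma marginal posteriors of $\sigma^2$, namely the within-cell sum of squares divided by $nk-2$ under Jeffreys’ general rule prior and by $n(k-1)-2$ under the reference prior proportional to $(\sigma^2)^{-1}$. The function name, the seed, and the way the cell means are drawn are hypothetical choices.

```python
import random


def neyman_scott_estimates(n_cells=5000, k=2, sigma2=1.0, seed=1):
    """Simulate the balanced one-way layout and return three estimates of sigma^2."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    within_ss = 0.0
    for _ in range(n_cells):
        centre = rng.uniform(-5.0, 5.0)      # arbitrary cell mean; its value is irrelevant
        cell = [rng.gauss(centre, sd) for _ in range(k)]
        mean = sum(cell) / k
        within_ss += sum((x - mean) ** 2 for x in cell)
    dof = n_cells * (k - 1)
    return {
        "mle": within_ss / (n_cells * k),
        "jeffreys_mean": within_ss / (n_cells * k - 2),   # assumed posterior mean, Jeffreys
        "reference_mean": within_ss / (dof - 2),          # assumed posterior mean, reference
    }


estimates = neyman_scott_estimates()
for name, value in estimates.items():
    print(f"{name}: {value:.3f}")
```

With $k=2$ and $\sigma^2=1$, the MLE and the posterior mean under Jeffreys’ prior both settle near $(k-1)\sigma^2/k=0.5$, while the posterior mean under the reference prior settles near the true value 1.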
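Example (Ratio of normal means). Let $X_1$ and $X_2$ be two independent N($\theta\mu,1$) and N($\mu,1$) random variables, where the parameter of interest is $\theta$. This is the celebrated Fieller–Creasy problem. The Fisher information matrix in this case is $I(\theta,\mu)=\bigl(\begin{smallmatrix}\mu^2&\mu\theta\\ \mu\theta&1+\theta^2\end{smallmatrix}\bigr)$. With the transformation $\phi=\mu(1+\theta^2)^{1/2}$, one obtains $I(\theta,\phi)=\operatorname{Diag}(\phi^2(1+\theta^2)^{-2},1)$. Again, by Datta and Ghosh (1995c), the two-group reference prior $\pi_R(\theta,\phi)\propto(1+\theta^2)^{-1}$.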
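The orthogonalizing effect of the transformation in the Fieller–Creasy problem can be verified numerically. The following sketch assumes the model with independent observations N($\theta\mu,1$) and N($\mu,1$) and the reparameterization $\phi=\mu(1+\theta^2)^{1/2}$; for a normal model with identity covariance, the Fisher information equals $D^TD$, where $D$ is the Jacobian of the mean vector with respect to the parameters. The helper names and the evaluation point are hypothetical.

```python
import math


def mean_vector(theta, phi):
    """Mean of (X1, X2) in the (theta, phi) parameterization."""
    mu = phi / math.sqrt(1.0 + theta ** 2)
    return (theta * mu, mu)


def fisher_information(theta, phi, step=1e-6):
    """Information matrix D^T D, with the Jacobian D found by central differences."""
    def partial(index):
        up, down = [theta, phi], [theta, phi]
        up[index] += step
        down[index] -= step
        hi, lo = mean_vector(*up), mean_vector(*down)
        return [(h - l) / (2 * step) for h, l in zip(hi, lo)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    d_theta, d_phi = partial(0), partial(1)
    return [[dot(d_theta, d_theta), dot(d_theta, d_phi)],
            [dot(d_phi, d_theta), dot(d_phi, d_phi)]]


theta, phi = 0.7, 2.5                       # arbitrary evaluation point
info = fisher_information(theta, phi)
expected_top_left = phi ** 2 / (1.0 + theta ** 2) ** 2
print(info)
print(expected_top_left)
```

The off-diagonal entries vanish up to numerical error, the $\phi$ entry equals 1, and the $\theta$ entry equals $\phi^2(1+\theta^2)^{-2}$, whose $\theta$-dependent factor has square root $(1+\theta^2)^{-1}$.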
Example (Random effects model). This example has been visited and revisited on several occasions. Berger and Bernardo (1992b) first found reference priors for variance components in this problem when the number of observations per cell is the same. Later, Ye (1994) and Datta and Ghosh (1995c, 1995d) also found reference priors for this problem. The case involving unequal numbers of observations per cell was considered by Chaloner (1987) and Datta, Ghosh and Kim (2002).
For simplicity, we consider here only the case with equal number of observations per cell. Let , . Here is an unknown parameter, while ’s and are mutually independent with ’s i.i.d. N() and i.i.d. N(). The parameters , and are all unknown. We write , , and . The minimal sufficient statistic is (, where and .
The different parameters of interest that we consider are , and . The common mean is of great relevance in meta analysis (cf. Morris and Normand, 1992). Ye (1994) pointed out that the variance ratio is of considerable interest in genetic studies. The parameter is also of importance to animal breeders, psychologists and others. Datta and Ghosh (1995d) have discussed the importance of , the error variance. In order to find reference priors for each one of these parameters, we first make the one-to-one transformation from to , where and . Thus, , and the likelihood can be expressed as
Then the Fisher information matrix simplifies to . From Theorem 1 of Datta and Ghosh (1995c), it follows now that when , and are the respective parameters of interest, while the other two are nuisance parameters, the reference priors are given respectively by , and .
3.2 General Divergence Priors
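Next, returning to the one-parameter case, we consider the more general divergence
\[
D^{\pi}=\biggl[1-\int\!\!\int\biggl\{\frac{\pi(\theta)}{\pi(\theta|X)}\biggr\}^{\beta}\pi(\theta|X)\,m^{\pi}(X)\,dX\,d\theta\biggr]\bigg/\{\beta(1-\beta)\},\qquad\beta<1,
\]
which is to be interpreted as its limit when $\beta\to0$. This limit is the K–L distance as considered in Bernardo (1979). Also, $\beta=1/2$ gives the Bhattacharyya–Hellinger (Bhattacharyya, 1943; Hellinger, 1909) distance, and $\beta=-1$ leads to the chi-square distance (Clarke and Sun, 1997, 1999). In order to maximize $D^{\pi}$ with respect to a prior $\pi$, one re-expresses $D^{\pi}$ as
\[
D^{\pi}=\biggl[1-\int\pi^{\beta+1}(\theta)E[\pi^{-\beta}(\theta|X)|\theta]\,d\theta\biggr]\bigg/\{\beta(1-\beta)\}.
\]
Hence, maximization of $D^{\pi}$ amounts to minimization (maximization) of
\[
\int\pi^{\beta+1}(\theta)E[\pi^{-\beta}(\theta|X)|\theta]\,d\theta
\]
for $0<\beta<1$ ($\beta<0$). First consider the case $0<|\beta|<1$. From Theorem 1, the posterior of $\theta$ is
\[
\pi(\theta|X)=\frac{\sqrt{n}\,\hat I_n^{1/2}}{(2\pi)^{1/2}}\exp\biggl[-\frac n2(\theta-\hat\theta_n)^2\hat I_n\biggr][1+O_p(n^{-1/2})].
\]
Following the shrinkage argument, and noting that conditional on $\theta$, $\sqrt{n}(\hat\theta_n-\theta)I^{1/2}(\theta)$ converges in distribution to N(0, 1), while $\hat I_n$ converges in probability to $I(\theta)$, it follows heuristically that
\[
E[\pi^{-\beta}(\theta|X)|\theta]=\biggl(\frac{2\pi}{n}\biggr)^{\beta/2}[I(\theta)]^{-\beta/2}(1-\beta)^{-1/2}[1+O(n^{-1/2})].
\]
Hence, considering only the leading term, for $0<\beta<1$, minimization of the integral above with respect to $\pi$ amounts to minimization of $\int\pi^{\beta+1}(\theta)[I(\theta)]^{-\beta/2}\,d\theta$ with respect to $\pi$ subject to $\int\pi(\theta)\,d\theta=1$. A simple application of Hölder’s inequality shows that this minimization takes place when $\pi(\theta)\propto I^{1/2}(\theta)$. Similarly, for $-1<\beta<0$, $\pi(\theta)\propto I^{1/2}(\theta)$ provides the desired maximization of the expected distance between the prior and the posterior. The K–L distance, that is, the case $\beta\to0$, has already been considered earlier.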
The leading-term expansion of $E[\pi^{-\beta}(\theta|X)|\theta]$ also holds for $\beta<-1$. However, in this case, it is shown in Ghosh, Mergel and Liu (2011) that the integral $\int\pi^{\beta+1}(\theta)[I(\theta)]^{-\beta/2}\,d\theta$ is uniquely minimized with respect to $\pi$, and there exists no maximizer of this integral when $\int\pi(\theta)\,d\theta=1$. Thus, in this case, there does not exist any prior which maximizes the posterior-prior distance.
Surprisingly, Jeffreys’ prior is not necessarily the solution when (the chi-square divergence). In this case, the first-order asymptotics does not work since for all . However, retaining also the term as given in Theorem 1, Ghosh, Mergel and Liu (2011) have found in this case the solution , where . We shall refer to this prior as . We will show by examples that this prior may differ from Jeffreys’prior. But first we will establish a hitherto unknown invariance property of this prior under one-to-one reparameterization.
Suppose that is a one-to-one twice differentiable function of . Then , where , the constant of proportionality, does not involve any parameters.
Without loss of generality, assume that is a nondecreasing function of . By the identity