Bayesian computational methodsChristian P. Robert Université Paris-Dauphine, CEREMADE, and CREST, INSEE, Paris

Bayesian computational methods
Christian P. Robert Université Paris-Dauphine, CEREMADE, and CREST, INSEE, Paris

0.1 Introduction

If, in the mid 1980’s, one had asked the average statistician about the difficulties of using Bayesian Statistics, the most likely answer would have been “Well, there is this problem of selecting a prior distribution and then, even if one agrees on the prior, the whole Bayesian inference is simply impossible to implement in practice!” The same question asked in the 21th Century does not produce the same reply, but rather a much less aggressive complaint about the lack of generic software (besides winBUGS), along with the renewed worry of subjectively selecting a prior! The last 20 years have indeed witnessed a tremendous change in the way Bayesian Statistics are perceived, both by mathematical statisticians and by applied statisticians and the impetus behind this change has been a prodigious leap-forward in the computational abilities. The availability of very powerful approximation methods has correlatively freed Bayesian modelling, in terms of both model scope and prior modelling. This opening has induced many more scientists from outside the statistics community to opt for a Bayesian perspective as they can now handle those tools on their own. As discussed below, a most successful illustration of this gained freedom can be seen in Bayesian model choice, which was only emerging at the beginning of the MCMC era, for lack of appropriate computational tools.

In this chapter, we will first present the most standard computational challenges met in Bayesian Statistics (Section 0.2), and then relate these problems with computational solutions. Of course, this chapter is only a terse introduction to the problems and solutions related to Bayesian computations. For more complete references, see Robert and Casella (2004), Marin and Robert (2007a), Robert and Casella (2004) and Liu (2001), among others. We also restrain from providing an introduction to Bayesian Statistics per se and for comprehensive coverage, address the reader to Marin and Robert (2007a) and Robert (2007), (again) among others.

0.2 Bayesian computational challenges

Bayesian Statistics being a complete inferential methodology, its scope encompasses the whole range of standard statistician inference (and design), from point estimation to testing, to model selection, and to non-parametrics. In principle, once a prior distribution has been chosen on the proper space, the whole inferential machinery is set and the computation of estimators is usually automatically derived from this setup. Obviously, the practical or numerical derivation of these procedures may be exceedingly difficult or even impossible, as we will see in a few selected examples. Before, we proceed with an incomplete typology of the categories and difficulties met by Bayesian inference. First, let us point out that computational difficulties may originate from one or several of the following items:

  1. use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;

  2. use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;

  3. use of a huge dataset;

  4. use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);

  5. use of a complex inferential procedure.

0.2.1 Bayesian point estimation

In a formalised representation of Bayesian inference, the statistician is given (or selects) a triplet

  • a sampling distribution, , usually associated with an observation (or a sample) ;

  • a prior distribution , defined on the parameter space ;

  • a loss function that compares the decisions (or estimations) for the true value of the parameter.

Using and an observation , the Bayesian inference is always given as the solution to the minimisation programme

equivalent to the minimisation programme

The corresponding procedure is thus associated, for every , to the solution of the above programme (see, e.g. Robert, 2007, Chap. 2).

There are therefore two levels of computational difficulties with this resolution: first the above integral must be computed. Second, it must be minimised in . For the most standard losses, like the traditional squared error loss,

the solution to the minimisation problem is universally11endnote: 1In this chapter, the denomination universal is used in the sense of uniformly over all distributions. known. For instance, for the squared error loss, it is the posterior mean,

which still requires the computation of both integrals and thus whose complexity depends on the complexity of , , and .

Example 1

For a normal distribution , the use of a so-called conjugate prior (see, e.g., Robert, 2007, Chap. 3)

leads to a closed form expression for the mean, since

On the other hand, if we use instead a more involved prior distribution like a poly- distribution (Bauwens and Richard, 1985),

the above integrals cannot be computed in closed form anymore. This is not a toy example in that the problem may occur after a sequence of Student’s observations, or with a sequence of normal observations whose variance is unknown.

The above example is one-dimensional, but, obviously, bigger challenges await the Bayesian statistician when she wants to tackle high-dimensional problems.

Example 2

In a generalised linear model, a conditional distribution of given is defined via a density from an exponential family

whose natural parameter depends on the conditioning variable ,

that is, linearly modulo the transform . Obviously, in practical applications like Econometrics, can be quite large. Inference on (which is the true parameter of the model) proceeds through the posterior distribution (where and )

which rarely is available in closed form. In addition, in some cases may be costly simply to compute and in others may be large or even very large. Take for instance the case of the dataset processed by Abowd et al. (1999), which covers twenty years of employment histories for over a million workers, with including indicator variables for over one hundred thousand companies.

Complexity starts sharply increasing if we introduce in addition random effects to the model, that is writing as , where is a random perturbation indexed by . For instance, in the above employment dataset, it may correspond to a worker effect or to a company effect. The difficulty is that those random variables can very rarely be integrated out into a closed-form marginal distribution. They must therefore be included within the model parameter, which then increases its dimension severalfold.

A related, although conceptually different, inferential issue concentrates upon prediction, that is, the approximation of a distribution related with the parameter of interest, say , based on the observation of . The predictive distribution is then defined as

A first difference with the standard point estimation perspective is obviously that the parameter vanishes through the integration. A second and more profound difference is that this parameter is not necessarily well-defined anymore. As will become clearer in a following Section, this is a paramount feature in setups where the model is not well-defined and where the statistician hesitates between several (or even an infinity of) models. It is also a case where the standard notion of identifiability is irrelevant, which paradoxically is a ”plus” from the computational point of view, as seen below in, e.g., Example 13.

Example 3

Recall that an model is given as the auto-regressive representation of a time series,

It is often the case that the order of the model is not fixed a priori, but has to be determined from the data itself. Several models are then competing for the “best” fit of the data, but if the prediction of the next value is the most important part of the inference, the order chosen for the best fit is not really relevant. Therefore, all models can be considered in parallel and aggregated through the predictive distribution

which thus amounts to integrating over the parameters of all models, simultaneously:

Note the multiple layers of complexity in this case:

  1. if the prior distribution on has an infinite support, the integral simultaneously considers an infinity of models, with parameters of unbounded dimensions;

  2. the parameter varies from model to model , so must be evaluated differently from one model to another. In particular, if the stationarity constraint usually imposed in these models is taken into account, the constraint on varies22endnote: 2To impose the stationarity constraint when the order of the model varies, it is necessary to reparameterise this model in terms of either the partial autocorrelations or of the roots of the associated lag polynomial. (See, e.g., Robert, 2007, Section 4.5.) between model and model ;

  3. prediction is usually used sequentially: every tick/second/hour/day, the next value is predicted based on the past values . Therefore when moves to , the entire posterior distribution must be re-evaluated again, possibly with a very tight time constraint as for instance in financial or radar tracking applications.

We will discuss this important problem in deeper details after the testing section, as part of the model selection problematic.

0.2.2 Testing hypotheses

A domain where both the philosophy and the implementation of Bayesian inference are at complete odds with the classical approach is the area of testing of hypotheses. At a primary level, this is obvious when opposing the Bayesian evaluation of an hypothesis

with a Neyman–Pearson -value

where is an appropriate statistic, with observed value . The first quantity involves an integral over the parameter space, while the second provides an evaluation over the observational space. At a secondary level, the two answers may also strongly disagree even when the number of observations goes to infinity, although there exist cases and priors for which they agree to the order or even . (See Robert, 2007, Section 3.5.5 and Chapter 5, for more details.)

From a computational point of view, most Bayesian evaluations of the relevance of an hypothesis—also called the evidence—given a sample involve marginal distributions

(1)

where and denote the parameter space and the corresponding prior, respectively, under hypothesis . For instance, the Bayes factor is defined as the ratio of the posterior probabilities of the null and the alternative hypotheses over the ratio of the prior probabilities of the null and the alternative hypotheses, i.e.,

This quantity is instrumental in the computation of the posterior probability

under equal prior probabilities for both and . It is also the central tool in practical (as opposed to decisional) Bayesian testing (Jeffreys, 1961) and can be seen as the Bayesian equivalent of the likelihood ratio.

The first ratio in is then the ratio of integrals of the form (1) and it is rather common to face difficulties in the computation of both integrals.33endnote: 3In this presentation of Bayes factors, we completely bypass the methodological difficulty of defining when is of measure for the original prior and refer the reader to Robert (2007, Section 5.2.3) and Marin and Robert (2007, Section 2.3.2) for proper coverage of this issue.

Example 4 (Continuation of Example 2)

In the case of the generalised linear model, a standard testing situation is to decide whether or not a factor, say, is influential on the dependent variable . This is often translated as testing whether or not the corresponding component of , , is equal to , i.e. . If we denote by the other components of , the Bayes factor for this hypothesis will be

when is the prior constructed for the null hypothesis and when the prior weights of and of the alternative are both equal to . Obviously, besides the normal conjugate case, both integrals cannot be computed in a closed form.

In a related manner, confidence regions are also mostly intractable, being defined through the solution to an implicit equation. Indeed, the Bayesian confidence region for a parameter is defined as the highest posterior region,

(2)

where is determined by the coverage constraint

being the confidence level. While the normalising constant is not necessary to construct a confidence region, the resolution of the implicit equation (2) is rarely straightforward! However, simulation-based equivalents generally produce acceptable approximations in a straightforward manner when both the prior and the likelihood can be numerically computed.

Example 5

Consider a binomial observation with a conjugate prior distribution, . In this case, the posterior distribution is available in closed form,

However, the determination of the ’s such that

with

is not possible analytically. It actually implies two levels of numerical difficulties:

  1. find the solution(s) to ,

  2. find the corresponding to the right coverage,

and each value of examined in step 2. requires a new resolution of step 1. However, can also be interpreted as the quantile of the random variable , hence derived from a large sample of ’s in a Monte Carlo perspective.

The setting is usually much more complex when is a multidimensional parameter, because the interest is usually in getting marginal confidence sets. Example 2 is an illustration of this setting: deriving a confidence region on one component, say, first involves computing the marginal posterior distribution of this component. As in Example 4, the integral

which is proportional to , is most often intractable. Fortunately, the simulation approximations mentioned above are also available to bypass this integral computation.

0.2.3 Model choice

Although they are essentially identical from a conceptual viewpoint, we do distinguish here between model choice and testing, partly because the former leads to further computational difficulties, and partly because it encompasses a larger scope of inferential goals than mere testing. Note first that model choice has been the subject of considerable effort in the past decades, and has seen many advances, including the coverage of problems of higher complexity and the introduction of new concepts. We stress that such advances mostly owe to the introduction of new computational methods.

As discussed in further details in Robert (2007, Chapter 7), the inferential action related with model choice does take place on a wider scale than simple testing: it covers and compares models, rather than parameters, which makes the sampling distribution “more unknown” than simply depending on an undetermined parameter. In some respect, it is thus closer to estimation than to regular testing. In any case, it requires a more precise evaluation of the consequences of choosing the “wrong” model or, equivalently of deciding which model is the most appropriate for the data at hand. It is thus both broader and less definitive than deciding whether is true. At last, the larger inferential scope mentioned in the first point means that we are leaving for a while the well-charted domain of solid parametric models.

From a computational point of view, model choice involves more complex structures that, almost systematically, require advanced tools, like simulation methods which can handle collections of parameter spaces (also called spaces of varying dimensions), specially designed for model comparison.

Example 6

A mixture of distributions is the representation of a distribution (density) as the weighted sum of standard distributions (densities). For instance, a mixture of Poisson distributions, denoted as

has the following density:

This representation of distributions is multi-faceted and can be used in populations with known heterogeneities (in which case a component of the mixture corresponds to an homogeneous part of the population) as well as a non-parametric modelling of unknown populations. This means that, in some cases, is known and, in others, it is both unknown and part of the inferential problem.

First, consider the setting where several (parametric) models are in competition,

the index set being possibly infinite. From a Bayesian point of view, a prior distribution must be constructed for each model as if it were the only and true model under consideration since, in most perspectives except model averaging, only one of these models will be selected and used as the only and true model. The parameter space associated with the above set of models can be written as

(3)

the model indicator being now part of the parameters. So, if the modeller allocates probabilities to the indicator values, that is, to the models , and if she then defines priors on the parameter subspaces , things fold over by virtue of Bayes’s theorem, since one can compute

While a common solution based on this prior modelling is simply to take the (marginal) MAP estimator of , that is, to determine the model with the largest , or even to use directly the average

as a predictive density in in model averaging, a deeper-decision theoretic evaluation is often necessary.

Example 7 (Continuation of Example 3)

In the setting of the models, when the order of the dependence is unknown, model averaging as presented in Example 3 is not always a relevant solution when the statistician wants to estimate this order for different purposes. Estimation is then a more appropriate perspective than testing, even though care must be taken because of the discrete nature of . (For instance, the posterior expectation of is not an appropriate estimator!)

As stressed earlier in this Section, the computation of predictive densities, marginals, Bayes factors, and other quantities related to the model choice procedures is generally very involved, with specificities that call for tailor-made solutions:

  • The computation of integrals is increased by a factor corresponding to the number of models under consideration.

  • Some parameter spaces are infinite-dimensional, as in non-parametric settings and that may cause measure-theoretic complications.

  • The computation of posterior or predictive quantities involves integration over different parameter spaces and thus increases the computational burden, since there is no time savings from one subspace to the next.

  • In some settings, the size of the collection of models is very large or even infinite and some models cannot be explored. For instance, in Example 4, the collection of all submodels is of size and some pruning method must be found in variable selection to avoid exploring the whole tree of all submodels.

0.3 Monte Carlo Methods

The natural approach to these computational problems is to use computer simulation and Monte Carlo techniques, rather than numerical methods, simply because there is much more to gain from exploiting the probabilistic properties of the integrands rather than their analytical properties. In addition, the dimension of most problems considered in current Bayesian Statistics is such that very involved numerical methods should be used to provide a satisfactory approximation in such integration or optimisation problems. Indeed, down-the-shelf numerical methods cannot handle integrals in moderate dimensions and more advanced numerical integration methods require analytical studies on the distribution of interest.

0.3.1 Preamble: Monte Carlo importance sampling

Given the statistical nature of the problem, the approximation of an integral like

should indeed take advantage of the special nature of , namely, the fact that is a probability density44endnote: 4The prior distribution can be used for importance sampling only if it is a proper prior and not a -finite measure. or, instead, that is proportional to a density. As detailed in Chapter II.2 this volume, or in Robert and Casella (2004, Chapter 3) and Robert and Casella (2010), the Monte Carlo method was introduced by Metropolis and Ulam (1949) for this purpose. For instance, if it is possible to generate (via a computer) random variables from , the average

converges (almost surely) to when goes to , according to the Law of Large Numbers. Obviously, if an i.i.d. sample of ’s from the posterior distribution can be produced, the average

(4)

converges to

and it usually is more interesting to use this approximation, rather than

when the ’s are generated from , especially when is flat compared with .

In addition, if the posterior variance is finite, the Central Limit Theorem applies to the empirical average (4), which is then asymptotically normal with variance . Confidence regions can then be built from this normal approximation and, most importantly, the magnitude of the error remains of order , whatever the dimension of the problem, in opposition with numerical methods.55endnote: 5The constant order of the Monte Carlo error does not imply that the computational effort remains the same as the dimension increases, most obviously, but rather that the decrease (with ) in variation has the rate . (See also Robert and Casella, 2004, 2009, Chapter 4, for more details on the convergence assessment based on the CLT.)

The Monte Carlo method actually applies in a much wider generality than the above simulation from . For instance, because can be represented in an infinity of ways as an expectation, there is no need to simulate from the distributions or to get a good approximation of . Indeed, if is a probability density with including the support of , the integral can also be represented as an expectation against , namely

This representation leads to the Monte Carlo method with importance function : generate according to and approximate through

with the weights . Again, by the Law of Large Numbers, this approximation almost surely converges to . And this estimator is unbiased. In addition, an approximation to is given by

(5)

since the numerator and denominator converge to

respectively, if includes . Notice that the ratio (5) does not depend on the normalising constants in either , or . The approximation (5) can therefore be used in settings when some of these normalising constants are unknown. Notice also that the same sample of ’s can be used for the approximation of both the numerator and denominator integrals: even though using an estimator in the denominator creates a bias, (5) does converge to .

While this convergence is guaranteed for all densities with wide enough support, the choice of the importance function is crucial. First, simulation from must be easily implemented. Moreover, the function must be close enough to the function , in order to reduce the variability of (5) as much as possible; otherwise, most of the weights will be quite small and a few will be overly influential. In fact, if is not finite, the variance of the estimator (5) is infinite (see Robert and Casella, 2004, Chapter 3). Obviously, the dependence on of the importance function can be avoided by proposing generic choices such as the posterior distribution .

0.3.2 First illustrations

In either point estimation or simple testing situations, the computational problem is often expressed as a ratio of integrals. Let us start with a toy example to set up the way Monte Carlo methods proceed and highlight the difficulties of applying a generic approach to the problem.

Example 8

Consider a -distribution sample with known. Assume in addition a flat prior as in a non-informative environment. While the posterior distribution on can be easily plotted, up to a normalising constant (Figure 1), because we are in dimension , direct simulation and computation from this posterior is impossible.

Figure 1: Posterior density of in the setting of Example 8 for , with a simulated sample from .

If the inferential problem is to decide about the value of , the posterior expectation is

This ratio of integrals is not directly computable. Since is proportional to a -distribution density, a solution to the approximation of the integrals is to use one of the ’s to “be” the density in both integrals. For instance, if we generate from the distribution, the equivalent of (5) is

since the first term in the product has been “used” for the simulation and the normalisation constants have vanished in the ratio. Figure 2 is an illustration of the speed of convergence of this estimator to the true posterior expectation: it provides the evolution of as a function of both on average and on range (provided by repeated simulations of ). As can be seen from the graph, the average is almost constant from the start, as it should, because of unbiasedness, while the range decreases very slowly, as it should, because of extreme value theory. The graph provides in addition the % empirical confidence interval built on these simulations.66endnote: 6The empirical (Monte Carlo) confidence interval is not to be confused with the asymptotic confidence interval derived from the normal approximation. As discussed in Robert and Casella (2004, Chapter 4), these two intervals may differ considerably in width, with the interval derived from the CLT being much more optimistic! The empirical confidence intervals are decreasing in , as expected from the theory. (This is further established by regressing the log-lengths of the confidence intervals on , with slope equal to , as shown by Figure 3.)

Figure 2: Evolution of a sequence of estimators (8) over iterations: range (in gray), and quantiles, and average, obtained on the same sample as in Figure 1 when simulating from the distribution with location .
Figure 3: Regression of the log-ranges (left) and the log-lengths of the confidence intervals (right) on , for the output in Figure 2.

Now, there is a clear arbitrariness in the choice of in the sample for the proposal . While any of the ’s has the same theoretical validity to “represent” the integral and the integrating density, the choice of ’s closer to the posterior mode (the true value of is ) induces less variability in the estimates, as shown by a further simulation experiment through Figure 4. It is fairly clear from this comparison that the choice of extremal values like and even more is detrimental to the quality of the approximation, compared with the median . The range of the estimators is much wider for both extremes, but the influence of this choice is also visible for the average which does not converge so quickly.77endnote: 7An alternative to the simulation from one distribution that does not require an extensive study on the most appropriate is to use a mixture of the distributions. As seen in Section 0.5.2, the weights of this mixture can even be optimised automatically.

Figure 4: Repetition of the experiment described in Figure 2 for three different choices of : , and (from left to right).

This example thus shows that Monte Carlo methods, while widely available, may easily face inefficiency problems when the simulated values are not sufficiently attuned to the distribution of interest. It also demonstrates that, fundamentally, there is no difference between importance sampling and regular Monte Carlo, in that the integral can naturally be represented in many ways.

Although we do not wish to spend too much space on this issue, let us note that the choice of the importance function gets paramount when the support of the function of interest is not the whole space. For instance, a tail probability, associated with say, should be estimated with an importance function whose support is . (See Robert and Casella, 2004, Chapter 3, for details.)

Example 9 (Continuation of Example 8)

If, instead, we wish to consider the probability that , using the -distribution is not a good idea because negative values of are somehow simulated “for nothing”. A better proposal (in terms of variance) is to use the “folded” -distribution , with density proportional to

on , which can be simulated by taking the absolute value of a rv. All simulated values are then positive and the estimator of the probability is

where the ’s are iid . Note that this is a very special occurrence where the same sample can be used in both the numerator and the denominator. In fact, in most cases, two different samples have to be used, if only because the support of the importance distribution for the numerator is not the whole space, unless, of course, all normalising constants are known. Figure 5 reproduces earlier figures for this problem, when using as the parameter of the distribution.

Figure 5: Evolution of a sequence of estimators (9) over iterations (same legend as Figure 2).

The above example is one-dimensional (in the parameter) and the problems exhibited there can be found severalfold in multidimensional settings. Indeed, while Monte Carlo methods do not suffer from the “curse of dimension” in the sense that the error of the corresponding estimators is always decreasing in , notwithstanding the dimension, it gets increasingly difficult to come up with satisfactory importance sampling distributions as the dimension gets higher and higher. As we will see in Section 0.5, the intuition built on MCMC methods has to be exploited to derive satisfactory importance functions.

Example 10 (Continuation of Example 2)

A particular case of generalised linear model is the probit model,

where denotes the normal cdf. Under a flat prior , for a sample , the corresponding posterior distribution is proportional to

(8)

Direct simulation from this distribution is obviously impossible since the computation of is a difficulty in itself. If we pick an importance function for this problem, the adequation with the posterior distribution will need to be better and better as the dimension increases. Otherwise, the repartition of the weights will get increasingly asymmetric: very few weights will be different from .

Figure 6 illustrates this degeneracy of the importance sampling approach as the dimension increases. We simulate parameters ’s and datasets for dimensions ranging from to , then represented the histograms of the largest weight for . The ’s were simulated from a distribution, while the importance sampling distribution was a distribution.

Figure 6: Comparison of the distribution of the largest importance weight based upon replications of an importance sampling experiment with observations and dimensions .

0.3.3 Approximations of the Bayes factor

As explained in Sections 0.2.2 and 0.2.3, the first computational difficulty associated with Bayesian testing is the derivation of the Bayes factor, of the form

where, for simplicity’s sake, we have adopted a model choice perspective (that is, and may live in completely different spaces).

Specific Monte Carlo methods for the estimation of ratios of normalising constants, or, equivalently, of Bayes factors, have been developed in the past five years. See Chen et al. (2000, Chapter 5) for a complete exposition, as well as Chopin and Robert (2010) and Marin and Robert (2007b) for recent reassessments. In particular, the importance sampling technique is rather well-adapted to the computation of those Bayes factors: Given a importance distribution, with density proportional to , and a sample simulated from , the marginal density for model , , is approximated by

where the denominator takes care of the (possibly) missing normalising constants. (Notice that, if is a density, the expectation of is and the denominator should be replaced by to decrease the variance of the estimator of .)

A compelling incentive, among others, for using importance sampling in the setting of model choice is that the sample can be recycled for all models sharing the same parameters (in the sense that the models are parameterised in the same way, e.g. by their first moments).

Example 11 (Continuation of Example 4)

In the case the ’s are simulated from a product instrumental distribution

the sample of ’s produced for the general model of Example 2, say, can also be used for the restricted model, , where , simply by deleting the first component and keeping the following components, with the corresponding importance density being

Once the ’s have been simulated,88endnote: 8We stress the point that this is mostly an academic exercise as, in regular settings, it is rarely the case that independent components are used for the importance function. the Bayes factor can be approximated by .

However, the variance of may be infinite, depending on the choice of . A possible choice is , with wider tails than , but this is often inefficient if the data is informative because the prior and the posterior distributions will be quite different and most of the simulated values fall outside the modal region of the likelihood. (This is of course impossible when is improper.) For the choice ,

(9)

is the harmonic mean of the likelihoods, but the corresponding variance is infinite when the likelihood has thinner tails than the prior (which is often the case). Having an infinite variance means that the numerical value produced by the estimate cannot be trusted at all, as it may be away from the true marginal by several orders of magnitude. As discussed in Chopin and Robert (2010) and Marin and Robert (2007b), it is actually possible to use a generalised harmonic mean estimate

when is a probability density with tails that are thinner than the posterior tails. For instance, a density with support an approximate HPD region is quite appropriate.

Explicitly oriented towards the computation of ratios of normalising constants, bridge sampling was introduced in Meng and Wong (1996): if both models and cover the same parameter space , if and , where and are unknown normalising constants, then the equality

holds for any bridge function such that both expectations are finite. The bridge sampling estimator is then

where the ’s are simulated from .

For instance, if

then is a ratio of harmonic means, generalising (9). Meng and Wong (1996) have derived an (asymptotically) optimal bridge function

This choice is not of direct use, since the normalising constants of and are unknown (otherwise, we should not need to resort to such techniques!). Nonetheless, it shows that a good bridge function should cover the support of both posteriors, with equal weights if .

Example 12 (Continuation of Example 2)

For generalised linear models, the mean (conditionally on the covariates) satisfies

where is the link function. The choice of the link function usually is quite open. For instance, when the ’s take values in , three common choices of are (McCullagh and Nelder, 1989)

corresponding to the logit, probit and log–log link functions (where denotes the c.d.f. of the distribution). If the prior distribution on the ’s is a normal , and if the bridge function is , the bridge sampling estimate is then

where the are generated from , that is, from the true posteriors for each link function.

The drawback in using bridge sampling is that the extension of the method to cases where the two models and do not cover the same parameter space, for instance because the two corresponding parameter spaces are of different dimensions, requires the introduction of a pseudo-posterior distribution (Chen et al., 2000; Marin and Robert, 2007b) that complete the smallest parameter space so that dimensions match.99endnote: 9Section 0.4.3 covers in greater details the setting of varying dimension problems, with the same theme that completion distributions and parameters are necessary but influential for the performances of the approximation. The impact of those completion pseudo-posteriors on the quality of the Bayes factor approximation is such that they cannot be suggested for a general usage in Bayes factor computation.

A completely different route to the approximation of a marginal likelihood is Chib’s (1995), which happens to be a direct application of Bayes’ theorem: given and , we have that

for every value of (since both the lhs and the rhs of this equation are constant in ). Therefore, if an arbitrary value of , say , is selected—e.g., the MAP estimate—and if a good approximation to can be constructed, denoted , Chib’s (1995) approximation to the marginal likelihood is

(10)

As a first solution, may be a Gaussian approximation based on the MLE, but this is unlikely to be accurate in a general setting. Chib’s (1995) solution is to use a nonparametric approximation based on a preliminary Monte Carlo Markov chain (Section 0.4) sample, even though the accuracy may also suffer in large dimensions. In the special setting of latent variables models (like the mixtures of distributions discussed in Example 6), this approximation is particularly attractive as there exists a natural approximation to , based on the Rao–Blackwell (Gelfand and Smith, 1990) estimate

where the ’s are the latent variables simulated by the Monte Carlo Markov chain algorithm. The estimate is a parametric unbiased approximation of that converges to the true density with rate . This Rao–Blackwell approximation obviously requires the full conditional density to be available in closed form (constant included) but this is the case for many standard models. Figure 7 reproduces an illustration from Marin and Robert (2007b) that compares the variability of Chib’s (1995) method against nearly optimal harmonic and importance sampling approximations in the setting of a probit posterior distribution. While the former solution varies more than the two latter solutions, its generic features make Chib’s (1995) method a reference for this problem.

Figure 7: In the setting of the probit modelling of R Pima Indian dataset, using three covariates and testing for the significance of the diabetes pedigree function, boxplots of 100 Chib’s, harmonic mean and importance estimates of , based on simulations from the prior distributions, for simulations (Source: Marin and Robert, 2007b).

While amenable to an importance sampling technique of sorts, the alternative approach of nested sampling (Skilling, 2006) for computing evidence is discussed in Chopin and Robert (2010) and Robert and Wraith (2009). Similarly, the specific approach based on the Savage-Dickey (Dickey, 1971; Verdinelli and Wasserman, 1995) representation of the Bayes factor associated with the null hypothesis , as

can be related to the family of bridge sampling techniques, even though the initial theoretical foundations of the method are limited, as explained in Robert and Marin (2009). This paper engineered a general framework that produces a converging and unbiased approximation of the Bayes factor, unrelated with the approach of Verdinelli and Wasserman (1995), that builds upon a mixed pseudo-posterior naturally derived from the priors under both the null and the alternative hypotheses.

As can be seen from the previous developments, such methods require a rather careful tuning to be of any use. Therefore, they are rather difficult to employ outside settings where pairs of models are opposed. In other words, they cannot be directly used in general model choice settings where the parameter space (and in particular the parameter dimension) varies across models, as in, for instance, Example 7. To address the computational issues corresponding to these cases requires more advanced techniques introduced in the next Section.

0.4 Markov Chain Monte Carlo Methods

As described precisely in Chapter II.3 and in Robert and Casella (2004), MCMC methods try to overcome the limitation of regular Monte Carlo methods through the use of a Markov chain with stationary distribution the posterior distribution. There exist rather generic ways of producing such chains, including Metropolis–Hastings and Gibbs algorithms. Besides the fact that stationarity of the target distribution is enough to justify a simulation method by Markov chain generation, the idea at the core of MCMC algorithms is that local exploration, when properly weighted, can lead to a valid representation of the distribution of interest, as for instance, when using the Metropolis–Hastings algorithm.

0.4.1 Metropolis–Hastings as universal simulator

The Metropolis–Hastings, presented in Robert and Casella (2004) and Chapter II.3, offers a straightforward solution to the problem of simulating from the posterior distribution : starting from an arbitrary point , the corresponding Markov chain explores the surface of this posterior distribution by a similarly arbitrary random walk proposal that progressively visits the whole range of the possible values of .


—Metropolis–Hastings Algorithm— At iteration Generate , Take θ^(t+1) = tif ut≤π(ξt—x)π(θ(t)—x)q(θ(t)ξt)q(ξtθ(t))θ(t)otherwise
Example 13 (Continuation of Example 10)

In the case , the probit model defined in Example 10 can also be over-parameterised as

since it only depends on . The Bayesian processing of non-identified models poses no serious difficulty as long as the posterior distribution is well defined. This is the case for a proper prior like

that corresponds to a normal distribution on and a gamma distribution on . While the posterior distribution on is not a standard distribution, it is available up to a normalising constant. Therefore, it can be directly processed via an MCMC algorithm. In this case, we chose a Gibbs sampler that simulates and alternatively, from

and

respectively. Since both of these conditional distributions are also non-standard, we replace the direct simulation by a one-dimensional Metropolis–Hastings step, using normal and log-normal random walk proposals, respectively. For a simulated dataset of points, the contour plot of the log-posterior distribution is given in Figure 8, along with the last points of a corresponding MCMC sample after iterations. This graph shows a very satisfactory repartition of the simulated parameters over the likelihood surface, with higher concentrations near the largest posterior regions. For another simulation, Figure 9 details the first steps, when started at . Although each step contains both a and a proposal, some moves are either horizontal or vertical: this corresponds to cases when either the or the proposals have been rejected. Note also the fairly rapid convergence to a modal zone of the posterior distribution in this case.

Figure 8: Contour plot of the log-posterior distribution for a probit sample of observations, along with points of an MCMC sample (Source: Robert and Casella, 2004).
Figure 9: First steps of the Metropolis–Hastings algorithm on the probit log-posterior surface, when started at .

Obviously, this is only a toy example and more realistic probit models do not fare so well with down-the-shelf random walk Metropolis–Hastings algorithms, as shown for instance in Nobile (1998) (see also Robert and Casella, 2004, Section 10.3.2, Marin and Robert, 2007a, Section 4.3, and Marin and Robert, 2007b).1010endnote: 10Even in the simple case of the probit model, MCMC algorithms do not always converge very quickly, as shown in Robert and Casella (2004, Chapter 14).

The difficulty inherent to random walk Metropolis–Hastings algorithms is the scaling of the proposal distribution: it must be adapted to the shape of the target distribution so that, in a reasonable number of steps, the whole support of this distribution can be visited. If the scale of the proposal is too small, this will not happen as the algorithm stays “too local” and, if there are several modes on the posterior, the algorithm may get trapped within one modal region because it cannot reach other modal regions with jumps of too small a magnitude. The larger the dimension is, the harder it is to set up the right scale, though, because

  1. the curse of dimension implies that there are more and more empty regions in the space, that is, regions with zero posterior probability;

  2. the knowledge and intuition about the modal regions get weaker and weaker;

  3. the proper scaling involves a symmetric matrix in the proposal . Even when the matrix is diagonal, it gets harder to scale as the dimension increases (unless one resorts to a Gibbs like implementation, where each direction is scaled separately).

Note also that the on-line scaling of the algorithm against the empirical acceptance rate is inherently flawed in that (a) the process is no longer Markovian and (b) the attraction of a modal region may give a false sense of convergence and lead to a choice of too small a scale, simply because other modes will not be visited during the scaling experiment.

0.4.2 Gibbs sampling and latent variable models

The Gibbs sampler is a definitely attractive algorithm for Bayesian problems because it naturally fits the hierarchical structures so often found in such problems. “Natural” being a rather vague notion from a simulation point of view, it routinely happens that other algorithms fare better than the Gibbs sampler. Nonetheless, Gibbs sampler is often worth a try (possibly with other Metropolis–Hastings refinements at a later stage) in well-structured objects like Bayesian hierarchical models and more general graphical models.

A very relevant illustration is made of latent variable models, where the observational model is itself defined as a mixture model,

Such models were instrumental in promoting the Gibbs sampler in the sense that they have the potential to make Gibbs sampling sound natural very easily. (See also Chapter II.3.) For instance, Tanner and Wong (1987) wrote a precursor article to Gelfand and Smith (1990) that designed specific two-stage Gibbs samplers for a variety of latent variable models. And many of the first applications of Gibbs sampling in the early 90’s were actually for models of that kind. The usual implementation of the Gibbs sampler in this case is to simulate the missing variables conditional on the parameters and reciprocally, as follows:


—Latent Variable Gibbs Algorithm—

At iteration Generate Generate