A Common Derivation for Markov Chain Monte Carlo Algorithms with Tractable and Intractable Targets
Markov chain Monte Carlo is a class of algorithms for drawing Markovian samples from high dimensional target densities to approximate the numerical integration associated with computing statistical expectation, especially in Bayesian statistics. However, many Markov chain Monte Carlo algorithms do not seem to share the same theoretical support and each algorithm is proven in a different way. This incurs a large amount of terminologies and ancillary concepts, which makes Markov chain Monte Carlo literature seems to be scattered and intimidating to researchers from many other fields, including new researchers of Bayesian statistics.
A generalised version of the Metropolis–Hastings algorithm is constructed with a random number generator and a self–reverse mapping. This formulation admits many other Markov chain Monte Carlo algorithms as special cases. A common derivation for many Markov chain Monte Carlo algorithms is useful in drawing connections and comparisons between these algorithms. As a result, we now can construct many novel combinations of multiple Markov chain Monte Carlo algorithms that amplify the efficiency of each individual algorithm. Specifically, we reinterpret slice sampling as a special case of Metropolis–Hastings and then propose two novel sampling schemes that combine slice sampling with directional or Hamiltonian sampling. Our Hamiltonian slice sampling scheme is also applicable in the pseudo marginal context where the target density is intractable but can be unbiasedly estimated, e.g. using particle filtering.
ayesian Statistics; Monte Carlo Integration; Markov chain Monte Carlo; Numerical Integration.
Many statistical analyses often require the numerical evaluation of the expectation of an arbitrary function of interest, , with respect to a given density defined as
where is the parametric vector in a given statistical model. This computation appears to be indeed a primal goal in many research areas, especially Bayesian statistics, because the expectation actually encapsulates all statistical information contained in the density . For example, letting for any set will lead to , which is the measure of with respect to the law of distribution given by . Other choices of the function also lead to an even more flexible computational framework. When we are also interested in the expectation of another function with respect to another density , then we can simply define and compute . Furthermore, if we only have access to an unnormalised version of denoted as , then the normalising constant , which can be of interest in itself according to e.g. Skilling (2006); Feroz et al. (2013), can be computed as with being some standard density such as a Gaussian.
Given such a central role and computational flexibility of the integration , there is great utility in designing general–purpose algorithms for approximating this expectation for arbitrary function and density . However, the expectation is also often a high dimensional integration problem, which renders conventional numeric approximation techniques such as Simpsonâs rule ineffective due to e.g. unknown integration boundaries, exponential growth of the computational load with respect to the number of dimensions (Kuo & Sloan, 2005).
Monte Carlo integration helps to reduce the computational load of high dimensional integration by way of random sampling from the target density and relying on the strong law of large number (Gut, 2013) to approximate the required integral as follows
One successful approach in sampling from high dimensional arbitrary density is to simulate a Markov chain with transition kernel that has an invariant density equal to the target density , i.e. . According to the law of total probability, this invariance condition means has to at least satisfy the following equality
A central limit theorem is given by Chan & Geyer (1994) for Markovian sampling as follows
The so called integrated autocorrelation time given in the central limit theorem (2) is a popular measure of inefficiency in using the Markovian samples to estimate , since it is the ratio between the estimator variance and the true variance of with respect to . This factor also highlights the fact that strongly correlated samples will lead to large estimator variance, i.e. inaccurate estimation of . Therefore the central goal in designing Markov chain Monte Carlo algorithms is to deliver Markovian samples that are as uncorrelated as possible. In this regard, some Markov chain Monte Carlo algorithms can outperform others depending on the specific application. However, choosing the best algorithm for a given application among many available sampling schemes is often a nontrivial matter.
The Metropolis–Hastings algorithm, which was pioneered by Metropolis et al. (1953) and later generalised by Hastings (1970), is a fairly general framework to construct the required transition kernel for arbitrary target densities at moderate dimensionality. However, there are many other algorithms for high dimensional applications, such as Gibbs (Geman & Geman, 1984), Hamiltonian (Duane et al., 1987), directional (Gilks et al., 1994), univariate or elliptical slice sampling (Neal, 2003; Murray et al., 2010), that appear on first reading to not fit into the current framework of Metropolis–Hastings as described by Roberts & Rosenthal (2004).
The independent understanding of many algorithms presents a few challenges to the readers of Bayesian statistics literature. First, it is difficult and also time consuming to understand multiple Markov chain Monte Carlo algorithms at once. It can also be much harder to compare or make any theoretical connections between multiple Markov chain Monte Carlo algorithms. Finally, it is difficult to judge when to use which algorithm among a plethora of options.
This work aims to address these difficulties by constructing a common mathematical framework for understanding multiple Markov chain Monte Carlo algorithms at once, which allows both transparent comparisons between these algorithms, and advantageous combinations that amplifies the efficiency of each individual algorithm. We now present this framework in the following section before deriving other Markov chain Monte Carlo algorithms as special cases.
2 Generalised Metropolis–Hastings
2.1 The Current Framework
The conventional Metropolis–Hastings algorithm is essentially constructed from a base proposal density and an acceptance test which is formulated in a way that ensures the correct invariant density of the resulting chain is identical to . Specifically, the potential candidate for the next sample is generated from a tractable density in denoted as
where is constructed from elementary random number generators to approximate the target . The mismatch between versus is corrected by applying an acceptance test where for some random scalar and an acceptance probability defined as
we choose the new sample to be
While being a very elegant approach in Markovian sampling, the Metropolis–Hastings framework cannot be employed to explain many other sampling algorithms. One important aspect of this framework is that the proposal density cannot be defined properly for other samplers, in the sense that while the proposal can be drawn by simulation, the associated density cannot be computed or properly defined in the same sampling space of .
where is some subspace of and . In trying to explain Gibbs sampling using the Metropolis–Hastings framework, we can superficially recognise that the proposal density in this case is identical to . Then since by design and , we can show that the acceptance probability is unity as follows
However, this interpretation of Gibbs sampling is improper because we have ignored the fact that this proposal only has a well–defined density in some subspace of , while the conventional Metropolis–Hastings requires to be a density in (Roberts & Rosenthal, 2004). A proper explanation for Gibbs sampling, isolated from the reasoning of Metropolis–Hastings sampling, can be found in (Tierney, 1994; Chan & Geyer, 1994).
While the problem appears to be only cosmetic in the case of Gibbs sampling, other sampling approaches present more significant deviation from the current framework of Metropolis–Hastings. We can see this in the case of directional sampling, as described by Roberts & Gilks (1994); Gilks et al. (1994); Chen & Schmeiser (1996), which has the following type of proposal
for some random variables and a constant scalar . For any given proposal density , it is not clear how to derive the analytical expression for so the expression for the acceptance probability (4) is to no avail here.
The Hamiltonian Monte Carlo method, which originated from physics literature (Duane et al., 1987), is also actively studied again by e.g. Girolami & Calderhead (2011); Neal (2012) to deal with high dimensional densities. This approach includes some exotic proposal mechanisms such as elliptical slice sampling by Murray et al. (2010) or No-U-Turn sampling by Hoffman & Gelman (2014), which result in both high acceptance rate and low sample correlation. In this approach, the proposal is constructed from the solution of the Hamiltonian differential equations
for a finite random duration such that the trajectory starts at and ends at . The auxiliary variable can be interpreted as a fictional random momentum vector while can now be interpreted as the position of a physical particle in a gravitational field characterised by . The solution to (9) then defines a trajectory through the space that is naturally guided by the gradient of and also preserves the total kinetic and potential energy represented by the fictional Hamiltonian term . This is essentially why the proposed sample can travel very far away from its origin at while maintaining a high acceptance probability, which is indeed unity if (9) is solved analytically. Again in this case, there is really no hope in retrieving an analytical expression for , and hence the current Metropolis–Hastings framework cannot properly explain the Hamiltonian sampling approach.
While we can rest assured that these and other algorithms all have their own theoretical support, the technical concepts and terminologies associated with each of them can be overwhelming to most readers of Bayesian computation literature, especially applied researchers who seek to employ these computational tools in their domain specific applications. Each of these types of proposals, e.g. (6), (8) and (9), leads to different ways to guarantee the invariance condition (1) and various forms of the acceptance probability in each algorithm. More importantly, given any specific target density, it is not at all clear how we can, at least conceptually, judge the pros and cons of each algorithm among an array of algorithms in the Markov chain Monte Carlo literature.
The Metropolis–Hastings algorithm is very well studied, e.g. by Roberts & Rosenthal (2004); Meyn & Tweedie (2009), and widely used, as disccused by Diaconis (2009). Hence there is great service in generalising this algorithm to accomodate a larger class of Markov chain Monte Carlo algorithms. Additionally, a common derivation for many algorithms not only helps to draw comparisons and connections between them, but also provides a flexible framework to synthesise multiple sampling approaches to result in more superior algorithms. This necessary generalisation of the conventional Metropolis–Hastings algorithm will now be presented before other algorithms and their strategic combinations can be given as special cases.
2.2 The Generalised Framework
In order to move to an understanding of multiple sampling approaches within a unified framework, we propose to generalise the Metropolis–Hastings algorithm by decomposing the proposal density into 2 sub–components. The first component is a random number generator
where represents all the random numbers generated in each iteration and can has as many dimensions as needed. For example, in the conventional Metropolis–Hastings proposal (3) we can define . This change of variable is quite trivial for Markov chain Monte Carlo algorithms that are special cases of conventional Metropolis–Hastings. However, it also allows us to describe other proposals such as Gibbs sampling in (6) by defining as comprised of the updated dimensions of as follows
The second component is a deterministic mapping in the expanded space of denoted as
where we are only interested in using to calculate the proposal and the auxiliary vector will simply be discarded. Later, we will show that the proposals (3),(6),(8) and (9) can all be seen as examples of a combination of the random draw (10) followed by the mapping (12).
The Markov transition can only be in fact implemented from these two aforementioned sub–components since any computer operation can only be either a deterministic calculation or a pseudo random number generation. This decomposition is inspired by Green (1995) in the context of model selection, where the number of sampling dimensions changes as the Markov chain proceeds. In what follows, we will apply this idea in the context of Markov chain Monte Carlo sampling and provide the necessary proofs, which also reveal surprising symmetry that exists in many current sampling methods, especially with slice sampling in section (3.6).
Let and denote the joint density of as , the generalised Metropolis–Hasting algorithm to draw can be described as follows
Draw a proposal and compute the generalised acceptance probability as
where is the determinant of the Jacobian of evaluated at .
For some , we set
We often can only compute but not , where the normalising constant is unknown. However, algorithm (1) can be executed without change when is replaced by since will simply be cancelled when computing the acceptance probability.
Algorithm (1) is indeed invariant with respect to the joint density in the expanded space and therefore is also invariant with respect to . Hence, let us simplify the notation by denoting the current and the next location of this Markov chain as
The inverse image of is also denote as . Given , we can derive the distribution of as follows:
where the first term is the probability of starting at and accepting a move to with probability , and similarly, the second term is the probability of starting at and rejecting a move with probability . We can substitute (13) into (14) to result in
Due to the mapping being self–inverse, we can apply the inverse function theorem, e.g. see (de Oliveira, 2013), and find that which reveals that
Each of the transition kernels investigated in this paper corresponds to a self–inverse mapping, which enables the given simple proof that relies on the invariance with respect to not only but also the joint density . This is a special symmetry that exists in many Markov chain Monte Carlo algorithms even though the original framework given by Green (1995) also allows for more general mapping to be used as explained in appendix (.1).
Many other algorithms can be seen as special cases of theorem (1) where the proposal and the mapping take on different forms. We now present some important classes of Markov chain Monte Carlo algorithms under this new framework along with their practical motivation. While a general framework to accommodate multiple algorithms is necessarily more involved than the conventional Metropolis–Hastings, we believe the main value of deriving new formulations for known algorithms is that now we can easily point out important theoretical connections that enable many advantageous combinations between those algorithms.
3 Variants of Generalised Metropolis–Hastings
so that the proposal density is consequently defined as . Since the self–inverse mapping (17) has unity Jacobian, the acceptance probability (13) reduces to the conventional acceptance probability (4). Within this class of algorithms, the Metropolis sampler is a special case when the proposal density is symmetrical in the sense that
which by substitution into (13) will lead to the further simplified acceptance probability
A classic example of this type of proposal is the Gaussian random walk constructed by
where is an adaptive approximation of the global covariance of as studied by Haario et al. (2001); Andrieu & Thoms (2008). The scaling factor is added in (19) can be tuned using a Robbins–Monro recursive adaptation scheme, as given by Andrieu & Thoms (2008); Robbins & Monro (1951), to achieve an optimal running average acceptance rate as recommended by Roberts et al. (1997); Roberts & Rosenthal (2001).
While these adaptation techniques can be successful in moderately high number of dimensions, the optimal acceptance rate is still much less than unity, which leads to increased sample correlation and enlarged estimation error due to the factor in equation (2). Besides, Metropolis sampling also implicitly assumes that the target density is unimodal and the initial choice of approximates the true covariance matrix well. Both of these assumptions are not verifiable a priori and possibly not true in higher dimensional applications. The performance of Metropolis sampling is illustrated for a scalar density using a toy model in example (1), appendix (.5).
3.2 Gibbs Sampling
The main advantage of Gibbs sampling is that we attain unity acceptance probability, given that we can draw exact samples from the conditional densities while leaving the component unchanged. For Gibbs sampling, the mapping is specified as
To get a full update for , we simply have to update the all the subcomponents either sequentially or randomly as discussed by MacEachern & Peruggia (2000), while knowing that no matter how is divided into any number of subcomponents and which subcomponent are being updated, the invariance condition (1) is always satisfies according to theorem (1).
One major drawback for Gibbs sampling and its associated variants, such as auxiliary variables methods by Besag & Green (1993); Damien et al. (1999) or component–wise slice sampling by Neal (2003), is that these samplers can only travel along the Cartesian coordinates of . This behaviour can produce highly correlated samples when the target density has high spatial correlation between its dimensions. One way to circumvent this situation is to estimate the global covariance matrix, as with Metropolis sampling, and then perform an affine transformation of the coordinate system to effectively move along the eigenvectors of instead. One example for this approach in the case of slice sampling is given by Tibbits et al. (2013).
Gibbs–based sampling approaches therefore suffer from the same drawbacks as with Metropolis sampling, with regard to estimating the covariance structure of the target , but to a lesser degree thanks to the unity acceptance probability. These Gibbs–based variants are now presented.
3.3 Generalised Metropolis–Hastings within Gibbs Sampling
For many applications, we can draw exact samples from some but not all of the conditional densities in the subspaces of . Therefore, Gibbs sampling can be a more practical approach if we can relax the exact conditional sampling requirement so it is permissible that in any subspace where direct sampling from the conditional density is not feasible.
It is probably well known that we can embed a conventional Metropolis–Hastings kernel inside a Gibbs sampling scheme. With the new framework, we can show in a fairly straightforward manner that performing any variant of the generalised Metropolis–Hastings within Gibbs sampling is indeed a special case of the generalised Metropolis–Hastings algorithm itself, with a different choice of mapping and proposal density. First, we redefine the mapping as follows
which leads to where any self–reverse mapping would suffice, e.g. can be given by a sub–version of (17) restricted to with unity Jacobian as follows
Furthermore, since and by design, the acceptance probability now becomes identical to that of a general Metropolis–Hastings algorithm with the target being the conditional density instead of the full density , i.e.
As we will see in section (3.7), being able to fit any Markov chain Monte Carlo algorithm in the framework provided by theorem (1) into a Gibbs sampling scheme allows us to strategically combine algorithms that were previously thought to be unrelated. We illustrate the advantage of Gibbs sampling and its immediate extension to (21)–(22) in example (2), appendix (.5).
3.4 Auxiliary Variables Method
The auxiliary variables method is a generalisation of the Swendsen–Wang algorithm for the Ising model in statistical mechanics given by Edwards & Sokal (1988), which is subsequently studied by Besag & Green (1993); Higdon (1998); Damien et al. (1999) and also, from a different viewpoint, by Neal (1997). In the context of high dimensional Markov chain Monte Carlo sampling, we can interpret the auxiliary variables method as a way to seperate out the irregular geometry induced by the data in the target density by introducing auxiliary random variables to transform these irregular components into conditionally uniform densities.
Specifically, when the target density can be factored into a product of factors denoted as , then we can introduce one independent auxiliary random variables for each factor with the following conditional density
so that the joint density of becomes
Since is a marginal density of , which is a uniform density as seen in (24), we can perform Gibbs sampling on to retrieve . To employ Gibbs sampling on , it is trivial to sample from the conditional density using (23), while depending on the complexity of , we can draw exact samples from some if not all of the conditional densities for different subspaces of denoted as
3.5 Slice Sampling
Introducing too many auxiliary variables into an even trivially simple model, as seen in example (3), appendix (.5), can leads to excessive spatial correlation between the original parameters and the auxiliary dimensions . Therefore, Neal (2003) argued that one auxiliary variable is often enough and leads to less correlated Markov samples. This so called slice sampling approach is then simply understood as a special case of auxiliary variable methods where the number of auxiliary factors is , which leads to the following Gibbs sampling scheme
To fully exploit the slice sampling approach, which essentially boils down to uniform sampling under the graph of in , we need to make clear the strong connections between slice sampling and conventional Metropolis–Hastings. By seeing both perspectives side by side, we can easily point out the extra possibilities available in a slice sampling approach.
Let us first propose to solve the sampling task (25) using a Metropolis sampling step in accordance with the Metropolis–Hastings within Gibbs approach given in (21)–(22). Specifically, we can propose , using some symmetrical proposal density , and compute the acceptance probability with respect to the target density (25) as follows
As observed by Higdon (1998), the acceptance probability (26) indeed corresponds to the acceptance probability (18) for an ordinary Metropolis sampler with target density ; since when we define for some , then (26) is also equivalent to
This equivalence between slice sampling and conventional Metropolis–Hastings can also be extended to the case with general, i.e. asymmetrical, proposal density by including as part of the target and consider the following extended slice sampling scheme
which admits and therefore also as its marginal densities.
After the sampling step (28), if we compute the proposal using some self–reverse mapping with Jacobian , then the acceptance probability of with respect to the uniform target density (29) is computed as follows
Finally because defining , for some , is also equivalent to
which is identical to the conventional Metropolis–Hastings acceptance test, we see that the only difference between conventional Metropolis–Hastings and the given extended slice sampling scheme is that is simulated independently from using the proposal instead of (28). While both options leave the target density invariant, we have found in our simulation that (28) gives smaller autocorrelation factor than plain Metropolis–Hastings with .
The given observation has strong implication to all variants of Metropolis–Hastings algorithm. While the conventional presentation of the Metropolis–Hastings algorithm (1) may lead us to often thinking that , or equivalently , is generated only after the proposal becomes available, this ordering is indeed immaterial and should be reversed. As with slice sampling, we can actually condition the density of , and accordingly , upon the specific value of in each iteration. For example, we should propose with larger distance from when is smaller and vice versa. This feature is fully exploited in slice sampling such that the proposal density of is locally adapted to both values of and as shown in the next section. We will also exploit the slice sampling scheme (28–29) to later design the Hamiltonian slice sampler given in section (3.10).
The main contribution of Neal (2003) is therefore not only to motivate the single auxiliary variable approach, but also to design efficient methods for uniform sampling from the set as required in (25) or (28–29), when an analytical solution for the boundary of is often not available. We illustrate the comparative performance of the single auxiliary variable approach motivated by Neal (2003) in example (4), appendix (.5). These methods for uniform sampling from the set , which are the main advantage edge in slice sampling, are now presented.
3.6 Recursive Proposal Generation in Slice Sampling
In conventional Metropolis sampling, rejection is both a necessity and a curse. The proof of theorem (1) shows that rejections play a critical role in maintaining the invariance condition (1). On the other hand, rejection is also a source of sample correlation which increase the factor given in (2). In contrast, slice sampling is an advantageous way to construct an abstract proposal density such that the acceptance probability is unity regardless of the set , i.e. we can guarantee with probability one in each slice sampling iteration.
Specifically, for a uniform density on any bounded set such as defined in (25), Neal (2003) provided a framework for continuing to generate subsequent proposals if the current proposal is rejected. In this framework, we can indeed generate infinitely many sequential proposals. This radical extension of the Metropolis–Hastings framework can be described in algorithm (2), whose condition for validity is accordingly given in theorem (2).
Similarly with , we define as the first accepted proposal in the sequence and as the image of through the mapping . We can conceptually think of algorithm (2) as a way to generate from a special density denoted as
To state the sufficient conditions for algorithm (2), we denote the probability of encountering a rejection in stage with the current sample being as
We similarly define the probability of rejection in stage with the current sample being as
where , i.e. the sequence represent the proposals generated with the same sampling scheme while the current sample being instead of .
The abstract proposal density is symmetrical in the sense that
if, given the notation , the sequence of proposal densities satisfies the following extended Metropolis condition
We can show that (31) indeed leads to (30) and therefore has an unity acceptance probability according to theorem (1). We proceed by observing that the density of can be identified as the sum probability of the mutually exclusive events of having as the first accepted proposal in stage as follows
And similarly, the density of is also written as follows
Algorithm (2) will be invariant with respect to the uniform density if always has an unity Jacobian.
According to section (3.3), since algorithm (2) is a special, and albeit rather unusual, case of theorem (1), we can embed it into the Gibbs sampling scheme (25) to show that slice sampling is indeed a special case of generalised Metropolis–Hastings, where the original target density is transformed into a uniform joint density by introducing .
The levels of symmetry required in theorem (2) seems rather prohibitive but nonetheless can be satisfied in a fairly general manner. Essentially, condition (31) can be guaranteed if the probability of arriving at starting from and vice versa is equal given any combination of rejected samples in between. We now present some specific cases of this approach.
First, let us consider the mapping as with Metropolis–Hastings where
Then we can construct from the sequence of posterior densities of the following artificial inference problem. First, we generate a sequence of artificial random vectors
where represents some arbitrary choice of coordinate origin. Next, we sample from a sequence of posterior densities of given the prior and data as follows
where is the corresponding finite normalising constant. Practically, it is best to put so that , but we will keep the general notation for ease of understanding. Similarly with , we also denote as the first draw that results an accepted sample .
The procedure in (35) explains how we simulate samples from . However, the abstract proposal in the reverse direction must be understood with care. First, the artificial vectors must now be conditioned upon instead of , which leads to so that if the sequence of vectors has the same values in both directions then the sequence of proposed samples will also be the same, i.e. , because
Also we have if is replaced with in (36), which means as expected by the mapping (34). Furthermore, there is an equivalence between the density of and according to the formulation of in (34) as follows
We can utilise these aforementioned properties to prove (31) only by showing that