On the Geometric Ergodicity of Hamiltonian Monte Carlo
We establish general conditions under which Markov chains produced by the Hamiltonian Monte Carlo method will and will not be geometrically ergodic. We consider implementations with both position-independent and position-dependent integration times. In the former case we find that the conditions for geometric ergodicity are essentially that the gradient of the log-density asymptotically points towards the centre of the space and grows no faster than linearly. In an idealised scenario in which the integration time is allowed to change in different regions of the space, we show that geometric ergodicity can be recovered for a much broader class of tail behaviours, leading to some guidelines for the choice of this free parameter in practice.
Supported by Xerox Research Centre Europe and EPSRC; by EPSRC; and by the Royal Society and EPSRC grants EP/P020720/1, EP/J016934/3 and EP/K034154/1.
MSC subject classifications: Primary 60J05; secondary 60J20, 60J22, 65C05, 65C40, 62F15, 60H30, 37A50.
Keywords: Markov chain Monte Carlo; Markov chains; stochastic simulation; Hamiltonian dynamics; Hamiltonian Monte Carlo; Hybrid Monte Carlo; geometric ergodicity.
This paper deals with ergodic properties of Markov chains produced by the Hamiltonian (or Hybrid) Monte Carlo method (HMC), a technique for approximating high-dimensional integrals through stochastic simulation. Iterative algorithms of this type are widely used in (for example) statistics and machine learning [20, 2], inverse problems, and molecular dynamics. In many of these settings a prior distribution can be constructed for an unknown quantity, and after conditioning on some observed data, Bayes' theorem gives a posterior; to extract relevant information from this posterior, typically high-dimensional integrals must be evaluated.
A popular approach to such problems is to simulate a Markov chain whose limiting distribution is the posterior, and compute long-run averages (e.g. ). Provided the chain is ergodic, a Law of Large Numbers holds for these averages. Several Markov chain Monte Carlo (MCMC) methods of this nature have been proposed in the literature, and many are well understood theoretically (e.g. [41, 42]). HMC has proven an empirical success, with numerous authors noting its superior performance in a variety of settings (e.g. ) and high-performance software available for its implementation. Comparatively few rigorous results, however, exist to justify this. Indeed, the absence of such analysis has been noted on more than one occasion [14, 16]. The major contribution of this work is to establish general scenarios under which geometric ergodicity can and cannot be established for Markov chains produced by common HMC implementations.
Consider a Borel space . In the most general case the space of interest to us is some geodesically complete Riemannian manifold of dimension , but in this article we focus on the case . We define a Markov chain on through an initial distribution and a family of mappings , indexed by an additional random variable defined on the Borel space and with law (e.g. ). A transition kernel can then be induced through the relation
where in this case, with , and is a family of ‘candidate’ maps, with . A candidate transition kernel is induced as for any . The ‘acceptance probability’ is defined via the Radon–Nikodym derivative
as . The resulting chain is reversible with respect to .
Simple choices for the family result in Markov chains which are intuitive and convenient to analyse. In the random walk case , with and a centred, symmetric distribution . For the Metropolis-adjusted Langevin algorithm (MALA) , with a standard Gaussian measure on , a constant, the gradient operator and the Lebesgue density of . The former is in some sense a naive choice, while the latter is an Euler–Maruyama scheme for the diffusion governed by , for which is invariant. In both cases proposals are local (only depending on analytic information at the current point), and is combined with linearly, with added complexity coming only through the (typically nonlinear) . As a result, simple bounds on allow stochastic stability properties such as -irreducibility to be deduced straightforwardly, and rates of convergence for different forms of are also well-understood in both cases [41, 42].
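As a concrete illustration of these two 'local' proposal maps, the sketch below implements both for a one-dimensional standard Gaussian target; the target, step sizes and variable names are illustrative assumptions rather than anything fixed by the text.

```python
import numpy as np

def grad_log_density(x):
    # gradient of log pi for a standard Gaussian target, pi(x) ∝ exp(-x^2/2)
    return -x

def rwm_proposal(x, xi, h=0.5):
    # random-walk proposal: current point plus a centred, symmetric innovation
    return x + h * xi

def mala_proposal(x, xi, h=0.5):
    # one Euler-Maruyama step of the overdamped Langevin diffusion for pi
    return x + 0.5 * h ** 2 * grad_log_density(x) + h * xi

rng = np.random.default_rng(0)
xi = rng.standard_normal()
print(rwm_proposal(2.0, xi), mala_proposal(2.0, xi))
```

Both proposals combine the innovation with the current state linearly, with any nonlinearity entering only through the gradient term; the Metropolis-Hastings correction would then be applied to either.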
The HMC method can also be considered within the above framework, as outlined in . The algorithm is designed to exploit the measure-preserving properties of Hamiltonian flow (e.g. ), which can be induced provided the state space for the chain is a symplectic manifold (e.g. ). The space can be made symplectic by doubling the dimension, introducing auxiliary momentum variables which follow some user-specified distribution. A Hamiltonian function can then be constructed on the resulting phase space which preserves a distribution for , the -marginal of which will be . At each step of the Markov chain, a fresh value for is drawn from its conditional distribution given the current state, and then the relevant Hamiltonian flow is approximated for units of time to produce the next proposed move. The resulting proposal map is
where denotes the projection operator onto the coordinate, the approximate flow for units of time, and . Typically the distribution for is chosen to be a -dimensional Gaussian. If the law of does not depend on , then the Störmer–Verlet (or leapfrog) numerical integrator is typically used to approximate the flow, with chosen as the integrator step-size and the number of ‘leapfrog steps’ (meaning ). The choice of is a point of ambiguity; often it is set to be some fixed value, but heuristics have also been suggested for choosing this dynamically (e.g. ). For (meaning ), HMC in fact reduces to MALA. In general, however, for (2) will be a non-linear function of , making analysis of the method challenging, particularly in the case of a dynamic .
Our main contribution is to establish conditions under which common HMC implementations produce a geometrically ergodic Markov chain. We also establish instances where convergence will not be geometric, meaning the sampler may perform poorly in practice. We first consider the case where the integration time is chosen independently of the current position, and show that here the non-linear terms in can be bounded in probability as the norm grows, under suitable assumptions, meaning that geometric convergence essentially occurs for HMC in the same scenarios as for MALA: when the tails of are uniformly exponential or lighter, but no lighter than those of a Gaussian density. We then consider an idealised scheme in which is chosen as a function of the current position, and show that in this case geometrically converging chains can be constructed for a much broader class of targets. Although the latter results are in an idealised case, they do offer some practical guidelines for the choice of integration time, which can be used to examine some commonly used heuristics in the literature as well as suggest alternatives.
1.1 Literature review
The HMC method was first introduced in lattice field theory , as a hybrid of two differing approaches to molecular simulation introduced in  and  respectively. A statistically-oriented review is given in . Several extensions have been suggested. A generalized scheme in which the momentum is only partially refreshed was introduced in  (see also ). Other extensions have been proposed to allow more directed motion and reduced rejections (e.g. ). A dynamic approach to tuning the integration time parameter was introduced through the ‘No-U-Turn Sampler’ of , which is now implemented in the Stan software . An extension showing how to implement the sampler on a Riemannian manifold which is globally diffeomorphic to is given in  (see also ), and to embedded manifolds with closed form geodesics in .
Theoretical study of MCMC methods is in the main focused on two themes: convergence to equilibrium and asymptotic variance. The first is often understood through upper bounding some suitable discrepancy between the th iterate of the Markov chain and its limiting distribution, as a function of . When the discrepancy is taken as either the Total Variation or -norm distance (for some suitable Lyapunov function ), then the drift and minorisation conditions popularised in  can be used to show that the distance to equilibrium decreases geometrically in (we elaborate in Section 3). If such a bound holds then for reversible chains a Central Limit Theorem exists for long-run averages of functionals (e.g. ). We take this approach here. Note that such techniques rely crucially on the chain being -irreducible for some -finite measure .
For HMC,  establish that if the potential energy is bounded above, continuous and has bounded derivative then the algorithm will produce a -irreducible chain. The result holds for both the exact flow and the leapfrog integrator variants of HMC. Typically the boundedness assumption on will only be satisfied when is compact. The authors also show that -irreducibility can be established more broadly if the integration time is chosen stochastically. More recently,  consider a continuous-time version of HMC in which the integration step-size is randomly sampled from an Exponential distribution. Under the assumption that Hamilton’s equations can be exactly integrated, they prove that the algorithm will produce a geometrically ergodic Markov chain whenever the tails of decay at a Gaussian rate or faster. The method of the authors is to relate HMC to underdamped Langevin dynamics, the ergodic properties of which are established in . By contrast, we relate HMC to overdamped Langevin dynamics, as analysed in . Although at first this may seem less natural, in fact it allows us to paint a broad picture of when the algorithm as used in practice will and will not produce a geometrically ergodic Markov chain. In  some practical approximations are given for convergence bounds under a positive curvature assumption on the underlying chain. We discuss these further in Section 7. We comment further on connections between HMC and Langevin dynamics in the supplementary material.
Asymptotic variances of long-run averages from Markov chains are often considered via analysing the expected squared jump distance ; at equilibrium this can then be optimised over the various parameters of the dynamics. Careful study of this quantity can also indicate how algorithm performance depends on . In the case of HMC such analysis has been performed , suggesting that the method scales more favourably than other approaches with dimension, and a larger optimal acceptance rate is attained.
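The expected squared jump distance can be estimated directly from a simulated trajectory. The sketch below does this for an illustrative random-walk Metropolis chain targeting a two-dimensional standard Gaussian; the proposal scale and chain length are arbitrary choices, not values taken from the paper.

```python
import numpy as np

def esjd(chain):
    # mean squared Euclidean distance between successive states of the chain
    diffs = np.diff(chain, axis=0)
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

rng = np.random.default_rng(4)
x = np.zeros(2)
chain = [x.copy()]
for _ in range(5000):
    y = x + 0.8 * rng.standard_normal(2)
    # Metropolis accept/reject step for the N(0, I) target
    if np.log(rng.uniform()) < 0.5 * (x @ x - y @ y):
        x = y
    chain.append(x.copy())
print(esjd(np.array(chain)))
```

At equilibrium this quantity can then be maximised over the tuning parameters of the dynamics (here the proposal scale 0.8) to compare or optimise samplers.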
Recently HMC has been generalised to the context of sampling on spaces of infinite dimension . Due to the frequent singularity of measures in such spaces, it is often necessary to characterise distance to equilibrium here through metrics other than Total Variation. Such analysis is beyond the scope of this paper, though we note that the recent works in the context of MALA in  and  are useful precursors in this direction.
Let denote a Borel space. Here we restrict attention to (and write for the Euclidean norm of ). For functions let mean that for some . Throughout let be a finite ‘target’ measure, and the corresponding Lebesgue density for some , and let be a distribution defined over . We will denote Lebesgue measure on with , the Dirac point mass at with and the standard Gaussian measure with . We write to denote a Markov transition kernel, meaning is a probability measure for any and is measurable for any . acts to the left on measures through and to the right on functions through . We let and say is invariant for if .
Denote the Total Variation distance between two distributions and on as . We say is a limiting distribution for if as , for -a.e. . Recall that an invariant distribution will be the unique limiting measure if is both -irreducible and aperiodic (e.g. ). We note that the convergence results presented here could equivalently be shown under the -norm distance .
2 Overview of Main Results
The majority of results in this paper concern the version of HMC which is typically used in practice, in which the ‘integration time’ for a typical proposal is chosen independently of the current position in the chain. In this scenario we have the following result.
Assumption A1 introduces a controlled degree of randomness into the integration time parameter, which ensures ergodicity of the HMC transition kernel. Instead of establishing -irreducibility directly on the multiple-step HMC transition, we make a simple stochasticity assumption on the integration time parameter, which allows much of the technical difficulty to be sidestepped. Assumption A2 imposes conditions on the distribution from which expectations are desired, essentially restricting the tail behaviour to be lighter than that of a Laplace distribution but no lighter than that of a Gaussian distribution. This is to ensure that when the chain is very far from the ‘centre’ of the space then typical proposals will bring it back to regions where probability mass concentrates. Assumption A3 relates to the Metropolis–Hastings acceptance rate, ensuring that it does not behave undesirably by frequently rejecting otherwise desirable proposals. We make these arguments precise in Section 5.
We also present the following conditions under which Markov chains produced using HMC will not be geometrically ergodic.
If either of the following hold then HMC will not produce a geometrically ergodic Markov chain:
(ii) There is an such that for all , and for every .
The first of these scenarios in essence covers the case where the distribution of interest has lighter tails than those of a Gaussian distribution. In this case explicit numerical solvers for Hamilton’s equations typically become unstable in some regions of the state space. The second is concerned with ‘heavy tailed’ distributions, in which the resulting Hamiltonian flow can be slow, precluding a geometric rate of convergence.
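The light-tailed instability behind scenario (i) can be seen in a small numerical experiment, not taken from the paper: for the potential U(x) = x^4 (a lighter-than-Gaussian member of the exponential family class introduced below) a leapfrog integrator with fixed step size keeps the energy under control near the origin but loses it entirely far out in the tails, where the local frequency of the dynamics exceeds the integrator's stability limit. The starting points, momenta and step size are arbitrary illustrative choices.

```python
import numpy as np

def max_energy_error(x, p, eps, n_steps):
    # worst-case energy drift of the leapfrog integrator for U(x) = x^4
    U = lambda x: x ** 4
    grad_U = lambda x: 4.0 * x ** 3
    H0 = U(x) + 0.5 * p * p
    worst = 0.0
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(x)   # half step in momentum
        x += eps * p                 # full step in position
        p -= 0.5 * eps * grad_U(x)   # half step in momentum
        worst = max(worst, abs(U(x) + 0.5 * p * p - H0))
    return worst

near = max_energy_error(np.float64(1.0), np.float64(0.5), eps=0.1, n_steps=6)
far = max_energy_error(np.float64(10.0), np.float64(5.0), eps=0.1, n_steps=6)
print(near, far)  # the energy error explodes when started in the tail
```

Since the Metropolis-Hastings acceptance probability decays with the energy error, proposals started far in the tail are almost surely rejected, which is the mechanism precluding geometric ergodicity in this case.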
To give some intuition for these results, we consider the Exponential Family class of models first introduced in  and denoted , in which and for all for some it holds that
for some and any . Different choices of correspond to different tail behaviours, with larger values resulting in ‘lighter’ tails. For the density is log-concave, and the specific choices and correspond to Laplace and Gaussian distributions.
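A minimal sketch of this family, with the scale parameter set to one (an illustrative choice): the potential is proportional to |x|^beta, so the gradient of the log-density grows like |x|^(beta-1), i.e. it is constant in the Laplace-like case beta = 1, linear in the Gaussian case beta = 2, and superlinear for lighter tails.

```python
import numpy as np

def potential(x, beta, alpha=1.0):
    # U(x) = alpha * |x|^beta, so that pi(x) ∝ exp(-alpha * |x|^beta)
    return alpha * np.abs(x) ** beta

def grad_potential(x, beta, alpha=1.0):
    # gradient of the potential, growing like |x|^(beta - 1)
    return alpha * beta * np.abs(x) ** (beta - 1) * np.sign(x)

x = 100.0
for beta in (1.0, 2.0, 4.0):
    print(beta, grad_potential(x, beta))
```

The three printed gradients (constant, linear and cubic in the distance from the origin) track exactly the tail regimes distinguished by the theorems of this section.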
For the exponential family class of models , under assumption A1 the following results hold:
(i) For , the Hamiltonian Monte Carlo method will produce a geometrically ergodic chain (for small enough in the case)
(ii) If or , then the Hamiltonian Monte Carlo method will not produce a geometrically ergodic chain
See Section 5.2.2. ∎
The results are analogous to those found for the Metropolis-adjusted Langevin algorithm in . A key finding of this work is that when the integration time parameter is chosen in a manner which is independent of the current position, then the two methods essentially coincide in terms of presence or absence of geometric ergodicity. In other words, taking more than a single leapfrog step in the method will not result in a chain ‘becoming’ geometrically ergodic, even though it may still improve the speed of convergence.
We also consider an idealised version of the method in Section 6, in which the integration time is allowed to depend on the current position in a prescribed way. This scheme was designed to mimic several more recent versions of HMC (e.g. ) which are commonly used in modern software packages (e.g. ). For a specific one-dimensional class of smooth exponential family models we find the following:
For the one-dimensional class of distributions with densities of the form
the idealised Hamiltonian Monte Carlo method introduced in Section 6 will produce a geometrically ergodic Markov chain for any choice of .
The positive result in the case where is an artefact of the assumption that Hamilton’s equations can be exactly solved in the idealised scheme; this result would disappear if a typical explicit numerical solver were used instead. However, the findings for the case suggest that there are advantages to using a position-dependent integration time in the presence of heavy tails. We discuss this in more detail in Section 7.
The approach taken here to establishing geometric convergence was popularised in the monograph , who built on earlier concepts from Doeblin and Harris (e.g. ) in particular. A key observation shown in that work is that for -irreducible and aperiodic Markov chains establishing a bound of the form
for some and -a.e. finite is equivalent to finding a -a.e. finite Lyapunov function with ‘small’ level sets, such that the condition holds for some , and some set with .
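As an illustration of verifying such a drift condition numerically, the sketch below estimates the ratio P V(x)/V(x) by Monte Carlo for MALA targeting a standard Gaussian, with the Lyapunov function V(x) = exp(|x|/2); the target, step size, test point and Lyapunov function are illustrative choices, not ones used in the paper.

```python
import numpy as np

def mala_kernel(x, h, rng):
    # one MALA transition for pi = N(0, 1): propose, then accept/reject
    log_pi = lambda y: -0.5 * y * y
    def log_q(y_to, x_from):
        # log proposal density of moving from x_from to y_to (up to a constant)
        m = x_from - 0.5 * h ** 2 * x_from
        return -((y_to - m) ** 2) / (2 * h ** 2)
    y = x - 0.5 * h ** 2 * x + h * rng.standard_normal()
    log_a = log_pi(y) + log_q(x, y) - log_pi(x) - log_q(y, x)
    return y if np.log(rng.uniform()) < log_a else x

V = lambda x: np.exp(0.5 * np.abs(x))   # candidate Lyapunov function
rng = np.random.default_rng(2)
x0 = 8.0                                 # a point far from the centre
PV = np.mean([V(mala_kernel(x0, 1.0, rng)) for _ in range(4000)])
ratio = PV / V(x0)
print(ratio)  # a drift condition requires this to be bounded below 1 far out
```

A ratio strictly below one far from the centre, together with small level sets of V, is exactly the geometric drift structure described above.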
It is shown in  that if is reversible then (3) is equivalent to the Markov chain exhibiting an absolute spectral gap. In this case then a Central Limit Theorem can be deduced for long-run averages of any (e.g. ).
We are concerned here with specific forms of .
We say is of the Metropolis–Hastings type if
where is a Markov kernel, is defined in (1) and .
If and admit Lebesgue densities and , is bounded away from and on compact sets, and there exists and such that, for every ,
then the Metropolis–Hastings chain with candidate density is -irreducible and aperiodic, and all compact sets are small.
Showing a lack of geometric ergodicity typically requires careful study of the distribution of return times to small sets. The following result of , however, provides a straightforward method for doing this for Metropolis–Hastings kernels.
If is of Metropolis–Hastings type, then (3) fails to hold if .
Lack of geometric ergodicity can also be established in some cases using the following result of .
If for any there is a such that
where , then can be geometrically ergodic only if for some .
If is of Metropolis–Hastings type, it is straightforward to verify that ensures (6), meaning we only need consider the candidate kernel in these cases.
4 Hamiltonian Monte Carlo
We give a brief introduction here. For a more detailed account see  or . We consider probability densities of the form for some . If we view as a ‘potential’ energy in a physical system, it is natural to consider the larger phase space and construct the Hamiltonian
where denotes a -dimensional ‘momentum’ variable, a ‘mass’ matrix and the ‘kinetic’ energy (other forms of kinetic energy are also possible, see e.g. ). Provided is differentiable, we can evolve the coordinates through time in such a way that for any using Hamilton’s equations
Solving (8) results in Hamiltonian flow. To put this presentation into the framework introduced in Section 1, we can consider constructing a measure-preserving map by setting the input to be , choosing a momentum variable , solving (8) for units of time and then projecting back down onto to produce . The parameters define the behaviour of a single map , and how they are chosen defines the behaviour of the Markov chain produced by iterating the process of randomly selecting a and then applying the resulting map to the current point to produce the next.
Of course, it is often not possible to solve (8) exactly, so numerical methods are needed. Fortunately, the rich geometric structure of Hamiltonian systems allows the construction of symplectic integrators, which possess attractive long term numerical stability properties (e.g. ), meaning that for appropriate Hamiltonians the approximate solution of (8) is such that for all , where . The standard choice when the Hamiltonian is of the form (7) is the Störmer–Verlet or leapfrog scheme, in which is generated from using steps of the recursion
for some step-size . Although the resulting approximate flow map no longer preserves , it can be used as a proposal mechanism within the Metropolis–Hastings framework (e.g. ). The full method is shown in Algorithm 1 below.
From this point forward we assume for ease of exposition but without loss of generality.
4.1 The marginal chain
To use the techniques of , it is helpful to express the HMC transition in such a way that when is large it is clear how the chain will behave. Although it is typically presented as a map on the larger phase space, HMC can simply be thought of as a Markov chain on , and we will find this representation useful in relation to the above. In this case the candidate map is given by the following proposition.
The HMC candidate map can be written
where , is the number of leapfrog steps and the integrator step-size. With this choice, the acceptance probability will be
Proposition 4.2 highlights the previously noted relationship between HMC and MALA quite explicitly, as setting means the third term on the right-hand side of (9) disappears, leaving the MALA proposal . It also highlights why taking poses a greater challenge, as for each with this term will typically be a nonlinear transformation of and . As is stochastic, will be also, but its distribution will often be intractable.
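The single-leapfrog-step reduction can be checked numerically: one leapfrog step with step size eps produces exactly the MALA proposal x + (eps^2/2) grad log pi(x) + eps xi. The quartic potential used below is an arbitrary illustrative choice.

```python
import numpy as np

def leapfrog_one_step(x, p, grad_U, eps):
    # a single Stormer-Verlet step: half kick, drift, half kick
    p = p - 0.5 * eps * grad_U(x)
    x = x + eps * p
    p = p - 0.5 * eps * grad_U(x)
    return x, p

grad_U = lambda x: x ** 3          # e.g. potential U(x) = x^4 / 4
eps = 0.3
x0, xi = 1.2, -0.7
x_hmc, _ = leapfrog_one_step(x0, xi, grad_U, eps)
x_mala = x0 - 0.5 * eps ** 2 * grad_U(x0) + eps * xi
print(abs(x_hmc - x_mala))  # identical up to floating point
```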
5 Results for a position-independent integration time
In this section we make the assumption that the distribution for the number of leapfrog steps does not depend on the current position. This is relaxed in Section 6.
It is known (e.g. ) that establishing -irreducibility is not so straightforward in the case of HMC as for Metropolis–Hastings methods based on random walks or Langevin diffusions. The canonical example where the system becomes reducible is integrating the harmonic oscillator over precisely one period (e.g. ). We show this in the supplementary material.
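The harmonic oscillator example can be made concrete under the assumption of exact integration: for the potential U(x) = x^2/2 the Hamiltonian flow is a rotation of phase space, so integrating for exactly one period maps every momentum draw back to the same position and the chain never moves. This is an illustrative sketch, separate from the construction in the supplementary material.

```python
import numpy as np

def exact_flow(x, p, t):
    # exact Hamiltonian flow for H(x, p) = x^2/2 + p^2/2: a rotation by angle t
    return x * np.cos(t) + p * np.sin(t), -x * np.sin(t) + p * np.cos(t)

rng = np.random.default_rng(3)
x = 1.3
for _ in range(5):
    p = rng.standard_normal()           # fresh momentum each iteration
    x, _ = exact_flow(x, p, 2 * np.pi)  # integrate over exactly one period
print(x)  # still the starting point (up to floating point): the chain is stuck

# a random integration time breaks the periodicity and the position moves
t = rng.uniform(0, 2 * np.pi)
x_new, _ = exact_flow(1.3, rng.standard_normal(), t)
```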
The observation noted here and elsewhere that HMC in the case corresponds to MALA, for which irreducibility is established in , can be exploited to alleviate these issues and establish -irreducibility of HMC under the following assumption.
A1 The distribution is such that , and for any fixed and , and that there is an such that .
When assumption A1 holds then the fact that HMC produces an ergodic Markov chain can be straightforwardly invoked from existing MALA results . The idea of randomising the integration time is commonly recommended for practical applications of the method (e.g. [36, 21]), and more theoretical motivation for doing so is given in . The finite exponential expectation condition is needed to ensure that the Lyapunov function used to prove geometric ergodicity is valid. One simple way to ensure this in practice (under the additional assumptions imposed on the potential in the next subsection) is that for some fixed , though weaker conditions than this are also possible.
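One concrete randomisation satisfying the spirit of A1 (an illustrative choice, not the only one) is to draw the number of leapfrog steps uniformly on {1, ..., L_max}: the value 1 then has positive probability, and the bounded support makes every exponential moment finite.

```python
import numpy as np

rng = np.random.default_rng(5)
L_max = 10
draws = rng.integers(1, L_max + 1, size=20000)   # uniform on {1, ..., L_max}
p_one = float(np.mean(draws == 1))               # close to 1 / L_max
exp_moment = float(np.mean(np.exp(0.1 * draws))) # finite: support is bounded
print(p_one, exp_moment)
```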
5.2 Geometric ergodicity
We first present here some seemingly abstract conditions under which the HMC method produces a geometrically ergodic Markov chain. We then give some natural assumptions on the potential under which these hold.
We present the results of this section conditioned on a fixed choice of the number of leapfrog steps , for ease of exposition. Note that if the required drift conditions hold for a fixed , then under A1 they will also hold when possible values for are averaged over according to , so this does not affect the generality of the results.
Notation. We introduce some further notation for this section. Let for some . In the case we will simply write . Let
denote the ‘mean’ next candidate position (), and
denote the proposal ‘drift’ (implying ). We will also sometimes write to keep the notation readable.
The HMC method produces a geometrically ergodic Markov chain if assumption A1 holds, and in addition both
where , and
where denotes the ‘potential rejection region’ and the ‘interior’ of .
The last integral asymptotes to zero as by (13). Writing to indicate that depends on , the first integral can be written
Noting that for large enough , and using (12) above, setting we can write
The last integral can be bounded above by the moment generating function of a Chi-distributed random variable with degrees of freedom, and so equals . Therefore by (12) the integral asymptotes to a quantity which is strictly less than one if is chosen to be suitably small.
Provided , then for large enough , meaning
which becomes negligibly small as , as required. ∎
5.2.1 Requirements for (12) to be satisfied.
In the case (12) corresponds to
whenever for some . The statements in this section show that three simple conditions which establish this are also sufficient to establish (12) when . The main result is stated below. The crucial consequence of this is that the behaviour of ‘global move’ updates produced by HMC when can be controlled through only ‘local’ knowledge, meaning analytic information at the current point .
For any (12) holds if the following conditions are met
In addition, if (SC1.3) is replaced by
for some , then there is an such that the same result holds provided .
The conditions of the result are intuitive. Condition (SC1.2) ensures that the gradient asymptotically ‘points inwards’, while (SC1.1) and (SC1.3) ensure that grows but at an asymptotically sublinear rate. We begin with a straightforward observation.
Sufficient conditions such that
First note that . Recall the generalised Bernoulli inequality: if and then . Setting , and then we have
This will be strictly less than in the limit as provided . Noting that
then it suffices to see that under (SC1.3) the right-hand side can be made arbitrarily close to zero by taking large enough. ∎
We can also recover some more intuitive sufficient conditions.
A more intuitive condition which implies (SC1.2a) conditional on (SC1.3) and (SC1.1) is
Using (SC1.1) and (SC1.2) gives
For large enough this implies
From now on we refer to (SC1.1), (SC1.2) and (SC1.3) combined as (SC1.1)-(SC1.3). Next we show that these same conditions are sufficient for (12) to hold.
Under the following conditions (12) holds:
Using the generalised Bernoulli inequality as above gives the result. ∎
Next we relate the conditions of Lemma 5.6 to criteria that only depend on the current point . The following lemmas give a starting point.
Provided and (SC1.3) holds then we have the following
(i) For any there is an such that whenever it holds that
(i) Noting that gives
(ii) We have from (i) and (SC1.3) that for any there is an such that whenever then . This implies using (i) that , and since can be made arbitrarily small, the result follows.
(iii) , which is using (i) and (ii) and the fact that ∎
Provided and (SC1.3) holds then for any and each the following hold
(i) For any there is an such that whenever it holds that
The results follow iteratively for each using the same approach as in the previous lemma. For the case , noting that , (i) follows from Lemma 5.7. It follows that and by an argument analogous to that lemma. Given this, it can be shown that (i) holds for , and then (ii) and (iii) by the same logic, and the argument can be iterated as many times as needed. The last claim follows trivially from the second. ∎
Under (SC1.1)-(SC1.3) then for any the conditions of Lemma 5.6 are satisfied.
First we show (i). Writing , then we have , which implies
which can be made arbitrarily small by taking large enough using (SC1.3). Noting that , then an analogous argument can be used to show that will also tend to zero as .
(ii) First note from above that . By an analogous argument to that used in the proof of Corollary 5.5, it is clear therefore that (ii) holds if
which in turn holds if the statement
does. The numerator can be decomposed as