Randomized Hamiltonian Monte Carlo
Tuning the durations of the Hamiltonian flow in Hamiltonian Monte Carlo (also called Hybrid Monte Carlo) (HMC) involves a tradeoff between computational cost and sampling quality, which is typically challenging to resolve in a satisfactory way. In this article we present and analyze a randomized HMC method (RHMC), in which these durations are i.i.d. exponential random variables whose mean is a free parameter. We focus on the small time step size limit, where the algorithm is rejection-free and the computational cost is proportional to the mean duration. In this limit, we prove that RHMC is geometrically ergodic under the same conditions that imply geometric ergodicity of the solution to underdamped Langevin equations. Moreover, in the context of a multi-dimensional Gaussian distribution, we prove that the sampling efficiency of RHMC, unlike that of constant duration HMC, behaves in a regular way. This regularity is also verified numerically in non-Gaussian target distributions. Finally we suggest variants of RHMC for which the time step size is not required to be small.
t1N. Bou-Rabee was partly supported by NSF grant DMS-1212058. \thankstextt2J. M. Sanz-Serna has been supported by Ministerio de Economía y Comercio, Spain (project MTM2013-46553-C3-1-P).
class=MSC] \kwd[Primary ]60J25 \kwd[; Secondary ]62D05, 60J25, 60H30, 37A50
Randomization \kwdMarkov Chain Monte Carlo \kwdHamiltonian Monte Carlo \kwdHybrid Monte Carlo \kwdLyapunov Functions \kwdGeometric Ergodicity \kwdIntegrated Autocorrelation Time \kwdEquilibrium Mean Squared Displacement
In the present article we suggest a randomized version of the Hamiltonian Monte Carlo (also called Hybrid Monte Carlo) algorithm that, under very general hypotheses, may be proved to be geometrically ergodic. The Hamiltonian Monte Carlo (HMC) algorithm is a general purpose Markov Chain Monte Carlo (MCMC) tool for sampling from a probability distribution [10, 21, 30, 33]. It offers the potential of generating proposed moves that are far away from the current location of the chain and yet may be accepted with high probability. The algorithm is based on integrating a Hamiltonian system and possesses two free parameters: the duration of the Hamiltonian flow and the time step size of the integrator. Unfortunately, the performance of HMC depends crucially on the values assigned by the user to those parameters; while for some parameter values HMC may be highly efficient, it is well known that, as discussed below, there are values for which the algorithm, in its simplest form, is not even ergodic. The Randomized Hybrid Monte Carlo (RHMC) addresses these shortcomings of HMC.
We target probability distributions of the form
where the negative loglikelihood is seen as a potential energy function. We assume that is at least differentiable and such that . As is the case with other MCMC methods, HMC does not require knowing the normalization factor . HMC enlarges the state space from to and considers the Boltzmann-Gibbs distribution in this space
where the artificial Hamiltonian function (or total energy) is taken to be
The vector plays the role of a mechanical momentum and the term is the corresponding kinetic energy. The target is the marginal of on ; the marginal on is, of course, the standard -dimensional Gaussian . More complicated kinetic energies of the form , with a symmetric positive definite mass matrix, may also be used, but, for notational simplicity, we restrict our study to the Hamiltonian (3).
The basic idea of HMC is encapsulated in the following procedure, where the duration is a (deterministic) parameter whose value is specified by the user.
Algorithm 1.1 (Hmc).
Given the duration parameter and the current state of the chain , the method outputs a state as follows.
- Step 1
Generate a -dimensional random vector .
- Step 2
Evolve over Hamilton’s equations associated to (3)
with initial condition .
- Step 3
Since Step 2 conserves the Boltzmann-Gibbs distribution, it is clear that the mapping preserves the target and therefore may be used to generate a Markov Chain having as an invariant distribution (in fact the resulting chain is actually reversible with respect to ). Note that Step 1 is easy to perform since it only involves generating a -dimensional normal random vector. This step is the only source of randomness in determining conditional on . The Hamiltonian flow in Step 2 is what, in principle, enables HMC to make large moves in state space that reduce correlations in the Markov chain . Roughly speaking, one may hope that, by increasing , moves away from , thus reducing correlation. However, simple examples show that this outcome is far from assured. Indeed, for the univariate normal distribution with , Hamilton’s equations coincide with those of the harmonic oscillator, , , and the flow is a rotation in the plane with period . It is easy to see (see Section 5 for a fuller discussion) that, if is taken from the target distribution, as increases from to , the correlation between and decreases and for , and are independent. However increasing beyond will cause an increase of the correlation and for , and the chain is not ergodic. For general distributions, it is likely that a small will lead to a highly correlated chain, while choosing too large may cause the Hamiltonian trajectory to make a U-turn and fold back on itself, thus increasing correlation .
In practice, a formula for the exact solution used in Step 2 is unavailable and a numerical solution is used instead. Thus, in addition to the duration of the Hamiltonian flow in Step 2, another key parameter in the HMC method is the time step size used to generate this numerical solution. To correct the bias introduced by time discretization error, a Metropolis-Hastings accept-reject step is also added [27, 15]. In order to keep the Metropolis-Hastings ratio simple, typically a volume-preserving and reversible method is used to numerically simulate Hamilton’s equations in Step 2 . The integrator of choice is the Verlet method, which is second-order accurate and, like Euler’s rule, only requires one new evaluation of the gradient per step. Unfortunately, time discretization does not remove the complex dependence of correlation on the duration parameter . For instance, in the preceding example where , it is easy to check that if is close to an integer multiple of and is suitably chosen, the Verlet numerical integration will result, for each , in (a move that will be accepted by the Metropolis-Hasting step). To avoid such poor performance, HMC is typically operated with values of that are randomized [22, 30]. Since, due to stability restrictions, explicit integrators cannot operate with arbitrarily large values of the time step, is typically chosen from a uniform distribution in an (often narrow) interval . In any case, the fact remains that increasing the duration parameter will increase the computational cost and may impair the quality of the sampling.
In this paper we randomize the duration of the Hamiltonian flow, cf. [22, 8, 30]. More precisely, in RHMC the lengths of the time intervals of integration of the Hamiltonian dynamics at the different steps of the Markov chain are identically distributed exponential random variables; these durations are mutually independent and independent of the state of the chain. In what follows we are primarily interested in the case where the procedure uses the exact Hamiltonian flow (or, in practical terms, where the integration is carried out with such a small value of which ensures that essentially all steps of the chain result in acceptance). This leaves the mean duration as the only free parameter. In this exact integration scenario, we prove that, regardless of the choice of the mean duration, RHMC is geometrically ergodic. Furthermore, we show that the dependence of the performance of RHMC on this mean duration parameter is simpler than the dependence of the performance of HMC on its constant duration parameter. A full discussion of the situation where time-discretization errors are taken into account requires heavy use of numerical analysis techniques and will be the subject of a future publication. Nevertheless in Section 6 we present two variants of RHMC, based on the ideas of Markov Chain Approximation methods (MCAM) [19, 7], that replace the Hamiltonian flow by a numerical approximation and may be treated with the infinitesimal tools we employ in the exact integration scenario.
Section 2 provides a description of the RHMC method and its infinitesimal generator . In Section 3 we prove that the measure is infinitesimally invariant for . We then construct a Lyapunov function for RHMC that is of the same form as that used for the Langevin equations and requires similar assumptions on the potential energy function [35, 26, 37]. Here it is important to point out that, while the Langevin dynamics explicitly includes friction, the dissipative behavior of RHMC comes from the randomization of the momentum. In particular, if the chain is at a location of high potential energy, the Hamiltonian dynamics will typically change a large part of the potential energy into kinetic energy; then, with high probability, the next momentum randomization will decrease the kinetic energy. With this Lyapunov function, we extend a local semimartingale representation of the process. It also follows that Dynkin’s formula holds for functions that satisfy a mild growth condition. Using Dynkin’s formula we prove that the measure is invariant for RHMC. Using a generating function for the Hamiltonian flow and a Duhamel formula for the Markov semigroup associated with the RHMC process, we prove that the transition probability distribution of RHMC satisfies a minorization condition. We then invoke Harris’ theorem to conclude that RHMC is geometrically ergodic with respect to , for any choice of the mean duration parameter .
Section 4 considers the model problem of a multi-dimensional Gaussian target distribution [22, 30]. For both RHMC and HMC, explicit formulas are derived for (i) the integrated autocorrelation time associated to the standard estimator of the mean of the target distribution; and (ii) the equilibrium mean-squared displacement. These formulas imply that the sampling efficiency of RHMC behaves in a simple way, while the sampling efficiency of HMC displays complicated dependence on the duration parameter. Section 5 presents numerical tests for a two-dimensional double well potential energy and a potential energy used in the chemistry literature for a pentane molecule. These tests support our theoretical findings. Two variants of RHMC that do not assume that integration errors are negligible are suggested in Section 6. The first of them is easily proved to have a Lyapunov function under suitable hypotheses but introduces a bias in the target distribution; the second removes the bias by allowing momentum flips at a rate dictated by a Metropolis-Hastings ratio. There is an Appendix devoted to Harris theorem.
To summarize, the main theoretical contributions of this paper are the following.
a proof of a Foster-Lyapunov drift condition for the infinitesimal generator of RHMC;
a solution to a martingale problem for RHMC;
a proof that infinitesimal invariance of the measure and Dynkin’s formula imply that the measure is invariant for RHMC;
a minorization condition for RHMC;
a proof that is the unique invariant measure of RHMC and that RHMC is geometrically ergodic (which combines all of the previous results); and,
introduced two practical implementations of RHMC, including one that is unbiased.
Let us finish this introduction by pointing out that the RHMC method is related to Anderson’s impulsive thermostat common in molecular dynamics, which describes a molecular system interacting with a heat bath [2, 20, 11]. The molecular system is modeled using Hamiltonian dynamics, and its interaction with a heat bath is modeled by collisions that cause an instantaneous randomization of the momentum of a randomly chosen particle. The times between successive collisions are assumed to be i.i.d. exponential random variables. In Ref. , E and Li prove that the Anderson thermostat on a hyper-torus is geometrically ergodic. Since the state space is bounded, the main issue in that proof is the derivation of a minorization condition for the process. Our proof of geometric ergodicity of the RHMC method can be modified to extend their result to an unbounded space.
2 RHMC Method
Here we provide step-by-step instructions to produce an RHMC trajectory, and afterwards, introduce the infinitesimal generator of RHMC.
The RHMC method generates a right-continuous with left limits (càdlàg) Markov process . While Algorithm 1.1 was formulated in , the process is defined in the enlarged space to include the possibility of partial randomization of the momentum, as in the generalized Hybrid Monte Carlo of Horowitz [17, 18, 1]. This process can be simulated by iterating the following steps. The mean duration and the Horowitz angle are deterministic parameters.
Algorithm 2.1 (Rhmc).
Given the current time and the current state , the method computes the next momentum randomization time and the path of the process over as follows.
- Step 1
Update time via where .
- Step 2
- Step 3
- Step 4
Randomize momentum by setting
- Step 5
On test functions , the infinitesimal generator of is given by
The expectation in the momentum randomization operator is over the random variable defined as
where and . A sample path of this process is given by the Hamiltonian flow associated with (3) with intermittent and instantaneous jumps in momentum. The random times between successive jumps are independent and exponentially distributed with mean . The Horowitz angle is a deterministic parameter that governs how much the momentum immediately after a jump depends on the value of the momentum immediately prior to a jump. The case in (6) has to be excluded because it leads to and then the generator reduces to the Liouville operator associated with the Hamiltonian , which is in general not ergodic with respect to . Note that, if , the random vector does not depend on (complete momentum randomization).
3 Geometric Ergodicity of RHMC
In this section we prove that is the unique invariant probability measure of RHMC and that RHMC is geometrically ergodic. Our main tool to prove geometric ergodicity is Harris Theorem. In Appendix A, we recall this theorem in the present context of a continuous-time Markov process with an uncountable state space.
A main ingredient in Harris theorem is Hypothesis A.1 on the existence of a Lyapunov function, which we refer to as a Foster-Lyapunov drift condition. We formulate this condition in terms of an abstract infinitesimal generator whose precise meaning is given in Definition A.1. Note that this definition of an infinitesimal generator accommodates the Lyapunov function introduced below because for any the process
is always a local martingale, and hence, according to Definition A.1 the function belongs to the domain of . Note that here we assume that this Lyapunov function is continuously differentiable, since the operator involves partial derivatives. After showing that is a Lyapunov function for , we apply this Lyapunov function to solve the martingale problem for the operator on functions that are with globally Lipschitz continuous derivative. This solution is used to show that is an invariant measure for RHMC. We then prove that the transition probabilities of RHMC satisfy a minorization condition given in Hypothesis A.2. With these pieces in place, we invoke Harris Theorem to conclude that is the unique invariant measure for RHMC, and that RHMC is geometrically ergodic.
To establish a Foster-Lyapunov drift condition, we use a Lyapunov function originally introduced to prove stochastic stability of a Hamiltonian system with dissipation and random impulses; see §2.2 and equation (14) of . See also Section 3 of Ref.  and Section 2 of Ref.  for an application of this Lyapunov function to prove geometric ergodicity of the solution to underdamped Langevin equations with non-globally Lipschitz coefficients. This Lyapunov function is of the form:
where is the Hamiltonian given earlier in (3), and and are constants given by:
Since and , note from (8) that both and are positive.
Throughout this section we will use the following conditions on the potential energy . Note that not all of these assumptions will be required for every statement, but we find it notationally convenient to have a single set of assumptions to refer to.
The potential energy satisfies the following conditions.
One has and .
Let and be the constants appearing in (7). Then there exist and such that
for all .
We stress that these assumptions on the potential energy function are typically made to prove geometric ergodicity of the solution to underdamped Langevin equations. For instance, see Equation (13) of Ref. , Hypothesis 1.1 in Ref. , and Condition 3.1 in Ref. . Hypothesis H1 and (7) imply that for any constants , we have that
In other words, this Lyapunov function is integrable with respect to .
The hypothesis that the potential is bounded from below by itself guarantees that the kinetic energy and therefore the momenta are bounded as time increases. It follows that the configuration variables grow at most linearly with time and therefore solutions of Hamilton’s equations are well-defined for all real time.
3.2 The Measure is an Infinitesimally Invariant Measure
As expected, the Boltzmann-Gibbs distribution in (2) is an infinitesimally invariant probability measure for the process . By implication the target is infinitesimally invariant for the process . To state this result, let denote the space of compactly supported smooth functions from to .
Suppose Hypothesis 3.1 H1 holds. Then for any we have that
Moreover, integration by parts shows that the Liouville operator leaves infinitesimally invariant. In particular, the boundary terms resulting from the integration by parts vanish because is compactly supported. ∎
Later in this section, we strengthen this result to is the unique invariant probability measure for RHMC.
3.3 Foster-Lyapunov Drift Condition
The following Lemma is remarkable because it states that the infinitesimal generator possesses a Lyapunov function, even though RHMC does not incorporate explicit dissipation.
This Lemma implies that Hypothesis A.1 of Harris Theorem holds. The proof below shows that the momentum randomizations in RHMC are the source of dissipation in RHMC.
Let . From (7), note that:
and if we set , note from (6) that:
where the expected value is taken over the -dimensional standard normal vector . We recall that we excluded the case in the definition of in (6). A direct calculation shows that:
We now choose and such that
In other words, we pick and to eliminate the cross term in , and to rescale the term so that it matches the coefficient of the term. Solving these equations yields (8). With this choice of and , (10) simplifies to:
Choosing and invoking Hypothesis 3.1 H2, we obtain
To finish the proof, recall that by Hypothesis 3.1 H1, and thus, it suffices to check that the quadratic form
appearing in is positive definite. This condition is met when
which holds for all . ∎
The proof of Lemma 3.2 shows that . Thus, we see that if is smaller or is closer to , then becomes larger. This result is expected since momentum randomizations are the source of dissipation in RHMC, and smaller implies more randomizations of momentum, and larger implies the momentum is randomized more completely.
3.4 Martingale Problem
Here we use Lemma 3.2 to solve the martingale problem for the operator on functions that are with globally Lipschitz continuous derivative.
Suppose Hypothesis 3.1 holds. For all globally Lipschitz and continuously differentiable functions , for any initial condition , and for any , the local martingale
is a martingale.
A key ingredient in the proof given below is the Lyapunov function for from Lemma 3.2. In particular, since globally Lipschitz functions grow at most linearly, the class of functions that appear in Lemma 3.5 are bounded by this Lyapunov function. Moreover, Dynkin’s formula holds for this class of functions:
See, e.g., Chapter 1 of Ref.  for more discussion on Dynkin’s formula.
In this proof, we use the well-known fact (see, e.g., Corollary 3, Chapter II in Ref. ) that a local martingale with integrable quadratic variation is a martingale. Let denote the quadratic variation of on the interval . The global Lipschitz assumption on implies that there exists a constant such that:
for all . Moreover, since the process satisfies Hamilton’s differential equations in between consecutive momentum randomizations, and since the process is continuous and of finite variation, the quadratic variation of is equal to the sum of the squares of the jumps in . Thus,
In other words, the global Lipschitz property of enables bounding the quadratic variation of the scalar-valued process by the quadratic variation of the components of the process .
Let be the sequence of random times at which the momentum randomizations occur. This sequence can be produced by iterating the recurrence relation: with initial condition and where are i.i.d. exponential random variables with mean . Let be the (random) number of momentum randomizations that have occurred up to time . Note that is a Poisson process with intensity , and hence, a.s. bounded. This permits us to interchange expectation and summation in the following inequalities,
where in the last two steps we used the Lyapunov function given in Lemma 3.2, and in addition, we introduced the positive constant with and being the constants defined in (8). Thus, we may conclude that is a martingale for any .
Alternatively, we could have used the compensator of :
which would give a similar bound on . ∎
3.5 The Measure is an Invariant Measure
Suppose Hypothesis 3.1 holds. For any and for any ,
To prove this Lemma, we use Dynkin’s formula and condition on a fixed sequence of jump times to exploit the fact that the Hamiltonian flow and the momentum randomization individually leave invariant.
Let and . Referring to (5), since the function
is smooth, compactly supported in the component, and bounded in the component, and hence, . Moreover, since any smooth function with compact support is globally Lipschitz continuous, we can invoke Lemma 3.5 to conclude that for any and for any , the process
is a martingale. Thus, Dynkin’s formula holds
and in particular,
where we used Fubini’s theorem to interchange time and space integrals. This interchange (and subsequent ones) are valid since the function is bounded by the Lyapunov function, which is an integrable function under the measure . We next argue that the second term on the right hand side of (14) vanishes due to Lemma 3.1 and some basic properties of Hamiltonian flows (volume and Hamiltonian preserving) and the momentum randomization map (Boltzmann-Gibbs preserving).
For this purpose, and as in Lemma 3.2, let denote a realization of the sequence of random times at which the momentum randomizations occur. For any , let be the Hamiltonian flow map associated to . We recall that the jump times and momentum randomizations are mutually independent, and that the number of jumps in any finite time interval is a.s. finite. By change of variables, and using the volume and Hamiltonian preserving properties of the Hamiltonian flow , note that:
holds for any . In addition, since is an invariant measure for the momentum randomization map and , we have the identity:
These facts motivate us to decompose the process into its Hamiltonian and momentum randomization pieces. To do this, and with a slight abuse of notation, we regard the process as an evolution operator and decompose it via
for all . To leverage this decomposition, we use to split the time integral appearing in the second term of the right hand side of (14) into time intervals between consecutive momentum randomizations:
In this form, we can take advantage of the independence between momentum randomizations and jump times, in order to simplify this term. In particular, we condition on the jump times and then average over individual momentum randomizations to obtain
where we sequentially used (15) and (16) from initial time up to final time . Note that to use (16) one has to average over the Gaussian random vector associated to the th momentum randomization for in the inner-most expectation. Substituting this result back into (14) we obtain:
To finish, we invoke Lemma 3.1, which implies that is infinitesimally invariant for , and hence, the second term on the right hand side of this equation vanishes, as required. ∎
3.6 Minorization Condition
For , let denote the transition semigroup of the Markov process
and let denote the associated transition probability distribution
Recall that the process only moves by either the Hamiltonian flow for a random duration or momentum randomizations that are instantaneous. Thus, we expect the semigroup to not have the strong Feller property , since it lacks a sufficient regularizing effect. Nevertheless, we can prove a minorization condition for this process by using the weaker regularizing effect of the momentum randomizations.
Suppose Hypothesis 3.1 holds. For every compact set , there exist a probability measure over , and such that:
holds for all .
To prove this proposition, it is convenient to introduce the following operator:
where is defined as:
Here denotes the derivative with respect to the th component of . Moreover, the generating function can be written as
where solves the Euler-Lagrangian equations with boundary conditions and . In discrete mechanics, the equations (19) and the generating function are known as discrete Hamilton’s equations and the (exact) discrete Lagrangian, respectively .
We adapt to our setting some ideas from the proof of Theorem 2.3 in . To establish the desired result, we use the (weak) regularizing effect of the operator on a function in the momentum degrees of freedom. Since the Hamiltonian flow is regular , this regularizing effect can be transferred to the position degrees of freedom of . Similar results appear in Lemma 2.2 of  and the proof of Theorem 2.2 of .
Specifically, a change of variables shows that:
where we have introduced the transition density of the operator :
To take advantage of (21), we consider the following Duhamel formula:
A second application of this Duhamel formula implies that
A third application of this formula yields
Next we show that Hypothesis A.2 of Harris Theorem holds for the transition probabilities of the RHMC method. To state this Proposition, we will use the total variation (TV) distance between measures. Recall that
where the supremum runs over all measurable sets. In particular, the total variation distance between two probability measures is two if and only if they are mutually singular.
Proposition 3.7 implies the following transition probability is well-defined:
for any satisfying . Therefore,
for all satisfying . Since the TV norm is bounded by , one obtains the desired result. ∎
3.7 Main Result: Geometric Ergodicity
With a Lyapunov function and minorization condition in hand, we are now in position to state a main result of this paper.
According to Lemma 3.2, the generator satisfies a Foster-Lyapunov Drift Condition. Moreover, its associated transition probabilities satisfy Lemma 3.8. Hence, Theorem A.1 implies that (i) the process possesses a unique invariant distribution, and (ii) the transition probability of the process converges to this invariant distribution geometrically fast in the TV metric. Since, by Prop. 3.1, is an infinitesimally invariant measure for the process, and that compactly supported functions are in the domain of , it follows that is the unique invariant measure for the process. ∎
4 Model Problem
For simplicity, we assume in this section and the next that the Horowitz angle is chosen as , i.e., the momentum randomizations are complete rather than partial.
We quantify the sampling capabilities of the RHMC method given in Algorithm 2.1 on a model problem in which the target is a multivariate Gaussian distribution with uncorrelated components, some of them with small variances. We also compare against the standard HMC method given in Algorithm 1.1, where the duration is a fixed parameter. This model problem is discussed in  and analyzed further in . This distribution can be interpreted as a truncation of an infinite-dimensional normal distribution on a Hilbert space . The potential energy function is given by: