On the Infinite Swapping Limit for Parallel Tempering
Parallel tempering, also known as replica exchange sampling, is an important method for simulating complex systems. In this algorithm simulations are conducted in parallel at a series of temperatures, and the key feature of the algorithm is a swap mechanism that exchanges configurations between the parallel simulations at a given rate. The mechanism is designed to allow the low temperature system of interest to escape from deep local energy minima where it might otherwise be trapped, via those swaps with the higher temperature components. In this paper we introduce a performance criteria for such schemes based on large deviation theory, and argue that the rate of convergence is a monotone increasing function of the swap rate. This motivates the study of the limit process as the swap rate goes to infinity. We construct a scheme which is equivalent to this limit in a distributional sense, but which involves no swapping at all. Instead, the effect of the swapping is captured by a collection of weights that influence both the dynamics and the empirical measure. While theoretically optimal, this limit is not computationally feasible when the number of temperatures is large, and so variations that are easy to implement and nearly optimal are also developed.
Key words. Markov processes, pure jump, large deviations, relative entropy, ergodic theory, martingale, random measure
AMS subject classifications. 60J25, 60J75, 60F10, 28D20, 60A10, 60G42, 60G57
The problem of computing integrals with respect to Gibbs measures occurs in chemistry, physics, statistics, engineering and elsewhere. In many situations, there are no viable alternatives to methods based on Monte Carlo. Given an energy potential, there are standard methods to construct a Markov process whose unique invariant distribution is the associated Gibbs measure, and an approximation is given by the occupation or empirical measure of the process over some finite time interval . However, a weakness of these methods is that they may be slow to converge. This happens when the dynamics of the process do not allow all important parts of the state space to communicate easily with each other. In large scale applications this occurs frequently, since the potential function often has complex structures involving multiple deep local minima.
An interesting method called “parallel tempering” has been designed to overcome some of the difficulties associated with rare transitions [4, 6, 21, 22]. In this technique, simulations are conducted in parallel at a series of temperatures. This method does not require detailed knowledge of or intricate constructions related to the energy surface and is a standard method for simulating complex systems. To illustrate the main idea, we first discuss the diffusion case with two temperatures. Discrete time models will be considered later in the paper, and there are obvious analogues for discrete state systems.
Suppose that the probability measure of interest is , where is the temperature and is the potential function. The normalization constant of this distribution is typically unknown. Under suitable conditions on , is the unique invariant distribution of the solution to the stochastic differential equation
where is a -dimensional standard Wiener process. A straightforward Monte Carlo approximation to is the empirical measure over a large time interval of length , namely,
where is the Dirac measure at and denotes a “burn-in” period. When has multiple deep local minima and the temperature is small, the diffusion can be trapped within these deep local minima for a long time before moving out to other parts of the state space. This is the main cause for the inefficiency.
Now consider a second, larger temperature . If and are independent Wiener processes, then of course the empirical measure of the pair
gives an approximation to the Gibbs measure with density
The idea of parallel tempering is to allow “swaps” between the components and . In other words, at random times the component is moved to the current location of the component, and vice versa. Swapping is done according to a state dependent intensity, and so the resulting process is actually a Markov jump diffusion. The form of the jump intensity can be selected so that the invariant distribution remains the same, and thus the empirical measure of can still be used to approximate . Specifically, the jump intensity or swapping intensity is of the Metropolis form , where
and is a constant. Note that the calculation of does not require the knowledge of the normalization constant. A straightforward calculation shows that is the stationary density for the resulting process for all values of [see (LABEL:eqn:echeverria)]. We refer to as the “swap rate,” and note that as increases, the swaps become more frequent.
The intuition behind parallel tempering is that the higher temperature component, being driven by a Wiener process with greater volatility, will move more easily between the different parts of the state space. This “ease-of-movement” is transferred to the lower temperature component via the swapping mechanism so that it is less likely to be trapped in the deep local minima of the energy potential. This, in turn, is expected to lead to more rapid convergence of the empirical measure to the invariant distribution of the low temperature component. There is an obvious extension to more than two temperatures.
Although this procedure is remarkably simple and needs little detailed information for implementation, relatively little is known regarding theoretical properties. A number of papers discuss the efficiency and optimal design of parallel tempering [8, 16, 15]. However, most of these discussions are based on heuristics and empirical evidence. In general, some care is required to construct schemes that are effective. For example, it can happen that for a given energy potential function and swapping rate, the probability for swapping may be so low that it does not significantly improve performance.
There are two aims to the current paper. The first is to introduce a performance criteria for Monte Carlo schemes of this general kind that differs in some interesting ways from traditional criteria, such as the magnitude of the sub-dominant eigenvalue of a related operator [11, 25]. More precisely, we use the theory of large deviations to define a “rate of convergence” for the empirical measure. The key observation here is that this rate, and hence the performance of parallel tempering, is monotonically increasing with respect to the swap rate . Traditional wisdom in the application of parallel tempering has been that one should not attempt to swap too frequently. While an obvious reason is that the computational cost for swapping attempts might become a burden, it was also argued that frequent swapping would result in poor sampling. For a discussion on prior approaches to the question of how to set the swapping rate and an argument in favor of frequent swapping, see [20, 19].
The use of this large deviation criteria and the resulting monotonicity with respect to directly suggest the second aim, which is to study parallel tempering in the limit as . Note that the computational cost due just to the swapping will increase without bound, even on bounded time intervals, when . However, we will construct an alternative scheme, which uses different process dynamics and a weighted empirical measure. Because this process no longer swaps particle positions, it and the weighted empirical measure have a well-defined limit as which we call infinite swapping. In effect, the swapping is achieved through the proper choice of weights and state dependent diffusion coefficients. This is done for the case of both continuous and discrete time processes with multiple temperatures.
An outline of the paper is as follows. In the next section the swapping model in continuous time is introduced and the rate of convergence, as measured by a large deviations rate function, is defined. The alternative scheme which is equivalent to swapping but which has a well defined limit is introduced, and its limit as is identified. The following section considers the analogous limit model for more than two temperatures, and discusses certain practical difficulties associated with direct implementation when the number of temperatures is not small. The continuous time model is used for illustration because both the large deviation rate and the weak limit of the appropriately redefined swapping model take a simpler form than those of discrete time models. However, the discrete time model is what is actually implemented in practice. To bridge the gap between continuous time diffusion models and discrete time models, in Section 4 we discuss the idea of infinite swapping for continuous time Markov jump processes and prove that the key properties demonstrated for diffusion models hold here as well. We also state a uniform (in the swapping parameter) large deviation principle. The discrete time form actually used in numerical implementation is presented in Section 5. Section 6 returns to the issue of implementation when the number of temperatures is not small. In particular, we resolve the difficulty of direct implementation of the infinite swapping models via approximation by what we call partial infinite swapping models. Section 7 gives numerical examples, and an appendix gives the proof of the uniform large deviation principle.
2 Diffusion models with two temperatures
Although the implementation of parallel tempering uses a discrete time model, the motivation for the infinite swapping limit is best illustrated in the setting where the state process is a continuous time diffusion process. It is in this case that the large deviation rate function, as well as the construction of a process that is distributionally equivalent to the infinite swapping limit, is simplest. In order to minimize the notational overhead, we discuss in detail the two temperature case. The extension to models with multiple temperatures is obvious and will be stated in the next section.
2.1 Model setup
Let denote the Markov jump diffusion process of parallel tempering with swap rate . That is, between swaps (or jumps), the process follows the diffusion dynamics (LABEL:eq:two_temp_diff). Jumps occur according to the state dependent intensity function . At a jump time , the particles swap locations, that is, . Hence for a smooth functions the infinitesimal generator of the process is given by
where and denote the gradient and the Hessian matrix with respect to , respectively, and tr denotes trace. Throughout the paper we also assume the growth condition
This condition not only ensures the existence and uniqueness of the invariant distribution, but also enforces the exponential tightness needed for the large deviation principle for the empirical measures.
Recall the definition of in (LABEL:eq:two_temp_dens) and let be the corresponding Gibbs probability distribution, that is,
Straightforward calculations show that for any smooth function which vanishes at infinity
Since the condition (LABEL:eq:growth_rate) implies that as , by the Echeverria’s Theorem [5, Theorem 4.9.17], is the unique invariant probability distribution of the process .
2.2 Rate of convergence by large deviations
It follows from the previous discussion and the ergodic theorem  that, for a fixed burn-in time , with probability one
as . For notational simplicity we assume without loss of generality that from now on. A basic question of interest is how rapid is this convergence, and how does it depend on the swap rate ? In particular, what is the rate of convergence of the lower temperature marginal?
We note that standard measures one might use for the rate of convergence, such as the second eigenvalue of the associated operator, are not necessarily appropriate here. They only provide indirect information on the convergence properties of the empirical measure, which is the quantity of interest in the Monte Carlo approximation. Such measures properly characterize the convergence rate of the transition probability
as . However, they neglect the time averaging effect of the empirical measure, an effect that is not present with the transition probability. In fact, it is easy to construct examples such as nearly periodic Markov chains for which the second eigenvalue suggests a slow convergence when in fact the empirical measure converges quickly .
The magnitude of is used to measure the statistical efficiency of the algorithm. The asymptotic variance is closely related to the spectral properties of the underlying probability transition kernel [7, 17]. However, as with the second eigenvalue the usefulness of this criterion for evaluating performance of the empirical measure is not clear.
In this paper, we use the large deviation rate function to characterize the rate of convergence of a sequence of random probability measures. To be more precise, let be a Polish space, that is, a complete and separable metric space. Denote by the space of all probability measures on . We equip with the topology of weak convergence, though one can often use the stronger -topology . Under the weak topology, is metrizable and itself a Polish space. Note that the empirical measure is a random probability measure, that is, a random variable taking values in the space .
A sequence of random probability measures is said to satisfy a large deviation principle (LDP) with rate function , if for all open sets
for all closed sets
and if is compact in for all .
For our problem all rate functions encountered will vanish only at the unique invariant distribution , and hence give information on the rate of convergence of . A larger rate function will indicate faster convergence, though this is only an asymptotic statement valid for sufficiently large .
2.3 Explicit form of rate function
The large deviation theory for the empirical measure of a Markov process was first studied in . Besides the Feller property, which will hold for all processes we consider, the validity of the LDP depends on two types of conditions. One is a so-called transitivity condition, which requires that there are times and such that for any ,
where indicates that the measure in on the left is absolutely continuous with respect to the measure on the right. For the jump diffusion process we consider here, this condition holds automatically since is bounded on bounded sets, is bounded, and the diffusion coefficients are uniformly non-degenerate. The second type of condition is one that enforces a strong form of tightness, such as (LABEL:eq:growth_rate).
Under condition (LABEL:eq:growth_rate), the LDP holds for and the rate function, denoted by , takes a fairly explicit form because the process is in continuous time and reversible [2, 13]. We will state the following result and omit the largely straightforward calculation since its role here is motivational. [A uniform LDP for the analogous jump Markov process will be stated in Section 4, and its proof is given in the appendix.]
Let be a probability measure on with smooth density. Define . Then can be expressed as
and where for is familiar from the large deviation theory for jump processes.
The key observation is that the rate function is affine in the swapping rate , with the rate function in the case of no swapping. Furthermore, with equality if and only if for -a.e. . This form of the rate function, and in particular its monotonicity in , motivates the study of the infinite swapping limit as .
The limit of the rate function satisfies
Hence for to be finite it is necessary that put exactly the same relative weight as on the points and . Note that if a process could be constructed with as its rate function, then with the large deviation rate as our criteria such a process improves on parallel tempering with finite swapping rate in exactly those situations where parallel tempering improves on the process with no swapping at all.
2.4 Infinite swapping limit
From a practical perspective, it may appear that there are limitations on how much benefit one obtains by letting . When implemented in discrete time, the overall jump intensity corresponds to the generation of roughly independent random variables that are uniform on for each corresponding unit of continuous time, and based on each uniform variable a comparison is made to decide whether or not to make the swap. Hence even for fixed and finite , the computations required to simulate a single trajectory scale like as . Thus it is of interest if one can gain the benefit of the higher swapping rate without all the computational burden. This turns out to be possible, but requires that we view the prelimit processes in a different way.
It is clear that the processes are not tight as , since the number of discontinuities of size will grow without bound in any time interval of positive length. In order to obtain a limit, we consider alternative processes defined by
where is a jump process that switches from state to state with intensity and from state to state with intensity . Compared to conventional parallel tempering, the processes swap the diffusion coefficients at the jump times rather than the physical locations of two particles with constant diffusion coefficients. For this reason, we refer to the solution to (LABEL:eq:prelimit_swap) as the temperature swapped process, in order to distinguish it from the particle swapped process . We illustrate these processes in Figure LABEL:fig:2_temp. Note that the solid line and the dotted line represent and , respectively. These processes have more and more frequent jumps of size as . In contrast, the process have varying diffusion coefficient. The figure attempts to also suggest features of the discrete time setting, with both successful and failed swap attempts.
Clearly the empirical measure of does not provide an approximation to . Instead, we should shift attention between and depending on the value of . Indeed, the random probability measures
have the same distribution as
and hence converge to at the same rate. However, these processes and measures have well defined limits in distribution as . More precisely, we have the following result. For the proof see . Related (but more complex) calculations are needed to prove the uniform large deviation result given in Theorem LABEL:thm:LDP_MP.
Assume that is locally Lipschitz continuous. Then for each the sequence converges in distribution to as , where is the unique strong solution to
The existence and form of the limit are due to the time scale separation between the fast process and the slow process. To give an intuitive explanation of the limit dynamics, consider the prelimit processes (LABEL:eq:prelimit_swap). Suppose that on a small time interval, the value of the slow process does not vary much, say . Given the dynamics of the binary process , it is easy to verify that as tends to infinity the fractions of time that and are and , respectively. This leads to the limit dynamics (LABEL:eq:limit_swap). When mapped back to the particle swapped process, and account for the fraction of time that and , respectively, which naturally leads to the limit weighted empirical measure (LABEL:eq:weighted_emp).
The weights and do not depend on the unknown normalization constant, and in fact
The following properties of the limit system are worth noting.
Instantaneous equilibration of multiple locations. Observe that the lower temperature component of this modified “empirical measure,” i.e., the first marginal, uses contributions from both components at all times, corrected according to the weights. The form of the weights in (LABEL:eq:weighted_emp) guarantees that the contributions to from locations and are at any time perfectly balanced according to the invariant distribution on product space.
Symmetry and invariant distribution. While the marginals of play very different roles, the dynamics of and are actually symmetric. Using the Echeverria’s Theorem [5, Theorem 4.9.17], it can be shown that the unique invariant distribution of the process has the density
It then follows from the ergodic theorem that w.p.1 as . This is hardly surprising, since is the invariant distribution for the prelimit processes .
Escape from local minima. Finally it is worth commenting on the behavior of the diffusion coefficients as a function of the relative positions of and on an energy landscape. Recall that . Suppose that is near the bottom of a local minimum (which for simplicity we set to be zero), while is at a higher energy level, perhaps within the same local minimum. Then
Thus to some degree the dynamics look like
i.e., the particle higher up on the energy landscape is given the greater diffusion coefficient, while the one near the bottom of the well is automatically given the lower coefficient. Hence the particle which is already closer to escaping from the well is automatically given the greater noise (within the confines of ). Recalling the role of the higher temperature particle is to more assiduously explore the landscape in parallel tempering, this is an interesting property.
One can apply results from  to show that the empirical measure of the infinite swapping limit satisfies a large deviation principle with rate function as defined in Remark LABEL:remark:limit-rate. However, to justify the claim that the infinite swapping model is truly superior to the finite swapping variant (note that for any finite ), one should establish a uniform large deviation principle, which would show that is the correct rate function for any sequence such that as . We omit the proof here, since in Theorem LABEL:thm:LDP_MP the analogous result will be proved in the setting of continuous time jump Markov processes.
3 Diffusion models with multiple temperatures
In practice parallel tempering uses swaps between more than two temperatures. A key reason is that if the gap between the temperatures is too large then the probability of a successful swap under the [discrete time version of the] Metropolis rule (LABEL:eq:swap_rate) is far too low for the exchange of information to be effective. A natural generalization is to introduce, to the degree that computational feasibility is maintained, a ladder of higher temperatures, and then attempt pairwise swaps between particles. There are a variety of schemes used to select which pair to attempt the swap, including deterministic and randomized rules for selecting only adjacent temperatures or arbitrary pair of temperatures. However, if one were to replace any of these particle swapped processes with its equivalent temperature swapped analogue and consider the infinite swapping limit, one would get the same system of process dynamics and weighted empirical measures which we now describe.
Suppose that besides the lowest temperature (in many cases the temperature of principal interest), we introduce the collection of higher temperatures
Let be a generic point in the state space of the process and define a product Gibbs distribution with the density
The limit of the temperature swapped processes with temperatures takes the form
To define these weights it is convenient to introduce some new notation.
Let be the collection of all bijective mappings from to itself. has elements, each of which corresponds to a unique permutation of the set , and forms a group with the group action defined by composition. Let denote the inverse of . Furthermore, for each and every , define .
At the level of the prelimit particle swapped process, we interpret the permutation to correspond to event that the particles at location are swapped to the new location . Under the temperature swapped process, this corresponds to the event that particles initially assigned temperatures in the order have now been assigned the temperatures .
The identification of the infinite swapping limit of the temperature swapped processes is very similar to that of the two temperature model in the previous section. By exploiting the time-scale separation, one can assume that in a small time interval the only motion is due to temperature swapping and the motion due to diffusion is negligible. Hence the fraction of time that the permutation is in effect should again be proportional to the relative weight assigned by the invariant distribution to , that is,
then the fraction of time that the permutation is in effect is . Note that for any ,
Going back to the definition of the weights , , it is clear that they represent the limit proportion of time that the -th particle is assigned the temperature and hence will satisfy
Likewise the replacement for the empirical measure, accounting as it does for mapping the temperature swapped process back to the particle swapped process, is given by
The instantaneous equilibration property still holds for the infinite swapping system with multiple temperatures. That is, at any time and given a current position , the weighted empirical measure has contributions from all locations of the form , balanced exactly according to their relative contributions from the invariant density . The dynamics of are again symmetric, and the density of the invariant distribution at point is
The infinite swapping process described above allows the most effective communication between all temperatures, and is the “best” in the sense that it leads to the largest large deviation rate function and hence the fastest rate of convergence. However, computation of the coefficients becomes very demanding for even moderate values of , since one needs to evaluate terms from all possible permutations. In Section LABEL:sec:discrete we discuss a more tractable and easily implementable family of schemes which are essentially approximations to the infinite swapping model presented in the current section and have very similar performance. We call the current model the full infinite swapping model since it uses the whole permutation group , as opposed to the partial infinite swapping model in Section LABEL:sec:discrete where only subgroups of are used.
4 Infinite swapping for jump Markov processes
The continuous time diffusion model is a convenient vehicle to convey the main idea of infinite swapping. In practice, however, algorithms are implemented in discrete time. In this section we discuss continuous time pure jump Markov processes and the associated infinite swapping limit. The purpose of this intermediate step is to serve as a bridge between the diffusion and discrete time Markov chain models. These two types of processes have some subtle differences regarding the infinite swapping limit which is best illustrated through the continuous time jump Markov model.
In this section we discuss the two-temperature model, and omit the completely analogous multiple-temperature counterpart. We will not refer to temperatures and to distinguish dynamics. Instead, let and be two probability transition kernels on given . One can think of as the dynamics under temperature for . We assume that for each the stationary distribution associated with the transition kernel admits the density in order to be consistent with the diffusion models, and define
We assume that the kernels are Feller and have a density that is uniformly bounded with respect to Lebesgue measure. These conditions would always be satisfied in practice. Finally, we assume that the detailed balance or reversibility condition holds, that is,
4.1 Model setup
In the absence of swapping [i.e., swapping rate ], the dynamics of the system are as follows. Let denote a continuous time process taking values in . The probability transition kernel associated with the embedded Markov chain, denoted by , is
Without loss of generality, we assume that the jump times occur according to a Poisson process with rate one. In other words, let be a sequence of independent and identically distributed (iid) exponential random variables with rate one that are independent of . Then
The infinitesimal generator of is such that for a given smooth function ,
Owing to the detailed balance condition (LABEL:eqn:detailed_balance), the operator is self-adjoint. Using arguments similar to but simpler than those used to prove the uniform LDP in Theorem LABEL:thm:LDP_MP, the large deviations rate function associated with the occupation measure
can be explicitly identified: for any probability measure on with and ,
and is extended to all of by lower semicontinuous regularization.
4.2 Finite swapping model
Denote by the state process of the finite swapping model with swapping rate , and let be the embedded Markov chain. The probability transition kernel for is
where is defined as in (LABEL:eq:swap_rate). Furthermore, let be a sequence of iid exponential random variables with rate and define
In other words, the jumps occur according to a Poisson process with rate . Note that there are two types of jumps. At any jump time, with probability it will be a jump according to the underlying probability transition kernels and . With probability it will be an attempted swap which will succeed with the probability determined by . As grows, the swap attempts become more and more frequent. However, the time between two consecutive jumps of the first type will have the same distribution as
where is a geometric random variable with parameter . It is easy to argue that for any the distribution of is exponential with rate one. This observation will be useful when we derive the infinite swapping limit.
The infinitesimal generator of is such that for any smooth function on
It is not difficult to check that the stationary distribution of remains and that is self-adjoint. As before, the large deviation rate function for the occupation measure
can be explicitly identified. Indeed, for any probability measure on with and