Adaptive parallel tempering algorithm

Abstract.

Parallel tempering is a generic Markov chain Monte Carlo sampling method which allows good mixing with multimodal target distributions, where conventional Metropolis-Hastings algorithms often fail. The mixing properties of the sampler depend strongly on the choice of tuning parameters, such as the temperature schedule and the proposal distribution used for local exploration. We propose an adaptive algorithm which tunes both the temperature schedule and the parameters of the random-walk Metropolis kernel automatically. We prove the convergence of the adaptation and a strong law of large numbers for the algorithm. We illustrate the performance of our method with examples. Our empirical findings indicate that the algorithm can cope well with different kinds of scenarios without prior tuning.


1. Introduction

Markov chain Monte Carlo (MCMC) is a generic method to approximate an integral of the form

$I \stackrel{\mathrm{def}}{=} \int_{\mathbb{R}^d} f(x)\,\pi(x)\,\mathrm{d}x,$

where $\pi$ is a probability density function, which can be evaluated point-wise up to a normalising constant. Such integrals occur frequently when computing Bayesian posterior expectations [e.g., 24, 11].

The random-walk Metropolis algorithm [21] often works well, provided the target density is, roughly speaking, sufficiently close to unimodal. The efficiency of the Metropolis algorithm can be optimised by a suitable choice of the proposal distribution. The proposal distribution can be chosen automatically by several adaptive MCMC algorithms; see [12, 4, 26, 2] and references therein.

When $\pi$ has multiple well-separated modes, random-walk based methods tend to get stuck in a single mode for long periods of time, which can lead to false convergence and severely erroneous results. Using a tailored Metropolis-Hastings algorithm can help, but in many cases finding a good proposal distribution is difficult. Tempering of $\pi$, that is, considering auxiliary distributions with densities proportional to $\pi^{\beta}$ with $\beta \in (0,1)$, often provides better mixing across the modes [30, 20, 13, 34]. We focus here particularly on the parallel tempering algorithm, which is also known as replica exchange Monte Carlo and Metropolis-coupled Markov chain Monte Carlo.

The tempering approach is particularly tempting in settings where $\pi$ admits a physical interpretation, and there is good intuition on how to choose the temperature schedule for the algorithm. In general, choosing the temperature schedule is a non-trivial task, but there are generic guidelines for temperature selection, based on both empirical findings and theoretical analysis. A first rule of thumb suggests that the temperature progression should be (approximately) geometric; see, e.g., [16]. Kone and Kofke [17] also linked the choice of the temperature schedule to the mean acceptance rate of the swaps; this has been further analysed by Atchadé, Roberts and Rosenthal [5]; see also [27].

Our temperature adaptation is based on the latter approach: we try to optimise the mean acceptance rate of the swaps between the chains at adjacent temperatures. Our scheme has similarities with that proposed by Atchadé, Roberts and Rosenthal [5]. The key difference in our method is that we propose to adapt continuously during the simulation. We show that the temperature adaptation converges, and that the point of convergence is unique under mild and natural conditions on the target distribution.

The local exploration in our approach relies on the random-walk Metropolis algorithm. The proposal distribution, or more precisely the scale/shape parameter of the proposal distribution, can be adapted using several existing techniques, such as the covariance estimator [12] augmented with adaptive scaling pursuing a given mean acceptance rate [2, 3, 26, 4], which is motivated by certain asymptotic results [25, 28]. It is also possible to use a robust shape estimate which enforces a given mean acceptance rate [33].

We start by describing the proposed algorithm in Section 2. Theoretical results on the convergence of the adaptation and the ergodic averages are given next in Section 3. In Section 4, we illustrate the performance of the algorithm with examples. The proofs of the theoretical results are postponed to Section 5.

2. Algorithm

2.1. Parallel tempering algorithm

The parallel tempering algorithm defines a Markov chain over the product space $\mathsf{X}^L$, where

(1) $\mathbf{X}_n = \big(X_n^{(1)}, \ldots, X_n^{(L)}\big) \in \mathsf{X}^L.$

Each of the chains $(X_n^{(\ell)})_{n \ge 0}$ targets a ‘tempered’ version of the target distribution $\pi$. Denote by $\beta^{(\ell)}$ the inverse temperatures, which are such that $1 = \beta^{(1)} > \beta^{(2)} > \cdots > \beta^{(L)} > 0$, and by $Z^{(\ell)}$ the normalising constant

(2) $Z^{(\ell)} \stackrel{\mathrm{def}}{=} \int_{\mathsf{X}} \pi^{\beta^{(\ell)}}(x)\,\mathrm{d}x,$

which is assumed to be finite. The parallel tempering algorithm is constructed so that the Markov chain is reversible with respect to the product density

(3) $\bar{\pi}(\mathbf{x}) \stackrel{\mathrm{def}}{=} \prod_{\ell=1}^{L} \frac{\pi^{\beta^{(\ell)}}\big(x^{(\ell)}\big)}{Z^{(\ell)}}$

over the product space $\mathsf{X}^L$.

Each time-step may be decomposed into two successive moves: the swap move and the propagation (or update) move; for the latter, we consider only random-walk Metropolis moves.

We use the following notation to distinguish the state of the algorithm after the swap step (denoted $\tilde{\mathbf{X}}_n$) and after the random-walk step, or equivalently after a complete step (denoted $\mathbf{X}_n$). The state is then updated according to

(4) $\tilde{\mathbf{X}}_n \sim S(\mathbf{X}_{n-1}, \cdot), \qquad \mathbf{X}_n \sim M(\tilde{\mathbf{X}}_n, \cdot),$

where the two kernels $M$ and $S$ are respectively defined as follows.

  • $M$ denotes the tensor product kernel on the product space $\mathsf{X}^L$,

    (5) $M\big(\mathbf{x}, A^{(1)} \times \cdots \times A^{(L)}\big) \stackrel{\mathrm{def}}{=} \prod_{\ell=1}^{L} M^{(\ell)}\big(x^{(\ell)}, A^{(\ell)}\big),$

    where each $M^{(\ell)}$ is a random-walk Metropolis kernel targeting $\pi^{\beta^{(\ell)}}$ with increment distribution $q_{\Sigma^{(\ell)}}$, where $q_\Sigma$ is the density of a multivariate Gaussian with zero mean and covariance $\Sigma$,

    (6) $M^{(\ell)}(x, A) \stackrel{\mathrm{def}}{=} \int_A \alpha^{(\ell)}(x, x + z)\, q_{\Sigma^{(\ell)}}(z)\,\mathrm{d}z + \mathbb{1}_A(x) \int \big[1 - \alpha^{(\ell)}(x, x + z)\big]\, q_{\Sigma^{(\ell)}}(z)\,\mathrm{d}z,$

    where

    (7) $\alpha^{(\ell)}(x, y) \stackrel{\mathrm{def}}{=} 1 \wedge \frac{\pi^{\beta^{(\ell)}}(y)}{\pi^{\beta^{(\ell)}}(x)}.$

    In practical terms, applying $M$ means that one applies a random-walk Metropolis step separately to each of the $L$ chains.

  • $S$ denotes the Markov kernel of the swap steps, targeting the product distribution $\bar{\pi}$,

    (8) $S(\mathbf{x}, A) \stackrel{\mathrm{def}}{=} \frac{1}{L-1} \sum_{\ell=1}^{L-1} \Big[\varpi^{(\ell)}(\mathbf{x})\, \mathbb{1}_A\big(\mathbf{x}^{(\ell \leftrightarrow \ell+1)}\big) + \big(1 - \varpi^{(\ell)}(\mathbf{x})\big)\, \mathbb{1}_A(\mathbf{x})\Big],$

    where $\varpi^{(\ell)}(\mathbf{x})$ is the probability of accepting a swap between levels $\ell$ and $\ell+1$, which is given by

    (9) $\varpi^{(\ell)}(\mathbf{x}) \stackrel{\mathrm{def}}{=} 1 \wedge \frac{\pi^{\beta^{(\ell)}}\big(x^{(\ell+1)}\big)\, \pi^{\beta^{(\ell+1)}}\big(x^{(\ell)}\big)}{\pi^{\beta^{(\ell)}}\big(x^{(\ell)}\big)\, \pi^{\beta^{(\ell+1)}}\big(x^{(\ell+1)}\big)},$

    and

    (10) $\mathbf{x}^{(\ell \leftrightarrow \ell+1)} \stackrel{\mathrm{def}}{=} \big(x^{(1)}, \ldots, x^{(\ell-1)}, x^{(\ell+1)}, x^{(\ell)}, x^{(\ell+2)}, \ldots, x^{(L)}\big).$

    The swap step defined above means choosing a random index $\ell$ uniformly on $\{1, \ldots, L-1\}$, proposing to swap the adjacent states $x^{(\ell)}$ and $x^{(\ell+1)}$, and accepting this swap with the probability given in (9).
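To fix ideas, the following minimal NumPy sketch implements one complete step (4) of the (non-adaptive) sampler: a single randomly chosen swap, followed by a random-walk Metropolis update at every level. All names are ours, and the target enters only through an unnormalised log-density, which we assume given.

```python
import numpy as np

def pt_step(x, log_pi, betas, chols, rng):
    """One parallel tempering step: a swap move followed by a
    random-walk Metropolis move at every temperature level.

    x      : (L, d) array of current states, one row per level
    log_pi : callable, unnormalised log-density of the target pi
    betas  : (L,) array of inverse temperatures, betas[0] == 1
    chols  : list of L Cholesky factors of the proposal covariances
    rng    : numpy.random.Generator
    """
    L, d = x.shape
    logp = np.array([log_pi(x[l]) for l in range(L)])

    # Swap move: one uniformly chosen adjacent pair, cf. (8)-(10).
    l = rng.integers(L - 1)
    # Logarithm of the acceptance ratio (9).
    log_acc = (betas[l] - betas[l + 1]) * (logp[l + 1] - logp[l])
    if np.log(rng.uniform()) < log_acc:
        x[[l, l + 1]] = x[[l + 1, l]]
        logp[[l, l + 1]] = logp[[l + 1, l]]

    # Propagation move: independent RWM step at each level, cf. (5)-(7).
    for l in range(L):
        y = x[l] + chols[l] @ rng.standard_normal(d)
        logp_y = log_pi(y)
        if np.log(rng.uniform()) < betas[l] * (logp_y - logp[l]):
            x[l], logp[l] = y, logp_y
    return x
```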

2.2. Adaptive parallel tempering algorithm

In the adaptive version of the parallel tempering algorithm, the temperature parameters are continuously updated along the run of the algorithm. We denote the sequence of inverse temperatures

(11) $\beta_n = \big(\beta_n^{(1)}, \ldots, \beta_n^{(L)}\big),$

which are parameterised by the vector-valued process

(12) $\rho_n = \big(\rho_n^{(1)}, \ldots, \rho_n^{(L-1)}\big) \in \mathbb{R}^{L-1}$

through $\beta_n^{(1)} \equiv 1$ and $\beta_n^{(\ell)} = \beta^{(\ell)}(\rho_n)$ for $\ell = 2, \ldots, L$ with

(13) $\beta^{(\ell)}(\rho) \stackrel{\mathrm{def}}{=} \Big(1 + \sum_{k=1}^{\ell-1} e^{\rho^{(k)}}\Big)^{-1},$

so that the consecutive temperatures $1/\beta^{(\ell)}$ differ by $e^{\rho^{(\ell)}} > 0$.

Because the inverse temperatures are adapted at each iteration, the target distribution of the chain changes from step to step as well. Our adaptation of the temperatures is performed using the following stochastic approximation procedure:

(14) $\rho_{n+1}^{(\ell)} = \Pi_\rho\Big[\rho_n^{(\ell)} + \gamma_{n+1}^{(1)}\big(\varpi_{\rho_n}^{(\ell)}(\mathbf{X}_n) - \alpha^*\big)\Big], \qquad \ell = 1, \ldots, L-1,$

where $\beta^{(\ell)}(\rho)$ is defined in (13) and $\Pi_\rho$ is the projection onto the constraint set $\mathcal{K}_\rho$, which will be discussed further in Section 2.4. Moreover, $\varpi_\rho^{(\ell)}$ stands for the swap acceptance probability (9) evaluated at the inverse temperatures $\beta(\rho)$,

(15) $\varpi_\rho^{(\ell)}(\mathbf{x}) \stackrel{\mathrm{def}}{=} 1 \wedge r_\rho^{(\ell)}(\mathbf{x}),$
(16) $r_\rho^{(\ell)}(\mathbf{x}) \stackrel{\mathrm{def}}{=} \frac{\pi^{\beta^{(\ell)}(\rho)}\big(x^{(\ell+1)}\big)\, \pi^{\beta^{(\ell+1)}(\rho)}\big(x^{(\ell)}\big)}{\pi^{\beta^{(\ell)}(\rho)}\big(x^{(\ell)}\big)\, \pi^{\beta^{(\ell+1)}(\rho)}\big(x^{(\ell+1)}\big)}.$

We will show in Section 3 that the algorithm is designed in such a way that the inverse temperatures converge to a value for which the mean probability of accepting a swap move between any two adjacent-temperature chains is constant and equal to $\alpha^*$.
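To make the recursion concrete, here is a sketch of the update (14) under the parameterisation (13); the projection bounds and helper names are illustrative, and we recompute the swap acceptance probabilities (9) for all adjacent pairs at the current state, which is one of several possible variants.

```python
import numpy as np

def adapt_temperatures(rho, x, log_pi, step, target=0.234, lim=50.0):
    """Stochastic-approximation update (14) of the temperature
    parameters rho (length L-1); the schedule follows (13)."""
    # (13): beta[0] = 1, beta[l] = 1 / (1 + sum_{k < l} exp(rho[k])).
    betas = 1.0 / (1.0 + np.cumsum(np.concatenate(([0.0], np.exp(rho)))))
    logp = np.array([log_pi(xi) for xi in x])
    for l in range(len(rho)):
        # Swap acceptance probability (9) between levels l and l+1.
        acc = min(1.0, np.exp((betas[l] - betas[l + 1])
                              * (logp[l + 1] - logp[l])))
        # Robbins-Monro step towards the target rate, projected
        # onto [-lim, lim] for stability, cf. (14).
        rho[l] = np.clip(rho[l] + step * (acc - target), -lim, lim)
    return rho
```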

We will also adapt the random-walk proposal distribution at each level; there are several possible algorithms for performing such a task. In the theoretical part, for simplicity, we will consider only the seminal adaptive Metropolis algorithm [12] augmented with scaling adaptation [e.g. 26, 3, 2]. In this algorithm, we estimate the covariance matrix of the target distribution at each temperature and rescale it to control the acceptance rate at each level in stationarity.

Denote by $\mathcal{C}_d^+$ the set of $d \times d$ positive definite matrices. For $\Sigma \in \mathcal{C}_d^+$, we denote by $\lambda_{\min}(\Sigma)$ and $\lambda_{\max}(\Sigma)$ the smallest and the largest eigenvalues of $\Sigma$, respectively. For constants $0 < c_- \le c_+ < \infty$, define the convex subset

(17) $\mathcal{K}_\Sigma \stackrel{\mathrm{def}}{=} \big\{\Sigma \in \mathcal{C}_d^+ : c_- \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le c_+\big\}.$

The set $\mathcal{K}_\Sigma$ is a compact subset of the open cone of positive definite matrices.
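For symmetric matrices, the projection onto the set (17) (with respect to the Frobenius norm) amounts to clipping the spectrum; a small sketch, with illustrative eigenvalue bounds:

```python
import numpy as np

def project_covariance(sigma, lam_min=1e-6, lam_max=1e6):
    """Project a symmetric matrix onto the set (17) of positive
    definite matrices with eigenvalues in [lam_min, lam_max]."""
    w, v = np.linalg.eigh(sigma)          # spectral decomposition
    w = np.clip(w, lam_min, lam_max)      # clip the eigenvalues
    return (v * w) @ v.T                  # reassemble the matrix
```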

We denote by $\Sigma_n^{(\ell)}$ the current estimate of the covariance at level $\ell$, which is updated as follows:

(18) $\mu_{n+1}^{(\ell)} = \mu_n^{(\ell)} + \gamma_{n+1}^{(2)}\big(X_{n+1}^{(\ell)} - \mu_n^{(\ell)}\big)$
(19) $\Sigma_{n+1}^{(\ell)} = \Pi_\Sigma\Big[\Sigma_n^{(\ell)} + \gamma_{n+1}^{(2)}\Big(\big(X_{n+1}^{(\ell)} - \mu_n^{(\ell)}\big)\big(X_{n+1}^{(\ell)} - \mu_n^{(\ell)}\big)^T - \Sigma_n^{(\ell)}\Big)\Big],$

where $v^T$ denotes the transpose of the vector $v$ and $\Pi_\Sigma$ is the projection onto the set $\mathcal{K}_\Sigma$; see Section 2.4. The scaling parameter $\theta_n^{(\ell)}$ is updated so that the acceptance rate in stationarity converges to the target $\alpha^*$,

(20) $\theta_{n+1}^{(\ell)} = \Pi_\theta\Big[\theta_n^{(\ell)} + \gamma_{n+1}^{(3)}\big(\alpha^{(\ell)}\big(\tilde{X}_{n+1}^{(\ell)}, Y_{n+1}^{(\ell)}\big) - \alpha^*\big)\Big],$

where $\Pi_\theta$ is the projection onto a compact interval $\mathcal{K}_\theta$ (see Section 2.4), and $Y_{n+1}^{(\ell)}$ is the proposal at level $\ell$, assumed to be conditionally independent from the past draws and distributed according to a Gaussian with mean $\tilde{X}_{n+1}^{(\ell)}$ and covariance matrix given by

(21) $e^{\theta_n^{(\ell)}}\, \Sigma_n^{(\ell)}.$

In the sequel, we denote by $\mathbf{Y}_{n+1}$ the vector of proposed moves at time step $n+1$,

(22) $\mathbf{Y}_{n+1} \stackrel{\mathrm{def}}{=} \big(Y_{n+1}^{(1)}, \ldots, Y_{n+1}^{(L)}\big).$
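The updates (18)–(21) at a single level can be sketched as follows; the eigenvalue projection is the one given earlier, the scaling projection is a plain clipping, and the target rate and bounds are illustrative.

```python
import numpy as np

def adapt_proposal(mu, sigma, theta, x_new, acc_prob, step,
                   target=0.234, theta_lim=50.0):
    """One adaptive Metropolis update with acceptance-rate scaling,
    in the spirit of (18)-(21), for a single temperature level.

    mu, sigma : running mean and covariance estimates
    theta     : log-scaling parameter of the proposal
    x_new     : new state of the chain at this level
    acc_prob  : acceptance probability of the last proposed move
    """
    diff = x_new - mu
    mu = mu + step * diff                                      # (18)
    sigma = project_covariance(                                # (19)
        sigma + step * (np.outer(diff, diff) - sigma))
    theta = np.clip(theta + step * (acc_prob - target),        # (20)
                    -theta_lim, theta_lim)
    proposal_cov = np.exp(theta) * sigma                       # (21)
    return mu, sigma, theta, proposal_cov
```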

2.3. Alternate random-walk adaptation

In order to reduce the number of parameters in the adaptation, especially in higher dimensions, we propose to use a common covariance for all the temperatures, but still employ separate scalings. More specifically,

(23) $\mu_{n+1} = \mu_n + \gamma_{n+1}^{(2)}\, \frac{1}{L} \sum_{\ell=1}^{L} \big(X_{n+1}^{(\ell)} - \mu_n\big),$
(24) $\Sigma_{n+1} = \Pi_\Sigma\bigg[\Sigma_n + \gamma_{n+1}^{(2)}\bigg(\frac{1}{L} \sum_{\ell=1}^{L} \big(X_{n+1}^{(\ell)} - \mu_n\big)\big(X_{n+1}^{(\ell)} - \mu_n\big)^T - \Sigma_n\bigg)\bigg],$

and set $\Sigma_n^{(\ell)} \equiv \Sigma_n$ for all levels $\ell$.

Another possible implementation of the random-walk adaptation, the robust adaptive Metropolis (RAM) [33], is defined by a single dynamic adjusting the covariance parameter and attaining a given acceptance rate. Specifically, one recursively finds a lower-triangular matrix $S_{n+1}^{(\ell)}$ with positive diagonal satisfying

(25) $S_{n+1}^{(\ell)}\big(S_{n+1}^{(\ell)}\big)^T = S_n^{(\ell)}\bigg(I + \gamma_{n+1}\big(\alpha_n^{(\ell)} - \alpha^*\big)\, \frac{U_{n+1}^{(\ell)}\big(U_{n+1}^{(\ell)}\big)^T}{\big|U_{n+1}^{(\ell)}\big|^2}\bigg)\big(S_n^{(\ell)}\big)^T,$

where $U_{n+1}^{(\ell)} \stackrel{\mathrm{def}}{=} \big(S_n^{(\ell)}\big)^{-1}\big(Y_{n+1}^{(\ell)} - \tilde{X}_{n+1}^{(\ell)}\big)$ and $\alpha_n^{(\ell)}$ is the acceptance probability of the proposal $Y_{n+1}^{(\ell)}$, and let $\Sigma_n^{(\ell)} \stackrel{\mathrm{def}}{=} S_n^{(\ell)}\big(S_n^{(\ell)}\big)^T$.
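A sketch of the RAM recursion (25), maintaining the lower-triangular factor directly; the step size is assumed small enough that the bracketed factor stays positive definite.

```python
import numpy as np

def ram_update(S, u, acc_prob, step, target=0.234):
    """Robust adaptive Metropolis update, cf. (25).

    S        : lower-triangular factor; the proposal was x + S @ u
    u        : the standard Gaussian increment of the last proposal
    acc_prob : acceptance probability of that proposal
    """
    d = len(u)
    # Rank-one correction of S S^T in the direction of the move.
    factor = (np.eye(d)
              + step * (acc_prob - target) * np.outer(u, u) / (u @ u))
    new_cov = S @ factor @ S.T
    # Unique lower-triangular Cholesky factor with positive diagonal.
    return np.linalg.cholesky(new_cov)
```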

The potential benefit of using this estimate instead of (18)–(20) is that RAM finds, loosely speaking, a ‘local’ shape of the target distribution, which is often in practice close to a convex combination of the shapes of individual modes. In some situations, this proposal shape might allow better local exploration than the global covariance shape.

2.4. Implementation details

In the experiments, we use the desired acceptance rate $\alpha^* = 0.234$ suggested by theoretical results, both for the swap kernel [17, 5] and for the random-walk Metropolis kernel [25, 28]. We employ step size sequences of the form $\gamma_n^{(i)} = C_i\, n^{-\kappa_i}$ with constants $C_i > 0$ and $\kappa_i \in (1/2, 1]$. This is a common choice in the stochastic approximation literature.

The projections $\Pi_\rho$, $\Pi_\Sigma$ and $\Pi_\theta$ in (14), (19) and (20), respectively, are used to enforce the stability of the adaptation process, in order to simplify the theoretical analysis of the algorithm. We have not observed instability empirically, and believe that the algorithm would be stable without the projections; in fact, for the random-walk adaptation, there exist some stability results [29, 31, 32]. Therefore, we recommend setting the limits in the constraint sets as large as possible, within the limits of numerical accuracy.

It is possible to employ other strategies for proposing swaps of the tempered states. Specifically, it is possible to try more than one swap at each iteration, even go through all the temperatures, without changing the invariant distribution of the chain. We made some preliminary tests with other strategies, but the results were not promising, so we decided to keep the common approach of a single randomly chosen swap.

In the temperature adaptation, it is also possible to enforce a geometric progression and adapt only one parameter. More specifically, one can use $\beta^{(\ell)}(\rho) \stackrel{\mathrm{def}}{=} (1 + e^{\rho})^{-(\ell-1)}$ for all $\ell$ and perform the adaptation (14) to update the single parameter $\rho \in \mathbb{R}$, as sketched below. This strategy might induce more stable behaviour of the temperature parameter, especially when the number of levels is high. On the other hand, it can be dangerous, because the asymptotic acceptance probability across certain temperature levels can get low, inducing poor mixing.
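Under this one-parameter variant (with the schedule in the form reconstructed above), the whole ladder is computed from a single scalar:

```python
import numpy as np

def geometric_betas(rho, L):
    """Geometric inverse-temperature ladder driven by one parameter:
    beta^(l) = (1 + exp(rho)) ** -(l - 1), l = 1, ..., L."""
    return (1.0 + np.exp(rho)) ** -np.arange(L)
```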

We consider only Gaussian proposal distributions in the random-walk Metropolis kernels. It is also possible to employ other proposals; in fact, our theoretical results extend directly, for example, to multivariate Student proposal distributions.

We note that the adaptive parallel tempering algorithm can also be used in a block-wise manner, or in a Metropolis-within-Gibbs framework. More precisely, the adaptive random-walk chains can be run as Metropolis-within-Gibbs, and the state swapping can be done at the global level. This approach scales better with respect to the dimension in many situations. In particular, when the model is hierarchical, the structure of the model can allow significant computational savings. Finally, it is straightforward to extend the adaptive parallel tempering algorithm described above to general measure spaces. For the sake of exposition, we present the algorithm only in $\mathbb{R}^d$.

3. Theoretical results

3.1. Formal definitions and assumptions

Denote by $\mathbf{Y}_n$ the proposals of the random-walk Metropolis step. We define the following filtration:

(26) $\mathcal{F}_n \stackrel{\mathrm{def}}{=} \sigma\big(\mathbf{X}_0, \ldots, \mathbf{X}_n, \mathbf{Y}_1, \ldots, \mathbf{Y}_n\big).$

By construction, the covariance matrices $\Sigma_n^{(\ell)}$, the scaling parameters $\theta_n^{(\ell)}$ and the temperature parameters $\rho_n$ are adapted to the filtration $(\mathcal{F}_n)_{n \ge 0}$. With these notations and assumptions, the distribution of $\mathbf{X}_{n+1}$ given $\mathcal{F}_n$ depends only on $\mathbf{X}_n$ and on the current adaptation parameters. Therefore, denoting by $P_\zeta$ the parallel tempering kernel with parameter $\zeta \stackrel{\mathrm{def}}{=} (\rho, \theta, \Sigma)$, we get

(27) $\mathbb{E}\big[f(\mathbf{X}_{n+1}) \mid \mathcal{F}_n\big] = P_{\zeta_n} f(\mathbf{X}_n)$

for all $n \ge 0$ and all bounded measurable functions $f : \mathsf{X}^L \to \mathbb{R}$.

We will consider the following assumption on the target distribution, which ensures the geometric ergodicity of a random-walk Metropolis chain [1, 15]. Below, $|\cdot|$ applied to a vector (or a matrix) stands for the Euclidean norm.

  • (A3.1) The density $\pi$ is bounded, bounded away from zero on compact sets, and differentiable, such that

    (28) $\lim_{|x| \to \infty} \frac{x}{|x|} \cdot \nabla \log \pi(x) = -\infty$
    (29) $\limsup_{|x| \to \infty} \frac{x}{|x|} \cdot \frac{\nabla \pi(x)}{|\nabla \pi(x)|} < 0.$

In words, (A3.1) only requires that the target distribution is sufficiently regular and that its tails decay at a rate faster than exponential. We remark that the tempering approach is only well-defined when the powers $\pi^{\beta}$ are integrable for the exponents $\beta$ of interest; this is always the case under (A3.1).

3.2. Geometric ergodicity and continuity of parallel tempering kernels

We first state and prove that the parallel tempering algorithm is geometrically ergodic under (A3.1) . This result might be of independent interest, because geometric ergodicity is well known to imply central limit theorems.

We show that, under mild conditions, this kernel is $\varphi$-irreducible, strongly aperiodic, and $V$-uniformly ergodic, where the function $V$ is a sum of appropriately chosen negative powers of the target density. Specifically, for $s \in (0,1)$, consider the following drift function

(30) $V(\mathbf{x}) \stackrel{\mathrm{def}}{=} \sum_{\ell=1}^{L} V^{(\ell)}\big(x^{(\ell)}\big),$

where for $x \in \mathsf{X}$,

(31) $V^{(\ell)}(x) \stackrel{\mathrm{def}}{=} \bigg(\frac{\pi(x)}{\sup_{y \in \mathsf{X}} \pi(y)}\bigg)^{-s\beta^{(\ell)}}.$

For $\epsilon \in (0, 1]$, define the set

(32) $\Delta_\epsilon \stackrel{\mathrm{def}}{=} \big\{\beta \in (0,1]^L : 1 = \beta^{(1)} > \beta^{(2)} > \cdots > \beta^{(L)} \ge \epsilon\big\}.$

The $V$-norm of a function $f$ is defined as $\|f\|_V \stackrel{\mathrm{def}}{=} \sup_{\mathbf{x}} |f(\mathbf{x})| / V(\mathbf{x})$, and we denote the $V$-variation of a signed measure $\mu$ as $\|\mu\|_V \stackrel{\mathrm{def}}{=} \sup |\mu(f)|$, where the supremum is taken over all measurable functions $f$ with $\|f\|_V \le 1$.

Theorem 1.

Assume (A3.1). Let $s \in (0,1)$ and $\epsilon \in (0,1]$. Then there exist $\lambda \in (0,1)$ and $b < \infty$ such that, for all $\beta \in \Delta_\epsilon$, $\Sigma^{(1)}, \ldots, \Sigma^{(L)} \in \mathcal{K}_\Sigma$ and $n \ge 0$,

(33) $\sup_{\mathbf{x} \in \mathsf{X}^L} \frac{\big\|P_{\beta,\Sigma}^n(\mathbf{x}, \cdot) - \bar{\pi}_\beta\big\|_V}{V(\mathbf{x})} \le b\,\lambda^n.$

Geometric ergodicity in turn implies the existence of a solution to the Poisson equation, and also provides bounds on the growth of this solution [22, Chapter 17].

Corollary.

Assume (A3.1). Let $s \in (0,1)$ and $\epsilon \in (0,1]$. For any measurable function $f$ with $\|f\|_{V^\alpha} < \infty$ for some $\alpha \in (0,1]$, there exists a unique (up to an additive constant) solution $g$ of the Poisson equation

(34) $g - P_{\beta,\Sigma}\, g = f - \bar{\pi}_\beta(f).$

This solution is denoted $\hat{f}_{\beta,\Sigma}$. In addition, there exists a constant $c < \infty$ such that

(35) $\big\|\hat{f}_{\beta,\Sigma}\big\|_{V^\alpha} \le c\, \|f\|_{V^\alpha}.$

We will next establish that the parallel tempering kernel is locally Lipschitz continuous. For any pair of parameters $\zeta = (\beta, \Sigma)$ and $\zeta' = (\beta', \Sigma')$, denote by $\|P_\zeta - P_{\zeta'}\|_V$ the $V$-variation of the kernels $P_\zeta$ and $P_{\zeta'}$,

(36) $\|P_\zeta - P_{\zeta'}\|_V \stackrel{\mathrm{def}}{=} \sup_{\mathbf{x} \in \mathsf{X}^L} \frac{\big\|P_\zeta(\mathbf{x}, \cdot) - P_{\zeta'}(\mathbf{x}, \cdot)\big\|_V}{V(\mathbf{x})}.$

For $\epsilon \in (0,1]$, define the set of admissible parameters

(37) $\mathcal{Y}_\epsilon \stackrel{\mathrm{def}}{=} \Delta_\epsilon \times \mathcal{K}_\Sigma^L.$
Theorem 2.

Assume (A3.1). Let $s \in (0,1)$, $\alpha \in (0,1]$ and $\epsilon \in (0,1]$. Then there exists $c < \infty$ such that, for any $\zeta \in \mathcal{Y}_\epsilon$ and any $\zeta' \in \mathcal{Y}_\epsilon$, it holds that $\|P_\zeta - P_{\zeta'}\|_{V^\alpha} \le c\,|\zeta - \zeta'|$.

3.3. Strong law of large numbers

We can state an ergodicity result for the adaptive parallel tempering algorithm, given that the step size sequences satisfy the following natural condition.

  • (A3.3) Assume that the step sizes $\big(\gamma_n^{(i)}\big)_{n \ge 1}$, $i = 1, 2, 3$, defined in (14), (18), and (20), are non-negative and satisfy the following conditions:

    1. For $i = 1, 2, 3$: $\lim_{n \to \infty} \gamma_n^{(i)} = 0$ and $\sum_{n=1}^{\infty} \big(\gamma_n^{(i)}\big)^2 < \infty$.

Remark.

It is easy to see that the sequences $\gamma_n^{(i)} = C_i\, n^{-\kappa_i}$ with some $C_i < \infty$ and $\kappa_i \in (1/2, 1]$ satisfy (A3.3).

Theorem 3.

Assume (A3.1)–(A3.3). Then, for any function $f : \mathsf{X}^L \to \mathbb{R}$ such that $\|f\|_{V^\alpha} < \infty$ for some $\alpha \in (0,1)$, and given that the limit $\pi_\infty(f) \stackrel{\mathrm{def}}{=} \lim_{n \to \infty} \bar{\pi}_{\beta_n}(f)$ exists, we have

$\frac{1}{N} \sum_{n=1}^{N} f(\mathbf{X}_n) \xrightarrow[N \to \infty]{} \pi_\infty(f) \quad \text{almost surely.}$

Remark.

In practice, one is usually only interested in integrating with respect to $\pi$, which means functions depending only on the first coordinate, that is, $f(\mathbf{x}) = f(x^{(1)})$. In this case, the limit condition is trivial, because $\bar{\pi}_{\beta_n}(f) = \pi(f)$ for all $n$, as $\beta_n^{(1)} \equiv 1$.

3.4. Convergence of temperature adaptation

The strong law of large numbers (Theorem 3) does not require the convergence of the inverse temperatures if only the coolest chain is involved (see the remark in Section 3.3). It is, however, important to work out the convergence of the adaptation, because it tells us what to expect of the asymptotic behaviour of the algorithm. Given the convergence, it is also possible to establish central limit theorems [1]; however, we do not pursue this here.

We denote the associated mean field of the stochastic approximation procedure (14) by

$h^{(\ell)}(\rho) \stackrel{\mathrm{def}}{=} A^{(\ell)}(\rho) - \alpha^*, \qquad \ell = 1, \ldots, L-1,$

where

$A^{(\ell)}(\rho) \stackrel{\mathrm{def}}{=} \mathbb{E}_{\bar{\pi}_\rho}\big[\varpi_\rho^{(\ell)}(\mathbf{X})\big].$

We may write

$A^{(\ell)}(\rho) = \frac{1}{Z^{(\ell)} Z^{(\ell+1)}} \iint \bigg(1 \wedge \frac{\pi^{\beta^{(\ell)}(\rho)}(y)\, \pi^{\beta^{(\ell+1)}(\rho)}(x)}{\pi^{\beta^{(\ell)}(\rho)}(x)\, \pi^{\beta^{(\ell+1)}(\rho)}(y)}\bigg)\, \pi^{\beta^{(\ell)}(\rho)}(x)\, \pi^{\beta^{(\ell+1)}(\rho)}(y)\, \mathrm{d}x\,\mathrm{d}y,$

where $Z^{(\ell)}$ is the normalising constant defined in (2), evaluated at $\beta^{(\ell)}(\rho)$.

The following result establishes the existence and uniqueness of the stable point of the adaptation. In words, it implies that there exist unique temperatures such that the mean rate of accepting proposed swaps between any two adjacent levels equals $\alpha^*$.

Proposition.

Assume (A3.1). Then, there exists a unique solution $\rho_* \in \mathbb{R}^{L-1}$ of the system of equations $h^{(\ell)}(\rho) = 0$, $\ell = 1, \ldots, L-1$.

Remark.

In the proposition above, it is sufficient to assume that the support of $\pi$ has infinite Lebesgue measure and that the normalising constants $Z_\beta$ in (2) are finite for all $\beta \in (0,1]$; see Section 5.4.

Remark.

In case the support of $\pi$ has a finite Lebesgue measure, it is not difficult to show that for a sufficiently large number of levels $L$ there is no solution $\rho_*$. Instead, in formal terms, some components of $\rho$ diverge, so that the corresponding inverse temperatures tend to zero. For our algorithm, this would imply that on those levels it asymptotically simulates the uniform distribution on the support of $\pi$.

For the convergence of the temperature adaptation, we require more stringent conditions on the step size sequence.

  • (A3.4) Assume that the step sizes defined in (14), (18), (19) and (20) are non-negative and satisfy the following conditions:

    1. $\sum_{n=1}^{\infty} \gamma_n^{(1)} = \infty$, $\sum_{n=1}^{\infty} \big(\gamma_n^{(1)}\big)^2 < \infty$, and $\sum_{n=1}^{\infty} \gamma_n^{(1)}\big(\gamma_n^{(2)} + \gamma_n^{(3)}\big) < \infty$.

It is easy to check that the sequences introduced in Section 3.3 satisfy (A3.4) under a mild additional condition on the exponents $\kappa_i$.

Theorem 4.

Assume (A3.1)–(A3.4). In addition, we assume that for all $\ell$, $\rho_*^{(\ell)}$ lies in the interior of the constraint set of the projection in (14), where $\rho_*$ is given by the proposition in Section 3.4. Then $\rho_n \to \rho_*$ almost surely.

4. Experiments

We consider two different types of examples: a mixture of Gaussians in Section 4.1 and a challenging spatial imaging example in Section 4.2. In all the experiments, we use the step size sequences described in Section 2.4, except for the RAM adaptation, where a different exponent is preferable (see [33] for a discussion). We did not observe numerical instability issues, so the adaptations were not enforced to be stable by projections. We used the same initial values for the adapted parameters (temperature differences, covariances and scalings) in all runs.

4.1. Mixture of Gaussians

We consider first a well-known two-dimensional mixture of Gaussians example [e.g. 19, 7]. The example consists of 20 mixture components with means in the square $[0, 10]^2$, and each component has a diagonal covariance $\sigma^2 I_2$ with $\sigma = 0.1$. Figure 1 shows an example of the points simulated by our parallel tempering algorithm in this example, when we use $L = 5$ temperature levels and the default (covariance) adaptation to adjust the random-walk proposals. Figure 2 shows the convergence of the temperature parameters in the same example.

Figure 1. Simulated points of the tempered distributions over 5000 iterations. The random-walk proposal is illustrated as an ellipsoid.
Figure 2. Convergence of the log-temperatures in the mixture example. The 10%–90% quantiles over 100 runs of the algorithm are drawn in grey and the black line shows the median.
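For reference, a minimal sketch of the unnormalised log-density of such a mixture target; the equal component weights and the helper name are our assumptions.

```python
import numpy as np

def make_mixture_logpdf(means, sigma=0.1):
    """Unnormalised log-density of an equally weighted Gaussian mixture
    with common covariance sigma^2 * I; means is an (m, d) array."""
    means = np.asarray(means)

    def log_pi(x):
        # Log-sum-exp over the components, for numerical stability.
        a = -np.sum((x - means) ** 2, axis=1) / (2.0 * sigma ** 2)
        m = a.max()
        return m + np.log(np.sum(np.exp(a - m)))

    return log_pi
```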

We computed estimates of the means and the squares of the coordinates with 5000 iterations, of which 2500 were burn-in, and show the mean and standard deviation (in parentheses) over 100 runs of our parallel tempering algorithm in Table 1. We considered three different random-walk adaptations: the default adaptation in (18)–(20) (Cov), the common mean and covariance estimators defined in (23)–(24) (Cov(g)), and the RAM update defined in (25). Table 1 shows the results in the same form as [7, Table 3] to allow easy comparison. When comparing with [7], our results show smaller deviation than the unadapted parallel tempering, but larger deviation than their samplers that also include equi-energy moves. We remark that we did not adjust our algorithm at all for the example, but let the adaptation take care of that. There are no significant differences between the random-walk adaptation algorithms.

            E[X1]          E[X2]          E[X1^2]         E[X2^2]
True value  4.478          4.905          25.605          33.920
Cov         4.469 (0.588)  4.950 (0.813)  25.329 (5.639)  34.209 (8.106)
Cov(g)      4.389 (0.537)  4.803 (0.692)  24.677 (5.411)  32.865 (6.660)
RAM         4.530 (0.524)  4.946 (0.811)  26.111 (5.308)  34.331 (8.292)
Table 1. The test of [7], Table 3 case (A), with L = 5 temperature levels, 5000 iterations and 2500 burn-in.

When looking at the simulated points in Figure 1, it is clear that three temperature levels are enough to allow good mixing in the above example. We repeated the example with L = 3 temperature levels, and increased the number of iterations to 8333 in order to have a comparable computational cost. The summary of the results in Table 2 indicates better accuracy than with L = 5 levels, and the accuracy comes close to the results reported in [7] for samplers with equi-energy moves.

            E[X1]          E[X2]          E[X1^2]         E[X2^2]
True value  4.478          4.905          25.605          33.920
Cov         4.480 (0.416)  4.957 (0.571)  25.542 (4.164)  34.420 (5.669)
Cov(g)      4.488 (0.422)  4.884 (0.551)  25.719 (4.190)  33.520 (5.476)
RAM         4.490 (0.407)  4.881 (0.541)  25.667 (4.281)  33.622 (5.631)
Table 2. The test of [7], Table 3 case (A), with L = 3 temperature levels, 8333 iterations and 4167 burn-in.

We also considered a more difficult modification of the mixture example above, with a smaller variance of the mixture components and a higher dimension; the mixture means of the added coordinates were all zero. We ran our adaptive parallel tempering algorithm in this case with a fixed number of temperature levels; Table 3 summarises the results with different numbers of iterations. In all the cases, the first half of the iterations were burn-in. In this scenario, the different random-walk adaptation algorithms behave slightly differently. In particular, the common mean and covariance estimates (Cov(g)) seem to improve over the separate covariances (Cov). The RAM approach seems to provide the most accurate results. However, we believe that this is probably due to the special properties of the example, namely the fact that all the mixture components have a common covariance, and the RAM converges close to this at the first level; see the discussion in [33].

Est. Cov Cov(g) RAM
3.080 2.245 1.660
27.577 20.426 16.428
1.788 1.580 1.429
18.577 15.712 14.475
1.439 1.267 0.952
15.471 12.769 9.364
1.257 0.975 0.698
13.017 9.414 6.981
1.096 0.680 0.508
11.093 7.038 5.122
Table 3. The root mean square errors in the modified mixture example.

4.2. Spatial imaging

As another example, we consider identifying ice floes from polar satellite images, as described by Banfield and Raftery [6]. The image under consideration is a 200 by 200 grey-scale satellite image, and we focus on the same 40 by 40 subregion as in [8]. The goal is to identify the presence and position of polar ice floes. Towards this goal, Higdon [14] employs a Bayesian model with an Ising model prior, leading to a posterior distribution on binary images $\mathbf{z} \in \{0,1\}^{40 \times 40}$ of the form

$\pi(\mathbf{z} \mid \mathbf{y}) \propto \exp\bigg(\alpha \sum_{i} \mathbb{1}\{z_i = y_i\} + \lambda \sum_{i \sim j} \mathbb{1}\{z_i = z_j\}\bigg),$

where the neighbourhood relation $i \sim j$ is defined through the vertical, horizontal and diagonal adjacencies of each pixel, and $\mathbf{y}$ denotes the observed image. The posterior distribution favours configurations $\mathbf{z}$ which are similar to the original image (first term) and for which the neighbouring pixels are equal (second term).

In [14] and [8], the authors observed that standard MCMC algorithms which propose to flip one pixel at a time fail to explore the modes of the posterior. There are, however, advantages to using such an algorithm, provided we can overcome the difficulty in mixing between the modes. Specifically, in order to compute (the log-difference of) the unnormalised density values, we only need to examine the neighbourhoods of the pixels that have changed. Therefore, the proposal with one pixel flip at a time has a low computational cost. Moreover, such an algorithm is easy to implement.
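A sketch of this local computation, for the posterior in the form reconstructed above (the parameter names alpha and lam are illustrative): only the flipped pixel and its eight neighbours enter the log-ratio, so each proposed flip costs O(1).

```python
import numpy as np

def flip_log_ratio(z, y, i, j, alpha, lam):
    """Log posterior ratio for flipping pixel (i, j) of the binary
    image z, given the observed binary image y."""
    h, w = z.shape
    old, new = z[i, j], 1 - z[i, j]
    # Data term: agreement with the observed image.
    ratio = alpha * (float(new == y[i, j]) - float(old == y[i, j]))
    # Prior term: agreement with the 8-neighbourhood.
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) == (0, 0):
                continue
            ni, nj = i + di, j + dj
            if 0 <= ni < h and 0 <= nj < w:
                ratio += lam * (float(new == z[ni, nj])
                                - float(old == z[ni, nj]))
    return ratio
```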

We used our parallel tempering algorithm with the above-mentioned single-pixel-flip proposal to simulate the posterior of this model. We ran 100 replications of the algorithm. The obtained results, shown in Figure 3, are similar to those of [14] and [8]. We emphasise again that our algorithm provided good results without any prior tuning.

Figure 3. Spatial model example: original image (left), posterior estimate based on 100 replications of adaptive parallel tempering (right)

5. Proofs

5.1. Proof of Theorem 1

The proof follows by arguments in the literature that guarantee the geometric ergodicity of the individual random-walk Metropolis kernels, and by observing that the swap kernel leaves permutation-invariant functions unchanged.

We start with an easy lemma showing that a drift for a cooler chain implies a drift for the higher-temperature chains.

Lemma.

Consider the drift functions $V_\beta(x) \stackrel{\mathrm{def}}{=} \big(\pi(x) / \sup_{y} \pi(y)\big)^{-s\beta}$ for some positive constants $s$ and $\beta$. Then, for any $0 < \beta' \le \beta$ and all $x$, $V_{\beta'}(x) \le V_\beta(x)$.

Proof.

We write $V_{\beta'}(x) = V_\beta(x) \cdot \big(\pi(x) / \sup_{y} \pi(y)\big)^{s(\beta - \beta')}$. The first term is independent of $\beta'$; since $\pi(x) / \sup_y \pi(y) \le 1$, the second term is at most one, and it is non-increasing with respect to $\beta - \beta'$. ∎

To control the ergodicity of each individual random-walk Metropolis sampler, it is required to control the minorisation and drift constants for the kernels $M^{(\ell)}$. The following lemma provides such explicit control.

Lemma.

Assume (A3.1). Let $s \in (0,1)$ and $\epsilon \in (0,1]$. There exist $\lambda \in (0,1)$ and $b < \infty$ such that for any $\beta \in [\epsilon, 1]$ and $\Sigma \in \mathcal{K}_\Sigma$, we get

(38) $M_{\beta,\Sigma} V_\beta(x) \le \lambda V_\beta(x) + b\,\mathbb{1}_C(x),$

where $V_\beta$ is as in (31), and where $C \subset \mathsf{X}$ is a compact set.

Proof.

It is easily seen that if the target distribution $\pi$ is super-exponential in the tails in the sense of (A3.1), then all the tempered versions $\pi_\beta \propto \pi^\beta$, where the normalising constant $Z_\beta$ is defined in (2), satisfy (A3.1) as well.

The result then follows from Andrieu and Moulines [1, Proposition 12]. ∎

Proposition.

Assume (A3.1) and let $s \in (0,1)$ and $\epsilon \in (0,1]$. Then, there exist $\lambda \in (0,1)$, $b < \infty$ and a measurable set $C \subset \mathsf{X}^L$ such that, for all $\beta \in \Delta_\epsilon$ and $\Sigma^{(1)}, \ldots, \Sigma^{(L)} \in \mathcal{K}_\Sigma$,

(39)
(40)
(41)
Proof.

By the second lemma of Section 5.1, since each $\pi^{\beta^{(\ell)}}$ satisfies (A3.1), we get

(42)

Then, by the first lemma of Section 5.1, since the inverse temperatures are decreasing, it holds