Self-Healing Umbrella Sampling

Acknowledgments: This work is supported by the European Research Council under the European Union's Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement number 614492 and by the French National Research Agency under the grant ANR-12-BS01-0019 (STAB). We also benefited from the scientific environment of the Laboratoire International Associé between the Centre National de la Recherche Scientifique and the University of Illinois at Urbana-Champaign.
The Self-Healing Umbrella Sampling (SHUS) algorithm is an adaptive biasing algorithm proposed in SHUS to efficiently sample a multimodal probability measure. We show that this method can be seen as a variant of the well-known Wang-Landau algorithm wang-landau-01 ; wang-landau-01-PRL . Adapting results on the convergence of the Wang-Landau algorithm obtained in fort:jourdain:kuhn:lelievre:stoltz:2014 , we prove the convergence of the SHUS algorithm. We also compare the two methods in terms of efficiency. Finally, we propose a modification of the SHUS algorithm that increases its efficiency, and exhibit some similarities between SHUS and the well-tempered metadynamics method barducci-bussi-parrinello-08 .
Keywords: Wang-Landau algorithm; stochastic approximation algorithms; free energy biasing techniques
The efficient sampling of a probability measure defined over a high-dimensional space is required in many application fields, such as computational statistics or molecular dynamics lelievre-rousset-stoltz-book-10 . Standard algorithms consist in building a dynamics which is ergodic with respect to the target distribution, such as Langevin dynamics lelievre-rousset-stoltz-book-10 ; roberts-tweedie-96 or Metropolis-Hastings dynamics MRRTT53 ; Hastings70 . Averages over trajectories of this ergodic dynamics are then used as approximations of averages with respect to the target probability measure. In many cases of interest, this probability measure is multimodal: regions of high probability are separated by regions of low probability, which implies that the ergodic dynamics is metastable. This means that it takes a long time to leave a high probability region (called a metastable state). The consequence of this metastable behavior is that trajectory averages converge very slowly to their ergodic limits.
Many techniques have been proposed to overcome these difficulties. Among them, importance sampling consists in modifying the target probability using a well-chosen bias in order to enhance the convergence to equilibrium. Averages with respect to the original target are then recovered using a reweighting of the biased samples. In general, it is not easy to devise an appropriate bias. Adaptive importance sampling methods have thus been proposed in order to automatically build a “good” bias (see (lelievre-rousset-stoltz-book-10, Chapter 5) for a review of these approaches).
Let us explain the principle of adaptive biasing techniques in a specific setting (we refer to (lelievre-rousset-stoltz-book-10, Chapter 5) for a more general introduction to such methods). To fix ideas, let us consider a target probability measure on the state space , where denotes the Lebesgue measure on , and let us be given a partition of into subsets. These subsets are henceforth called 'strata'. The choice of such a partition will be discussed later (see Footnote 1 below). We assume that the target measure is multimodal in the sense that the weights of the strata span several orders of magnitude. In other words, the weights of the strata
vary a lot and is large. We suppose from now on and without loss of generality (up to removing some strata) that . In such a situation, it is natural to consider the following biased probability density:
which is such that
for all : under the biased probability measure , each stratum has the same weight. In particular, the ergodic dynamics which are built with as the target measure are typically less metastable than the dynamics with target . The practical difficulty to implement this technique is of course that the vector
is unknown. The principle of adaptive biasing methods is to learn the vector on the fly in order to eventually sample the biased probability measure . Adaptive algorithms thus build a sequence of vectors which is expected to converge to . Various updating procedures have been proposed (lelievre-rousset-stoltz-book-10, Chapter 5). Such adaptive techniques are used on a daily basis by many practitioners, in particular for free energy computations in computational statistical physics. In this context, the partition of is related to the choice of a so-called reaction coordinate (see Footnote 1), and the weights give the free energy profile associated with this reaction coordinate. [Footnote 1: For a given measure , the choice of a good partition or, equivalently, of a good reaction coordinate does of course influence the efficiency of the algorithm. This choice is a difficult problem that we do not consider here (see for example chopin-lelievre-stoltz-12 for such discussions in the context of computational statistics). Here, both and are assumed to be given.] We focus here on a specific adaptive biasing method called the Self-Healing Umbrella Sampling (SHUS) technique SHUS ; dickson-legoll-lelievre-stoltz-fleurat-lessard-10 . We will show that it is a variant of the well-known Wang-Landau method wang-landau-01 ; wang-landau-01-PRL . From a practical viewpoint, the main interest of SHUS compared to Wang-Landau is that the practitioner has fewer numerical parameters to tune (as will be explained in Section 2.3).
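As a concrete toy illustration of the biased density introduced above, the following sketch (with hypothetical names and a hypothetical 1D setup, since the actual density, strata and symbols are not reproduced in this extract) builds a bimodal target, computes the stratum weights by quadrature, and checks that reweighting each stratum by the inverse of its weight equalizes the stratum masses:

```python
import numpy as np

# Hypothetical 1D stand-in: a bimodal target density on [0, 1), partitioned
# into d equal-width strata.
d = 4

def stratum_index(x):
    """Index i(x) of the stratum containing x."""
    return min(int(x * d), d - 1)

def target(x):
    """Unnormalized bimodal density (illustrative choice)."""
    return np.exp(-20.0 * (x - 0.2) ** 2) + np.exp(-20.0 * (x - 0.8) ** 2)

# Midpoint-rule approximation of the stratum weights theta*(i).
grid = (np.arange(4000) + 0.5) / 4000
idx = np.array([stratum_index(x) for x in grid])
vals = target(grid)
theta_star = np.array([vals[idx == i].sum() for i in range(d)])
theta_star /= theta_star.sum()

# The biased density divides the target by the weight of the current stratum;
# each stratum then carries the same mass 1/d under it.
biased_vals = vals / theta_star[idx]
stratum_mass = np.array([biased_vals[idx == i].sum() for i in range(d)])
stratum_mass /= stratum_mass.sum()
print(stratum_mass)  # [0.25 0.25 0.25 0.25]
```

The biased measure is typically much less metastable than the original one, since no stratum is rare anymore.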
The aim of this paper is to analyze the SHUS algorithm in terms of convergence and efficiency. First, we adapt the results of fort:jourdain:kuhn:lelievre:stoltz:2014 which prove the convergence of Wang-Landau to obtain the convergence of SHUS (see Theorem 3.2). Second, we perform numerical experiments to analyze the efficiency of SHUS, in the spirit of amrx where similar numerical tests are performed for the Wang-Landau algorithm. The efficiency analysis consists in estimating the average exit time from a metastable state in the limit when goes to infinity. Adaptive techniques (such as SHUS or Wang-Landau) yield exit times which are much smaller than for the original non-adaptive dynamics.
The main output of this work is that, both in terms of convergence (longtime behavior) and efficiency (exit times from metastable states), SHUS is essentially equivalent to the Wang-Landau algorithm for a specific choice of the numerical parameters. These numerical parameters are not the optimal ones in terms of efficiency and we propose in Section 5.2 a modified SHUS algorithm which is (in the longtime limit) equivalent to the Wang-Landau algorithm with better sets of parameters.
This article is organized as follows. In Section 2, we introduce the SHUS algorithm, check its asymptotic correctness and explain how it can be seen as a Wang-Landau algorithm with a stochastic stepsize sequence. In Section 3, we state a convergence result for Wang-Landau algorithms with general (either deterministic or stochastic) stepsize sequences and deduce the convergence of SHUS. The proofs are based on stochastic approximation arguments and are postponed to Section 6. Numerical results illustrating the efficiency of the algorithm and comparing its performance with the standard Wang-Landau algorithm are provided in Section 4. Finally, in Section 5, we draw some conclusions on the interest of SHUS compared with the Wang-Landau algorithm, and further compare SHUS with other adaptive techniques, such as the well-tempered metadynamics algorithm barducci-bussi-parrinello-08 and the above-mentioned modified SHUS algorithm. We also prove the convergence of this modified SHUS algorithm (see Proposition 4), and present numerical results showing that this new method is closely related to a Wang-Landau dynamics with larger stepsizes.
2 The SHUS algorithm
Using the notation of the introduction, we consider a target probability measure on the state space and a partition of into strata .
We introduce a family of biased densities , for , where is the set of positive probability measures on :
The biased densities are obtained from by a reweighting of each stratum:
where is defined by (1). Observe that for any and ,
2.1 Description of the algorithm
As explained above, the principle of many adaptive biasing techniques, and of SHUS in particular, is to build a sequence which converges to . This makes it possible to sample , which is less multimodal than . In order to understand how the updating rule for is built for SHUS, one may proceed as follows.
Let us first assume that we are given a Markov chain which is ergodic with respect to the target measure (think of a Metropolis-Hastings dynamics). Let us introduce the sequence (for a given )
which, in some sense, counts the number of visits to each stratum. By the ergodic property, it is straightforward to check that converges almost surely (a.s.) to as . As explained in the introduction, the difficulty with this algorithm is that the convergence of to is very slow due to the metastability of the density and thus of the Markov chain .
The idea is then that if an estimate of is available, one should instead consider a Markov chain which is ergodic with respect to and thus hopefully less metastable. To estimate with this new Markov chain, one should modify the updating rule (6) as
in order to unbias the samples (since ). Again, by the ergodic property, one easily gets that converges a.s. to as . Indeed, converges a.s. to (given by (5)) as . Since
converges a.s. to , which implies in turn
Hence converges a.s. to as .
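The unbiasing step above can be checked numerically in a toy setting (all numbers below are hypothetical, since the actual quantities are elided in this extract): draws from the biased measure hit stratum i with probability proportional to the ratio of the true weight to the bias, so weighting each visit by the bias of its stratum recovers the true weights in the ergodic limit:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = np.array([0.55, 0.05, 0.05, 0.35])   # true (unknown) weights
theta = np.array([0.4, 0.1, 0.2, 0.3])            # current bias estimate

# Exact stratum-level sampling under the biased measure: stratum i is hit
# with probability proportional to theta_star(i) / theta(i).
p_biased = theta_star / theta
p_biased /= p_biased.sum()
n = 200_000
visits = rng.choice(len(theta), size=n, p=p_biased)

# Reweighted visit counts: each visit to stratum i contributes theta(i).
counts = np.bincount(visits, minlength=len(theta)) * theta
theta_hat = counts / counts.sum()
print(np.round(theta_hat, 3))  # close to theta_star
```

The expected reweighted count of stratum i is proportional to (theta_star(i)/theta(i)) * theta(i) = theta_star(i), which is why the normalization recovers the true weights whatever the bias.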
The SHUS algorithm consists in using the current value as the estimate of in the previous algorithm. Let us now precisely define the SHUS algorithm. Let be a family of transition kernels on which are ergodic with respect to . In particular,
Let , and be deterministic. The SHUS algorithm consists in iterating the following steps:
compute the probability measure on ,
draw according to the kernel ,
compute, for all ,
Notice that only simulation under is needed to implement SHUS. By choosing as a Metropolis-Hastings kernel with target measure , the density needs to be known only up to a normalizing constant. Therefore, SHUS covers the case when the target measure is only known up to a multiplicative constant, which is generally the case in view of applications to Bayesian statistics and statistical physics.
As proved in Section 3, converges almost surely to when . According to the update mechanism (10), is nonnegative, and it is positive if and only if the current draw lies in stratum . In addition, this variation is proportional to the current approximation of , with a factor chosen by the user (prior to the run of the algorithm; the choice of is numerically investigated in Section 4).
The principle of this algorithm is thus to penalize the already visited strata, in order to favor transitions towards unexplored strata. The penalization strength is proportional to the current bias of the strata. The prefactor in (10) can be understood as a way to unbias the samples in order to recover samples distributed according to the target measure , see also formula (17) below.
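A minimal runnable sketch of the whole loop may help. Since the exact updating formula (10) and its constants are elided in this extract, the code below follows the verbal description above (the unnormalized weight of the visited stratum is increased by a user-chosen factor times the current normalized weight of that stratum); the 1D toy target and all names are illustrative, not the paper's actual setting:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

def stratum_index(x):
    return min(int(x * d), d - 1)

def target(x):
    return np.exp(-20.0 * (x - 0.2) ** 2) + np.exp(-20.0 * (x - 0.8) ** 2)

def shus(n_iter=150_000, gamma_star=1.0, sigma=0.1):
    S = np.ones(d)                 # initial unnormalized weights S_0
    x = 0.05                       # start in the left mode
    for _ in range(n_iter):
        theta = S / S.sum()        # current normalized estimate theta_n
        # Metropolis-Hastings step targeting the biased density
        # proportional to target(x) / theta(i(x)).
        y = x + sigma * rng.normal()
        if 0.0 <= y < 1.0:
            ratio = (target(y) / theta[stratum_index(y)]) / (
                target(x) / theta[stratum_index(x)])
            if rng.random() < ratio:
                x = y
        # SHUS update: penalize the stratum just visited, proportionally
        # to its current normalized weight.
        S[stratum_index(x)] += gamma_star * theta[stratum_index(x)]
    return S / S.sum()

theta_hat = shus()

# Reference value of the stratum weights by midpoint-rule quadrature.
grid = (np.arange(4000) + 0.5) / 4000
idx = np.array([stratum_index(x) for x in grid])
vals = target(grid)
theta_star = np.array([vals[idx == i].sum() for i in range(d)])
theta_star /= theta_star.sum()
print(np.round(theta_hat, 3), np.round(theta_star, 3))
```

Note that the expected increment of the unnormalized weight of stratum i under the biased chain is proportional to the true weight of stratum i whatever the current bias, which is the heuristic reason the normalized weights converge.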
2.2 Asymptotic correctness
Let us consider the SHUS algorithm (Algorithm 1) and let us assume that converges to some value, say . It is then expected that converges to and that the updating rule (10) leads to the same asymptotic behavior as (7) (where in (10) has been replaced by its limit ). Thus, it is expected that a.s. by (8), and thus a.s. Therefore, the only possible limit of the sequence is .
This heuristic argument is of course not a proof of convergence, but it explains why one can expect the SHUS algorithm to behave like a Metropolis-Hastings algorithm with target measure . The rigorous result for the convergence is given in Section 3, and the efficiency of SHUS is discussed in Section 4.
2.3 Reformulation as a Wang-Landau algorithm with a stochastic stepsize sequence
One key observation of this work is that SHUS can be seen as a Wang-Landau algorithm with nonlinear update of the weights and with a specific stepsize sequence , see wang-landau-01 ; wang-landau-01-PRL ; fort:jourdain:kuhn:lelievre:stoltz:2014 ; amrx . The Wang-Landau algorithm with nonlinear update of the weights consists in replacing the updating formula (10) by:
where the deterministic stepsize sequence has to be chosen by the practitioner beforehand. The choice of this sequence is not easy: it should converge to zero as goes to infinity (vanishing adaptation) in order to ensure the convergence of the sequence , but not too fast, since otherwise the convergence of to is not ensured.
In fact, the updating rule of the original Wang-Landau algorithm is more complicated than (11) since the stepsizes are changed at random stopping times related to a quasi-uniform population of the strata and not at every iteration, see JR11 for a mathematical analysis of the well-posedness of the algorithm.
Going back to SHUS, by setting
it is easy to check that (10) is equivalent to
which explains why SHUS can be seen as a Wang-Landau algorithm with nonlinear update of the weights (see (11)) but with a stepsize sequence which is not chosen by the practitioner: it is adaptively built by the algorithm.
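This equivalence can be checked numerically on a single update step. The sketch below uses hypothetical values, and takes the elided formula (12) to mean that the stochastic stepsize equals the user-chosen factor divided by the current total unnormalized weight (an assumption consistent with the description above); it verifies that the SHUS update of the normalized weights coincides with the nonlinear Wang-Landau update:

```python
import numpy as np

gamma_star = 0.7
S = np.array([2.0, 5.0, 1.5, 3.0])   # current unnormalized weights S_n
i = 2                                # stratum of the current draw

theta = S / S.sum()
# SHUS: bump the visited stratum by gamma* * theta_n(i), then renormalize.
S_next = S.copy()
S_next[i] += gamma_star * theta[i]
theta_shus = S_next / S_next.sum()

# Nonlinear Wang-Landau with stochastic stepsize gamma_n = gamma* / sum_j S_n(j):
# multiply the visited stratum's weight by (1 + gamma_n) and renormalize.
gamma_n = gamma_star / S.sum()
bump = np.where(np.arange(len(S)) == i, 1.0 + gamma_n, 1.0)
theta_wl = theta * bump / (theta * bump).sum()

print(np.allclose(theta_shus, theta_wl))  # True
```

The algebra behind the check: dividing the SHUS numerator and denominator by the current total unnormalized weight turns the additive update into the multiplicative Wang-Landau one with the stochastic stepsize above.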
3 Convergence results
The goal of this section is to establish convergence results on the sequence given by (9) to the weight vector and on the distribution of the samples . To do so, we extend the results of fort:jourdain:kuhn:lelievre:stoltz:2014 and, more generally, prove convergence of the Wang-Landau algorithm with a general (either deterministic or stochastic) stepsize sequence and the nonlinear update of the weights. We need the following assumptions on the target density and the kernels :
The density of the target distribution is such that and the strata satisfy .
Notice that this assumption implies that given by (1) is such that (hence ).
For any , is a Metropolis-Hastings transition kernel with proposal kernel where is symmetric and satisfies , and with invariant distribution , where is given by (4).
3.1 Convergence of Wang-Landau algorithms with a general stepsize sequence
In fort:jourdain:kuhn:lelievre:stoltz:2014 , we consider the Wang-Landau algorithm with a linear update of the weights. The linear update version of Wang-Landau consists in changing the updating rule (11) to a linearized version (in the limit ) on the normalized weights:
We prove in fort:jourdain:kuhn:lelievre:stoltz:2014 that, when the target density and the kernels satisfy A1 and A2, the Wang-Landau algorithm with this linear update of the weights converges under the following condition on : the sequence is deterministic, ultimately non-increasing,
To the best of our knowledge, neither the convergence of the Wang-Landau algorithm with the nonlinear update of the weights (11) nor the case of a random sequence of stepsizes has been addressed in the literature. These are particular cases of the following Wang-Landau algorithm with a general stepsize sequence, which also generalizes SHUS: starting from random variables and , iterate the following steps:
compute the probability measure on ,
draw according to the kernel ,
compute, for all ,
where the positive stepsize sequence is supposed to be predictable with respect to the filtration (i.e. is -measurable).
Algorithm 2 is a meta-algorithm: to obtain a practical algorithm, one has to specify the way the stepsize sequence is generated.
An algorithm of the type described in Algorithm 2 is said to converge if it satisfies the following properties:
For any bounded measurable function on ,
For any bounded measurable function on ,
In addition to the almost sure convergence of to , this definition encompasses the ergodic convergence of the random sequence to and the convergence of an importance sampling-type Monte Carlo average to the probability measure . Our main result is the following Theorem.
One immediately deduces convergence of the Wang-Landau algorithm with deterministic stepsize sequence and nonlinear update (11) under the same assumptions as for the linear update (14), which generalizes the result of fort:jourdain:kuhn:lelievre:stoltz:2014 .
Before outlining the proof of Theorem 3.1, let us discuss its application to the SHUS algorithm.
3.2 Convergence of SHUS
The stepsize sequence obtained in the reformulation of the SHUS algorithm (Algorithm 1) as a Wang-Landau algorithm is clearly decreasing, since the sequence is increasing. To apply Theorem 3.1, we also need to check (18). This is the purpose of the following proposition, which is proved in Section 6.
Since the sequence generated by the SHUS algorithm is a Markov chain, one easily deduces that the SHUS algorithm started from a random initial condition also converges in the sense of Definition 1.
The convergence result from Theorem 3.2 allows us to characterize the asymptotic behavior of the stepsizes. Indeed, since for and ,
the property (17) with implies that the SHUS algorithm generates sequences and which satisfy
Notice that this Corollary implies that the stepsize sequence scales like in the large limit. In Section 4.3, we will therefore compare SHUS with the Wang-Landau algorithm implemented with a stepsize sequence , for some positive parameter .
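This scaling can be observed numerically on a 1D toy example (all names and the setup are hypothetical, following the verbal description of the update): the total unnormalized weight grows roughly linearly in the number of iterations, so the stochastic stepsize, which is inversely proportional to it, decays like a constant over the iteration index:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 4

def stratum_index(x):
    return min(int(x * d), d - 1)

def target(x):
    return np.exp(-20.0 * (x - 0.2) ** 2) + np.exp(-20.0 * (x - 0.8) ** 2)

def run_shus(n_iter, gamma_star=1.0, sigma=0.1):
    """Run the toy SHUS loop, recording the total unnormalized weight."""
    S, x, totals = np.ones(d), 0.05, []
    for n in range(1, n_iter + 1):
        theta = S / S.sum()
        y = x + sigma * rng.normal()
        if 0.0 <= y < 1.0:
            ratio = (target(y) / theta[stratum_index(y)]) / (
                target(x) / theta[stratum_index(x)])
            if rng.random() < ratio:
                x = y
        S[stratum_index(x)] += gamma_star * theta[stratum_index(x)]
        if n in (40_000, 80_000):
            totals.append(S.sum())
    return totals

t1, t2 = run_shus(80_000)
growth = t2 / t1
print(round(growth, 2))  # close to 2: the total weight grows linearly in n
```

Since the stochastic stepsize is the user-chosen factor divided by this linearly growing total, doubling the iteration count roughly halves the stepsize, consistent with the 1/n scaling stated above.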
3.3 Strategy of the proof of Theorem 3.1
The proof of Theorem 3.1 is given in Section 6. It relies on a rewriting of the updating mechanism of the sequence as a Stochastic Approximation (SA) algorithm for which convergence results have been proven (see for example andrieu:moulines:priouret:2005 ). Notice that the updating formula (15)–(16) is equivalent to: for all and
where for all ,
denotes the index of the stratum where lies. Upon noting that , (19) is equivalent to
where is defined by
Notice that the last term in (20) is of the order of . The recurrence relation (20) is thus in a standard form to apply convergence results for SA algorithms (see e.g. andrieu:moulines:priouret:2005 ).
There are however three specific points in Algorithm 2 which make the study of this SA recursion quite technical. A first difficulty arises from the fact that alone is not a Markov chain: given the past up to time , is generated according to a Markov transition kernel computed at the current position but controlled by the current value , which depends on the whole trajectory . A second one comes from the randomness of the stepsizes . Finally, it is not clear whether the sequence remains a.s. in a compact subset of the open set , so that a preliminary step when proving the convergence of the SA recursion is to establish its recurrence, namely the fact that the sequence returns to a compact set of infinitely often.
3.4 A few crucial intermediate results
Let us highlight a few results which are crucial to tackle these difficulties and establish Theorem 3.1. The fundamental result to address the dynamics of controlled by is the following proposition, established in (fort:jourdain:kuhn:lelievre:stoltz:2014, Proposition 3.1).
In the present case, the recurrence property of the SA algorithm means the existence of a positive threshold such that infinitely often in , the minimal weight
is larger than the threshold. Let
be the smallest index of stratum with smallest weight according to and be the times of return to the stratum of smallest weight: and, for ,
with the convention . We prove in Section 6 the following recurrence property.
Using the recurrence property of Proposition 3, we are then able to prove that Algorithm 2 converges in the sense of Definition 1- by using general convergence results for SA algorithms given in andrieu:moulines:priouret:2005 . The properties in Definition 1-- then follow from convergence results for adaptive Markov Chain Monte Carlo algorithms given in fort:moulines:priouret:2011 . See Section 6 for the details.
4 Numerical investigation of the efficiency
We present in this section some numerical results illustrating the efficiency of SHUS in terms of exit times from a metastable state. We also compare the performances of SHUS and of the Wang-Landau algorithm on this specific example.
4.1 Presentation of the model and of the dynamics
We consider the system based on the two-dimensional potential suggested in PSLS03 , see also amrx for similar experiments on the Wang-Landau algorithm. The state space is (with ), and we denote by a generic element of . The reference measure is the Lebesgue measure . The density of the target measure reads
for some positive inverse temperature , with
and the normalization constant
We introduce strata , with and . Thus, the element lies in the stratum .
A plot of the level sets of the potential is presented in Figure 1. The global minima of the potential are located at the points and (notice that the potential is symmetric with respect to the -axis).
The Metropolis-Hastings kernels are constructed using isotropic proposal moves distributed in each direction according to independent Gaussian random variables with variance . The reference Metropolis-Hastings dynamics is ergodic and reversible with respect to the measure with density . This dynamics is metastable: for local moves ( of the order of a fraction of ), it takes a long time to go from the left to the right, or equivalently from the right to the left. More precisely, there are two main metastable states: one located around , and another one around . These two states are separated by a region of low probability. The metastability of the dynamics increases with (i.e. as the temperature decreases). The larger is, the larger the ratio between the weight under of the strata located near the main metastable states and the weight under of the transition region around , and the more difficult it is to leave the left metastable state to enter the one on the right (and conversely). We refer for example to (amrx, Fig. 3-1) for a numerical quantification of this statement.
As already pointed out in amrx , adaptive algorithms such as the Wang-Landau dynamics are less metastable than the original Metropolis-Hastings dynamics, in the sense that the typical time needed to leave a metastable state is much smaller thanks to the adaptation mechanism. In this section, we compare the adaptive Markov chain corresponding to the SHUS algorithm with the one generated by the Wang-Landau dynamics with a nonlinear update of the weights (see (11)) with stepsizes for some constant . This choice for is motivated by the asymptotic behavior of in the limit , see Corollary 2. The proposal kernel used in the Metropolis algorithm is the same for SHUS and Wang-Landau. Therefore, the two algorithms only differ by the update rules of the weight sequence . The initial weight vector is and the initial condition is for both dynamics.
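Since the expression of the potential is not reproduced in this extract, the sketch below assumes the standard three-hole potential of PSLS03 commonly used in this line of work (this exact form is an assumption), with minima near (-1, 0) and (1, 0), and illustrates the metastability of the plain Metropolis-Hastings chain started in the left well:

```python
import numpy as np

rng = np.random.default_rng(3)

def V(x, y):
    """Assumed three-hole potential (PSLS03-style); minima near (+/-1, 0)."""
    return (3.0 * np.exp(-x**2 - (y - 1.0/3.0)**2)
            - 3.0 * np.exp(-x**2 - (y - 5.0/3.0)**2)
            - 5.0 * np.exp(-(x - 1.0)**2 - y**2)
            - 5.0 * np.exp(-(x + 1.0)**2 - y**2)
            + 0.2 * x**4 + 0.2 * (y - 1.0/3.0)**4)

beta, sigma = 10.0, 0.1
x, y = -1.0, 0.0                   # start in the left well
n_right, n_iter = 0, 20_000
for _ in range(n_iter):
    # Isotropic Gaussian proposal, Metropolis acceptance for exp(-beta * V).
    xp, yp = x + sigma * rng.normal(), y + sigma * rng.normal()
    if rng.random() < np.exp(-beta * (V(xp, yp) - V(x, y))):
        x, y = xp, yp
    n_right += (x > 0.0)
frac_right = n_right / n_iter
print(frac_right)  # close to 0: the unbiased chain rarely crosses the barrier
```

At this inverse temperature the barrier between the wells is several units high, so the plain chain essentially never crosses within this budget; the adaptive algorithms compared below are designed precisely to accelerate such exits.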
4.2 Study of a typical realization
Let us first consider a typical realization of the SHUS algorithm, in the case when is equal to the width of a stratum , with (so that ), and an inverse temperature . The values of the first component of the chain as a function of the iteration index are reported in Figure 2. The trajectory is qualitatively very similar to the trajectories obtained with the Wang-Landau algorithm (see for example (amrx, Figure 5)). In particular, the exit time out of the second visited well is much larger than the exit time out of the first one. Such a behavior is proved for the Wang-Landau algorithm with a deterministic stepsize in (amrx, Section 6). After the exit out of the second well, the convergence of the sequence is already almost achieved, and the dynamics freely moves between the two wells.
When the initial exit out of the first well occurs, the biases within this well are already very well converged, see Figure 3.
In the longtime limit, according to Figure 4 (top), one has , as predicted by Corollary 2. In the initial phase (before the exit out of the left well), stabilizes around a value corresponding to the number of strata explored within the well, see Figure 4 (bottom) (here, the exploration is well performed for values of between -1.2 and -0.45, which corresponds to 15 strata). In this phase, only the restriction of the target density to these strata is seen by the algorithm and this convergence can be seen as a local version of Corollary 2.
4.3 Scalings of the first exit time
In this section, we study the influence of the three parameters , and on the first exit times (in the limit of small temperature). Concerning , we stick to the assumption that, without any prior knowledge of the system, the choice is natural. In addition, notice that multiplying and the parameter by the same constant does not modify the sequence generated by the SHUS algorithm. This is why we always choose .
Average exit times are obtained by performing independent realizations of the following procedure: initialize the system in the state , and run the dynamics until the first time index such that (i.e. the first component of is larger than 1) for SHUS or for Wang-Landau. The average of this first exit time is denoted by for SHUS and by for Wang-Landau. For a given value of the inverse temperature , the computed average exit times and are obtained by averaging over independent realizations of the process started at . We use the Mersenne-Twister random number generator as implemented in the GSL library. Since we work with a fixed maximal computational time (of about a week or two on our computing machines with our implementation of the code), turns out to be of the order of a few thousand for the largest exit times, while in the easiest cases corresponding to the shortest exit times. In our numerical results, we checked that is always sufficiently large so that the relative error on and is less than a few percent in the worst cases.
According to the numerical experiments performed in amrx , which confirm the theoretical analysis of a simple three-state model also given in amrx , the scaling behavior for in the limit for the Wang-Landau algorithm with a stepsize sequence is
where and are positive constants which depend on , and .
Due to the initial convergence of to the number of strata visited before the first exit time, one expects the first exit time of the SHUS algorithm to behave like the first exit time for the Wang-Landau algorithm with stepsizes . Note that since we use strata of equal sizes, the number is proportional to the total number of strata (see Table 4 for a more quantitative assessment).
We consider three situations:
We observe in all cases that the average first exit time is of the form
in the limit of large , the values and being obtained by a least-squares fit in log-log scale. In view of (29), this confirms the validity of the above comparison with the Wang-Landau algorithm. The exponential rate does not seem to depend on the values of the parameters , and ; only the prefactors depend on these parameters. The larger the number of strata, and the lower the value of , the larger the prefactor is. A more quantitative assessment of the increase of the prefactor with larger numbers of strata and smaller values of is provided in the captions of Tables 2 and 3.
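The fitting step can be sketched on synthetic data (the measured exit times and the exact fitted form are not reproduced in this extract, so we assume an exit-time law of the form A * exp(c * beta) and recover its parameters by linear least squares on the logarithm):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic noisy measurements of average exit times T(beta) ~ A * exp(c * beta).
A_true, c_true = 50.0, 1.8
betas = np.linspace(4.0, 12.0, 9)
T = A_true * np.exp(c_true * betas) * np.exp(0.05 * rng.normal(size=betas.size))

# Linear least-squares fit of log T against beta recovers the rate c (slope)
# and the prefactor A (exponential of the intercept).
c_fit, logA_fit = np.polyfit(betas, np.log(T), 1)
print(round(c_fit, 2), round(float(np.exp(logA_fit)), 1))
```

Working on the logarithm turns the multiplicative noise into additive noise, so an ordinary least-squares fit is appropriate; the same recipe applies to any of the parameter settings compared above.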