# Free Energy Methods for Bayesian Inference: Efficient Exploration of Univariate Gaussian Mixture Posteriors

Nicolas Chopin¹ (corresponding author: nicolas.chopin@ensae.fr), Tony Lelièvre² and Gabriel Stoltz²

1: ENSAE-CREST, 3, Avenue Pierre Larousse, 92245 Malakoff, France.
2: Université Paris-Est, CERMICS, Projet MICMAC, École des Ponts ParisTech - INRIA, 6 & 8 Av. Blaise Pascal, 77455 Marne-la-Vallée Cedex 2, France.
###### Abstract

Because of their multimodality, mixture posterior distributions are difficult to sample with standard Markov chain Monte Carlo (MCMC) methods. We propose a strategy to enhance the sampling of MCMC in this context, using a biasing procedure which originates from computational Statistical Physics. The principle is first to choose a “reaction coordinate”, that is, a “direction” in which the target distribution is multimodal. In a second step, the marginal log-density of the reaction coordinate with respect to the posterior distribution is estimated; minus this quantity is called “free energy” in the computational Statistical Physics literature. To this end, we use adaptive biasing Markov chain algorithms which adapt their targeted invariant distribution on the fly, in order to overcome sampling barriers along the chosen reaction coordinate. Finally, we perform an importance sampling step in order to remove the bias and recover the true posterior. The efficiency factor of the importance sampling step can easily be estimated a priori once the bias is known, and appears to be rather large for the test cases we considered.

A crucial point is the choice of the reaction coordinate. One standard choice (used for example in the classical Wang-Landau algorithm) is minus the log-posterior density. We discuss other choices. We show in particular that the hyper-parameter that determines the order of magnitude of the variance of each component is both a convenient and an efficient reaction coordinate.

We also show how to adapt the method to compute the evidence (marginal likelihood) of a mixture model. We illustrate our approach by analyzing two real data sets.

## 1 Introduction

Mixture modeling is presumably the most popular and the most flexible way to model heterogeneous data; see e.g. Titterington et al. (1986), McLachlan and Peel (2000) and Frühwirth-Schnatter (2006) for an overview of applications of mixture models. Due to the emergence of Markov chain Monte Carlo (MCMC), interest in the Bayesian analysis of such models has sharply increased in recent years, starting with Diebolt and Robert (1994). However, MCMC analysis of mixtures poses several problems (Celeux et al., 2000; Jasra et al., 2005). In this paper, we focus on the difficulties arising from the multimodality of the posterior distribution.

For the sake of exposition, we concentrate our discussion on univariate Gaussian mixtures, but we explain in the conclusion (Section 6) how our ideas may be extended to other mixture models. The data vector $y=(y_1,\ldots,y_n)$ contains $n$ independent and identically distributed (i.i.d.) real random variables with density

$$p(y\mid\theta)=\prod_{i=1}^{n}p(y_i\mid\theta),\qquad p(y_i\mid\theta)=\sum_{k=1}^{K}q_k\,\varphi(y_i;\mu_k,\lambda_k^{-1}),\tag{1}$$

where the vector $\theta$ contains all the unknown parameters, i.e. the mixture weights $q_k$, the means $\mu_k$, the precisions $\lambda_k$, and possibly hyper-parameters, and $\varphi(\cdot;\mu,\sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$.
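To fix ideas, the mixture density (1) can be evaluated directly; the sketch below (function and variable names are our own choices) computes the log-likelihood of a sample under given parameters:

```python
import math

def log_likelihood(y, q, mu, lam):
    """Log-likelihood of model (1): each y_i is drawn from
    sum_k q_k * N(mu_k, 1/lam_k), where lam_k is a precision."""
    total = 0.0
    for yi in y:
        # density of one observation: mixture over the K components
        p = sum(qk * math.sqrt(lk / (2.0 * math.pi)) * math.exp(-0.5 * lk * (yi - mk) ** 2)
                for qk, mk, lk in zip(q, mu, lam))
        total += math.log(p)
    return total
```

In practice one works in log scale throughout (see Section 2.1), since the likelihood of a large sample underflows quickly.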

This model is not identifiable, because both the likelihood and the posterior density are invariant under permutation of the components, provided the prior is symmetric. This is the root of the aforementioned problems. By construction, any local mode of the posterior density admits symmetric replicates, while a typical MCMC sampler recovers only one of these copies. A possible remedy is to introduce steps that randomly permute the components (Frühwirth-Schnatter, 2001). However, mixture posterior distributions are often “genuinely multimodal”, following the terminology of Jasra et al. (2005): the number of sets of equivalent modes is often larger than one; see also Marin and Robert (2007, Chap. 6) for an example of a multimodal posterior distribution generated by an identifiable mixture model, and Section 2.2 of this paper for an example based on real data. Therefore, and following Celeux et al. (2000) and Jasra et al. (2005), we consider that a minimum requirement of convergence for an MCMC sampler is to visit all possible labellings of the parameters, without resorting to random permutations.

Inspired by techniques used in molecular dynamics (Lelièvre et al., 2010), our aim is to develop samplers that meet this requirement, using an importance sampling strategy where the importance sampling function is the marginal distribution of a well chosen variable. The principle of the method is (i) to choose a “reaction coordinate”, namely a low-dimensional function $\xi(\theta)$ of the parameters $\theta$; (ii) to compute the marginal log-density of this reaction coordinate (minus this log-density is called the “free energy” in the molecular dynamics literature); and (iii) to use the free energy to build a bias for the target of the MCMC sampler, in order to move more freely between the different modal regions of the initial target distribution. More precisely, the biased log-density is obtained by adding the free energy to the target log-density; this changes the marginal distribution of the reaction coordinate into a uniform distribution (within chosen bounds). At the final stage of the algorithm, the bias can be removed by performing a simple importance sampling step from the biased posterior distribution to the original unbiased posterior distribution. Expectations with respect to the posterior distribution may be computed from the weighted sample. We also derive a method for computing the evidence (marginal likelihood).

If the reaction coordinate is well chosen, it is likely that an MCMC sampler targeting the biased posterior converges to equilibrium much faster than a standard MCMC sampler; see Lelièvre et al. (2008). Two questions are then in order: How to choose the reaction coordinate? And how to compute the free energy?

Concerning the choice of the reaction coordinate, the application of free energy based methods to mixture models is not straightforward. In many cases, samplers targeting a mixture posterior distribution are metastable: This term means that the trajectory generated by the algorithm remains stuck in the vicinity of some local attraction point for very long times, before moving to a different region of the accessible space where it again remains stuck. We show that the degree of metastability of a sampler targeting a mixture posterior distribution is strongly determined by certain hyper-parameters, typically those that calibrate in the prior the spread of each Gaussian component. We show that such hyper-parameters are good reaction coordinates, in the sense that (i) it is possible to efficiently compute the associated free energy (by adaptive algorithms, see below), (ii) the corresponding free energy biased dynamics explores quickly the (biased) posterior distribution, and (iii) the points sampled from the biased distribution have non-negligible importance weights with respect to the original target posterior distribution. Other reaction coordinates are discussed, such as the posterior log-density, which is also a good reaction coordinate (with the problem however that an appropriate range of variation for this reaction coordinate, which is needed for the method, is difficult to determine a priori). This reaction coordinate is the standard choice for the Wang-Landau algorithm, see for instance Atchadé and Liu (2010), Liang (2005) and Liang (2010) for related works in Statistics.

To compute the free energy, we resort to adaptive biasing algorithms (Darve and Pohorille, 2001; Hénin and Chipot, 2004; Marsili et al., 2006; Lelièvre et al., 2007). In contrast with the adaptive MCMC algorithms usually considered in the statistical literature (see the review of Andrieu and Thoms (2008) and references therein), these adaptive algorithms modify sequentially the targeted invariant distribution of the Markov chain, instead of the parameters of the Markov kernel. Such algorithms were initially designed to compute the free energy in molecular dynamics; see also Lelièvre et al. (2010) for a recent review of alternative standard techniques in molecular dynamics for computing the free energy, such as e.g. thermodynamic integration.

It is of course possible to combine our approach with other strategies aimed at overcoming multimodality, such as tempering methods; see e.g. Iba (2001); Neal (2001) and Celeux et al. (2000).

The paper is organized as follows. Section 2 presents the univariate Gaussian mixture model, and the difficulties encountered with classical MCMC algorithms. Section 3 describes the method we propose, which is based on free energy biased dynamics. Section 4 explains how to apply this method to Bayesian inference on Gaussian mixture models. Section 5 illustrates our approach with two real data-sets. Section 6 discusses further research directions, in particular how our approach may be extended to other Bayesian models.

## 2 Scientific context

### 2.1 Gaussian mixture posterior distribution

As explained in the introduction, we focus on the univariate Gaussian mixture model (1), associated with the following prior, taken from Richardson and Green (1997), which is symmetric with respect to the $K$ components:

$$\mu_k \sim N(m,\kappa^{-1}),\qquad \lambda_k \sim \mathrm{Gamma}(\alpha,\beta),\qquad \beta \sim \mathrm{Gamma}(g,h),\qquad (q_1,\ldots,q_{K-1}) \sim \mathrm{Dirichlet}_K(1,\ldots,1).$$

In our examples, the hyper-parameters $m$, $\kappa$, $\alpha$, $g$ and $h$ are set as in Jasra et al. (2005), as functions of $R$ and $M$, respectively the range and the mean of the observed data. The posterior density reads

$$\begin{aligned} p(\theta\mid y) &= \frac{1}{Z_K}\,p(\theta)\,p(y\mid\theta)\\ &= \frac{\kappa^{K/2}\,h^{g}\,\beta^{K\alpha+g-1}}{Z_K\,\Gamma(\alpha)^K\,\Gamma(g)\,(2\pi)^{\frac{n+K}{2}}} \left(\prod_{k=1}^{K}\lambda_k\right)^{\alpha-1} \exp\Big\{-\frac{\kappa}{2}\sum_{k=1}^{K}(\mu_k-M)^2-\beta\Big(h+\sum_{k=1}^{K}\lambda_k\Big)\Big\}\\ &\quad\times\prod_{i=1}^{n}\Big[\sum_{k=1}^{K}q_k\,\lambda_k^{1/2}\exp\Big\{-\frac{\lambda_k}{2}(y_i-\mu_k)^2\Big\}\Big].\end{aligned}\tag{2}$$

In this expression, $\theta$ is the vector

$$\theta=(q_1,\ldots,q_{K-1},\mu_1,\ldots,\mu_K,\lambda_1,\ldots,\lambda_K,\beta)\in\Omega=\mathcal{S}_K\times\mathbb{R}^K\times(\mathbb{R}_+)^{K+1},\tag{3}$$

the set

$$\mathcal{S}_K=\Big\{(q_1,\ldots,q_{K-1})\in(\mathbb{R}_+)^{K-1}:\ \sum_{i=1}^{K-1}q_i\le 1\Big\}$$

is the probability simplex, and

$$Z_K=\int_\Omega p(\theta)\,p(y\mid\theta)\,d\theta\tag{4}$$

is the normalizing constant (namely the marginal likelihood of the model with $K$ components), which depends on $y$ and $K$.
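For concreteness, the posterior density (2) can be evaluated in log scale up to the constants not depending on $\theta$ (including $Z_K$); a minimal sketch, with parameter names of our own choosing:

```python
import math

def log_post_unnorm(y, q, mu, lam, beta, M, kappa, alpha, g, h):
    """Log of the unnormalized posterior density (2): additive constants
    (Z_K, Gamma functions, powers of 2*pi) are dropped."""
    K = len(q)
    lp = (K * alpha + g - 1) * math.log(beta) - beta * (h + sum(lam))
    lp += sum((alpha - 1) * math.log(lk) for lk in lam)
    lp -= 0.5 * kappa * sum((mk - M) ** 2 for mk in mu)
    for yi in y:
        # likelihood contribution of one observation
        lp += math.log(sum(qk * math.sqrt(lk) * math.exp(-0.5 * lk * (yi - mk) ** 2)
                           for qk, mk, lk in zip(q, mu, lam)))
    return lp
```

Such an evaluator is all a random walk Hastings-Metropolis sampler needs, since the acceptance ratio only involves differences of unnormalized log-densities.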

The sampling of the posterior distribution with density (2) is the focus of the paper.

### 2.2 Metastability in Statistical Physics

In this section, we draw a parallel between the problem of sampling (2) and the problem of sampling Boltzmann-Gibbs probability measures that arise in computational Statistical Physics; see for instance (Balian, 2007). We take this opportunity to introduce some of the concepts and terms of computational Statistical Physics that are relevant in our context.

In computational Statistical Physics, one is often interested in computing average thermodynamic properties of the system under consideration. This requires sampling a Boltzmann-Gibbs (probability) measure

$$\pi(\theta)=\frac{1}{Z}\exp\{-V(\theta)\},\qquad Z=\int_\Omega\exp\{-V(\theta)\}\,d\theta,\tag{5}$$

where $\theta\in\Omega$, and $V$ is the potential of the system, assumed to be such that $Z<\infty$. From now on, the term potential refers to minus the logarithm of a given (possibly unnormalized) probability density: e.g. $V(\theta)=-\log\big(p(\theta)\,p(y\mid\theta)\big)$ for the posterior density (2).

Probability densities such as the posterior density (2) for mixture models, or the Boltzmann-Gibbs density (5) for systems in Statistical Physics, are often multimodal. Standard sampling strategies, for instance the Hastings-Metropolis algorithm (Metropolis et al., 1953; Hastings, 1970), generate samples which typically remain stuck in a small region around a local maximum (also called a local mode) of the sampled distribution (or, equivalently, a local minimum of the potential $V$). The sequences of samples generated by these algorithms are said to be metastable.

Figure 1 illustrates this phenomenon, in the Bayesian mixture context, for the posterior (2) and for two datasets (Fishery data, see Section 5.1, and Hidalgo stamps data, see Section 5.2). The posterior density admits at least $K!$ equivalent modes, but very few mode switchings (if any) are observed in the MCMC trajectories. A simple Gaussian random walk Hastings-Metropolis sampler is used, but we obtained the same type of plots (not shown) with Gibbs sampling.

The top right plot of Figure 1 also represents a Gaussian mixture probability density (in dashed line) which corresponds to a negligible (in terms of posterior probability) local mode of the posterior density for the Hidalgo dataset. In numerical tests not reported here, when this local mode is used as a starting point, the Hastings-Metropolis sampler needs a very large number of iterations to leave the attraction of this meaningless mode. It is easy to make this problem worse by slightly changing the data: for instance, the escape time increases further when 2 is added to the three largest $y_i$'s (while this local mode remains of very small posterior probability). This local mode illustrates the typical “genuine multimodality” of mixture posteriors mentioned in the introduction, a multimodality which cannot be cured by label permutation.

## 3 Free-energy biased sampling

Consider a generic probability density $\pi\propto\exp\{-V\}$, with $V$ as defined in (5). The principle of free energy biased sampling (described more precisely in Section 3.1) is to change the original density $\pi$ to the biased density:

$$\pi_A(\theta)\propto\pi(\theta)\exp\{(A\circ\xi)(\theta)\},$$

where $\xi$ is some real-valued function

$$\xi:\Omega\to[z_{\min},z_{\max}],$$

where $[z_{\min},z_{\max}]$ is a bounded interval, and the so-called free energy $A$ (see definition (6) below) is such that the distribution of $\xi(\theta)$, when $\theta$ is distributed according to $\pi_A$, is uniform over the interval $[z_{\min},z_{\max}]$. By sampling $\pi_A$ rather than $\pi$, the aim is to remove the metastability in the direction of $\xi$. Averages with respect to the original distribution of interest are finally obtained by standard importance sampling (see Section 3.3). We assume first that $\xi$ takes values in a bounded interval $[z_{\min},z_{\max}]$ (think of $\xi(\theta)=q_1$ and $[z_{\min},z_{\max}]=[0,1]$ for the mixture posterior distribution case), and postpone the discussion of how to treat reaction coordinates with values in an unbounded domain to Section 3.4.

An important part of the algorithm is to compute (an approximation of) the free energy $A$. There are many ways to do so. We describe a class of methods which are very efficient in the field of computational Statistical Physics (see Section 3.2) and which, to our knowledge, have not been used so far in Statistics.

### 3.1 Principle of the method

Consider the conditional probability measures associated with $\pi$:

$$\pi(d\theta\mid\xi(\theta)=z)=\frac{\exp\{-V(\theta)\}\,\delta_{\xi(\theta)-z}(d\theta)}{\int_{\Sigma(z)}\exp\{-V(\theta')\}\,\delta_{\xi(\theta')-z}(d\theta')},$$

where $\delta_{\xi(\theta)-z}(d\theta)$ is a measure with support

$$\Sigma(z)=\{\theta\in\Omega \mid \xi(\theta)=z\},$$

defined by the following formula: for all smooth test functions $\psi$ and $\varphi$,

$$\int_\Omega\psi(\xi(\theta))\,\varphi(\theta)\,d\theta=\int_{\mathbb{R}}\psi(z)\int_{\Sigma(z)}\varphi(\theta)\,\delta_{\xi(\theta)-z}(d\theta)\,dz.$$

The main assumption underlying free-energy biased methods is that the function $\xi$ is chosen so that the sampling of the conditional measures $\pi(\cdot\mid\xi(\theta)=z)$ is significantly “easier” than the sampling of $\pi$, at least for some values of $z$. In other words, $\pi(\cdot\mid\xi(\theta)=z)$ should be much less multimodal than $\pi$, at least for some values of $z$; see the discussion in Section 4.2 below.

To give a concrete example, consider the choice $\xi(\theta)=q_1$. In this case, $\pi(\cdot\mid\xi(\theta)=z)$ is the conditional posterior distribution of all variables except $q_1$, conditionally on $q_1=z$.

The function $\xi$ is called a reaction coordinate in the physics literature, because of its physical interpretation: this function parameterizes the progress of some chemical event at a coarser scale (chemical reaction or change of conformation, for example). Given the trajectory $(\theta_t)_{t\geq 0}$ of a Markov chain, $(\xi(\theta_t))_{t\geq 0}$ is typically a metastable trajectory, and varies on timescales much larger than the typical microscopic fluctuations of the system around its metastable configurations. Of course, in our Bayesian mixture context, this physical interpretation is not very useful. For the moment, we stick with the more generic (and informal) understanding mentioned at the beginning of this paragraph, i.e. the sampling of $\pi(\cdot\mid\xi(\theta)=z)$ should be “easier” than the sampling of $\pi$. We defer the important discussion on how to interpret and choose this “reaction coordinate” in our specific context to Section 4.2. We also refer the readers to Lelièvre et al. (2008); Lelièvre and Minoukadeh (2011) for a precise quantification of this concept in a functional analysis framework. Finally, although we consider only one-dimensional reaction coordinates in this paper, we mention that extensions to reaction coordinates with values in $\mathbb{R}^p$ with $p>1$ are possible (Lelièvre et al., 2010; Chipot and Lelièvre, 2010). Some algorithms allowing to switch between different reaction coordinates have also been developed (Piana and Laio, 2007).

The free energy is defined as

$$\exp\{-A(z)\}=\int_{\Sigma(z)}\exp\{-V(\theta)\}\,\delta_{\xi(\theta)-z}(d\theta),\tag{6}$$

see for instance Section 5.6 in Balian (2007). In other words, the free energy is minus the marginal log-density of the reaction coordinate. As above, let us again consider the simple example where the reaction coordinate is $\xi(\theta)=q_1$. Then, the free energy is simply, up to an additive constant, equal to

$$A(q_1)=-\log\Big(\int_{\mathcal{S}_{K-1}(q_1)\times\mathbb{R}^K\times(\mathbb{R}_+)^{K+1}}\exp\{-V(\theta)\}\,dq_2\ldots dq_{K-1}\,d\mu_1\ldots d\mu_K\,d\lambda_1\ldots d\lambda_K\,d\beta\Big),$$

where

$$\mathcal{S}_{K-1}(q_1)=\Big\{(q_2,\ldots,q_{K-1})\in(\mathbb{R}_+)^{K-2}:\ \sum_{i=2}^{K-1}q_i\le 1-q_1\Big\}.$$

In words, the free energy is in this case the opposite of the log-marginal density of $q_1$.

The free energy can be used to bias the target density as follows:

$$\pi_A(\theta)=\frac{1}{Z_A}\exp\{-V(\theta)+(A\circ\xi)(\theta)\}.$$

We refer to densities of this form as free energy-biased densities. The essential property of $\pi_A$ is that, by construction, the corresponding marginal distribution of $\xi(\theta)$ is uniform on the interval $[z_{\min},z_{\max}]$. A sampler targeting $\pi_A$ is thus much less likely to be metastable (namely, to get stuck around a local mode of the density) than a sampler targeting $\pi$, because (i) the former sampler should move freely along the direction defined by the reaction coordinate, since the marginal distribution of $\xi(\theta)$ is uniform, and (ii) we have assumed that the reaction coordinate is such that the conditional probability distributions $\pi(\cdot\mid\xi(\theta)=z)$ are easy to sample, at least for some values of $z$ (namely, that they do not have very separated modes).

Therefore, free energy-based methods aim at sampling $\pi_A$, in order to move freely across the sampling space. Then, $\pi$ is eventually recovered through an importance sampling step, from $\pi_A$ to $\pi$: for any test function $h$,

$$E_\pi(h)=\frac{E_{\pi_A}\big(h\,\exp\{-A\circ\xi\}\big)}{E_{\pi_A}\big(\exp\{-A\circ\xi\}\big)}.\tag{7}$$

We refer to Section 3.3 for further details. Note that for (7) to hold, $A$ only needs to be defined up to an additive constant.
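To illustrate the flattening property numerically, consider a toy two-dimensional density whose first coordinate (taken as the reaction coordinate) is bimodal, and approximate everything on a grid. This is a sketch of the principle only; the grid, the potential and the mode locations are arbitrary choices of ours:

```python
import numpy as np

# grid approximation of a bimodal 2-d density pi proportional to exp(-V)
x = np.linspace(-3.0, 3.0, 200)          # reaction coordinate: xi(theta) = theta_1
y = np.linspace(-3.0, 3.0, 200)
X, Y = np.meshgrid(x, y, indexing="ij")
V = 0.5 * ((np.abs(X) - 1.5) ** 2 / 0.1 + Y ** 2)   # two well-separated modes at x = -1.5, +1.5
pi = np.exp(-V)
pi /= pi.sum()

marg = pi.sum(axis=1)                    # marginal of xi under pi: strongly bimodal
A = -np.log(marg)                        # free energy, up to an additive constant
pi_A = pi * np.exp(A)[:, None]           # biased density pi_A proportional to pi * exp(A o xi)
marg_A = pi_A.sum(axis=1)
marg_A /= marg_A.sum()                   # biased marginal of xi: numerically uniform
```

Biasing by the exact free energy makes the marginal of the reaction coordinate flat, while leaving the conditional distributions given $\xi$ untouched.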

### 3.2 Computing the free energy by adaptive methods

In most cases, the free energy $A$ defined in (6) does not admit a closed-form expression, and must be estimated. There are nowadays many techniques to this end, with various degrees of efficiency and conceptual complexity. We present in this section some powerful algorithms, namely adaptive biasing methods, which are not so well known in the statistical literature. Of course, any other standard method, such as thermodynamic integration, could be used (see the book by Lelièvre et al. (2010) for a precise presentation of standard methods for free energy computations in the framework of computational Statistical Physics, as well as Gelman and Meng (1998) for a review from the viewpoint of Statistics).

#### 3.2.1 General structure of adaptive methods

In adaptive biasing Markov chain Monte Carlo methods, a time-varying biasing potential $A_t$ is considered. The biasing potential is sequentially updated in order to converge to the free energy $A$ in the long-time limit. As already mentioned in the introduction, the term “adaptive” refers in this paper to the dynamic adaptation of the targeted probability measure, and not of the parameters of a Markov kernel used in the simulations. Specifically, at iteration $t$, the time-varying targeted density is

$$\pi_{A_t}(\theta)=\frac{1}{Z_{A_t}}\exp\{-V(\theta)+(A_t\circ\xi)(\theta)\}.\tag{8}$$

An adaptive MCMC algorithm simulates a non-homogeneous Markov chain $(\theta_t)_{t\geq 0}$, using the two following steps at iteration $t$:

1. an MCMC move according to the current biased target defined in (8),

$$\theta_t\sim K_t(\theta_{t-1},\cdot),$$

where $K_t$ is a Markov kernel leaving the current biased density invariant;

2. the update of the bias $A_{t-1}$ into $A_t$, using a trajectory average; see Section 3.2.2 below.

The first step may be done using a Hastings-Metropolis kernel, for instance; see Figure 2.

Before explaining the second step, let us mention how the discretization of the reaction coordinate values for the biasing potential is done in practice. A simple strategy, which we adopt in this paper, is to use predefined bins, and to approximate the biasing potential or its derivative (with respect to $z$) by piecewise constant functions. Specifically, we consider $N_z$ bins of equal size $\Delta z$,

$$[z_{\min},z_{\max}]=\bigcup_{i=0}^{N_z-1}[z_i,z_{i+1}],\qquad z_i=z_{\min}+i\,\Delta z,\qquad \Delta z=\frac{z_{\max}-z_{\min}}{N_z}.$$

Other discretizations may of course be used, but this is not the focus of this paper.
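With these conventions, locating the bin of a given reaction coordinate value is a one-line operation; a small helper, with names of our own choosing:

```python
def bin_index(z, zmin, zmax, Nz):
    """Index i of the bin [z_i, z_{i+1}) containing z, among Nz equal bins
    covering [zmin, zmax]; out-of-range values are clamped to the boundary bins."""
    dz = (zmax - zmin) / Nz
    i = int((z - zmin) // dz)
    return min(max(i, 0), Nz - 1)
```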

#### 3.2.2 Strategies for updating the bias

Recall that the bottom line of adaptive methods is that $A_t$ should converge to $A$. Adaptive biasing methods can be classified into two categories, depending on whether it is the free energy $A$, or its derivative $A'$ with respect to $z$, which is updated. Instances of the first strategy, called adaptive biasing potential (ABP) methods, include nonequilibrium metadynamics (Bussi et al., 2006; Raiteri et al., 2006), the Wang-Landau algorithm (Wang and Landau, 2001b, a) and Self-Healing Umbrella Sampling (Marsili et al., 2006). The adaptive biasing force (ABF) methodology (Darve and Pohorille, 2001; Hénin and Chipot, 2004; Lelièvre et al., 2007), which is the main adaptive method used in this paper, is an instance of the second. From now on, we focus on two particular strategies, one belonging to the ABP class, and the other to the ABF class.

The ABP strategy we choose is based on Marsili et al. (2006). In particular, we do not use the Wang-Landau algorithm, which is, to our knowledge, the only ABP method discussed so far in the statistical literature; see e.g. Atchadé and Liu (2010), Liang (2005) and Liang (2010). Indeed, a delicate point with the Wang-Landau approach is how to choose the vanishing rate of the “gain factor”; see e.g. Liang (2005). On the other hand, Self-Healing Umbrella Sampling does not involve such an additional parameter to be tuned. It consists in updating the biasing potential as follows. The biasing potential is initially set to $A_0(z)=0$ for all $z$, and then updated, for all $i=0,\ldots,N_z-1$ and for $t\geq 1$, as

$$\forall z\in[z_i,z_{i+1}),\qquad \exp\{-A_t(z)\}=\frac{1}{Z_t}\Big(1+\sum_{j=1}^{t-1}\mathbf{1}_{\{z_i\le\xi(\theta_j)<z_{i+1}\}}\,\exp\{-A_j(\xi(\theta_j))\}\Big),\tag{9}$$

the normalization factor $Z_t$ being such that

$$\Delta z\sum_{i=0}^{N_z-1}\exp\Big\{-A_t\Big(\frac{z_i+z_{i+1}}{2}\Big)\Big\}=1.$$

The method may be understood as follows: if $\theta_j$ was indeed distributed according to $\pi_{A_j}$ at all times $j$, then the weight proportional to $\exp\{-A_j(\xi(\theta_j))\}$ would correct for the bias introduced at iteration $j$, and the right-hand side would indeed be an estimator of the probability that $\xi(\theta)\in[z_i,z_{i+1})$ when $\theta$ is distributed according to the unbiased density $\pi$, that is $\exp\{-A(z)\}$ up to a normalization. Of course, since the bias is varying in time, it is not exactly true that $\theta_j$ is distributed according to $\pi_{A_j}$, but this reasoning yields at least an intuition on the way the method is built. It is easy to check that, provided the method converges, the only possible limit for the biasing potential is the free energy $A$ (up to the discretization error introduced by the binning of the reaction coordinate values). Generalizations of the update equation (9) leading to higher efficiencies have recently been proposed in Dickson et al. (2010).
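This weighted-histogram intuition can be sketched in a few lines. The sketch below is a simplification under our own conventions (piecewise constant bias on the bin grid; the `1 +` term plays the role of the initialization), not the exact published update:

```python
import numpy as np

def abp_update(hist, A, xi_t, zmin, zmax, Nz):
    """One Self-Healing Umbrella Sampling style step (a sketch): accumulate the
    bias-correcting weight exp(-A_t(xi_t)) in the bin containing xi_t, then
    recompute the biasing potential as minus the log of the normalized
    weighted histogram."""
    dz = (zmax - zmin) / Nz
    i = min(max(int((xi_t - zmin) // dz), 0), Nz - 1)
    hist[i] += np.exp(-A[i])                          # weight corrects the current bias
    dens = (1.0 + hist) / np.sum((1.0 + hist) * dz)   # estimated marginal of xi under pi
    return hist, -np.log(dens)
```

At a fixed point of this recursion, the weighted histogram is proportional to the marginal density of $\xi$ under $\pi$, so the bias is (a binned version of) the free energy.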

Alternatively, the ABF strategy (Darve and Pohorille, 2001; Hénin and Chipot, 2004; Lelièvre et al., 2007) is based on the following formula for the derivative of $A$ (called the mean force):

$$A'(z)=F(z)=E_\pi\big(f(\theta)\mid\xi(\theta)=z\big),\tag{10}$$

where $f$ admits an analytic expression in terms of $V$ and $\xi$:

$$f=\frac{\nabla V\cdot\nabla\xi}{|\nabla\xi|^2}-\operatorname{div}\Big(\frac{\nabla\xi}{|\nabla\xi|^2}\Big),\tag{11}$$

where $\nabla$ is the gradient operator, and $\operatorname{div}$ is the divergence operator. As shown in Lelièvre et al. (2010), formulas (10)-(11) may be derived from the definition (6) of the free energy using the co-area formula (Evans and Gariepy, 1992; Ambrosio et al., 2000). In the simple case $\xi(\theta)=\theta_1$, it is easy to prove that (10)-(11) hold, and that $f=\partial_{\theta_1}V$. As mentioned above, the conditional measure of $\pi$ given $\xi(\theta)=z$ is the same as the conditional measure of $\pi_{A_t}$ given $\xi(\theta)=z$, since the bias is a function of $\xi$ only. Thus, a natural ABF updating strategy is to compute at iteration $t$ the following approximation of the mean force: for all $i=0,\ldots,N_z-1$ and for $t\geq 1$,

$$\forall z\in[z_i,z_{i+1}),\qquad F_t(z)=\frac{\sum_{j=1}^{t-1}f(\theta_j)\,\mathbf{1}_{\{z_i\le\xi(\theta_j)\le z_{i+1}\}}}{\sum_{j=1}^{t-1}\mathbf{1}_{\{z_i\le\xi(\theta_j)\le z_{i+1}\}}}.\tag{12}$$

From this approximation of the mean force, an approximation $A_t$ of the free energy $A$ can be recovered by integrating $F_t$ in $z$. The consistency of the method may be understood as follows: if $\theta_j$ was distributed according to $\pi_{A_j}$ at all times $j$, then we would have $F_t\simeq F$ and hence $A_t\simeq A$ (up to an additive constant). Besides, as above, it can be shown that, provided the method converges, the biasing potential converges to (a discretized version of) the free energy $A$, up to an additive constant.
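The binned estimate (12), together with the integration step recovering the potential on the grid points, can be sketched as follows (vectorized over a whole trajectory; names are ours):

```python
import numpy as np

def abf_mean_force(xi_samples, f_samples, zmin, zmax, Nz):
    """Binned estimate (12) of the mean force F(z) = E[f(theta) | xi(theta) = z],
    followed by integration in z to recover the biasing potential on the grid."""
    dz = (zmax - zmin) / Nz
    xi = np.asarray(xi_samples, dtype=float)
    f = np.asarray(f_samples, dtype=float)
    idx = np.clip(((xi - zmin) // dz).astype(int), 0, Nz - 1)
    F = np.zeros(Nz)
    for i in range(Nz):
        mask = idx == i
        if mask.any():
            F[i] = f[mask].mean()        # conditional average of f in bin i
    # A' = F, so A is recovered on the grid points z_0, ..., z_Nz by a cumulative sum
    A = np.concatenate(([0.0], np.cumsum(F * dz)))
    return F, A
```

The additive constant of the recovered potential (here pinned by $A(z_{\min})=0$) is irrelevant, as noted after (7).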

The interest of ABP compared to ABF is that it does not require computing the function $f$ given by (11), which may be cumbersome for some $\xi$, such as $\xi=V$ (minus the log posterior density). On the other hand, it is observed that the ABF method yields very good results: since the derivative of the free energy is approximated, the adaptive biasing potential obtained after integration is smoother in $z$ for ABF than for ABP. In the following, we use the ABF method, except when we consider the potential $V$ itself as a reaction coordinate; in this case, the ABP method is used.

The convergence of the adaptive biasing force method (for a slightly different dynamics) has been studied in Lelièvre et al. (2008), and its associated discretization using many replicas of the simulated Markov chain has been considered in Jourdain et al. (2010). For refinements concerning the implementation of such a strategy, we refer to Lelièvre et al. (2007).

#### 3.2.3 Practical implementation of adaptive algorithms

To summarize the method, we give in Figure 2 the details of the ABF algorithm. A similar algorithm is used in the ABP case. In practice, we stop the algorithm when the bias is no longer significantly modified between two convergence checks, which are performed every fixed number of iterations. See Section 5.1.1 for an illustration of this strategy.

The bias obtained in this way is then considered a good approximation of the free energy, and used in the subsequent importance sampling step, as explained in Section 3.3. Perfect convergence is not required, in the sense that the bias does not need to be estimated very accurately, as it is removed in the final importance sampling step. Moreover, note that the biasing potential needs to be computed only up to an additive constant, which does not play any role in the overall procedure.

### 3.3 Reweighting free-energy biased simulations

Upon stabilization of the adaptive algorithm at some iteration, an estimate $\hat{A}$ of the biasing potential is obtained, from which one defines the biased density

$$\tilde\pi(\theta)=\pi_{\hat A}(\theta)=\frac{1}{\tilde Z}\,\pi(\theta)\exp\{\hat A\circ\xi(\theta)\}.\tag{13}$$

To sample the true posterior $\pi$, we use the following simple strategy. We simulate a standard MCMC algorithm, e.g. a random walk Hastings-Metropolis algorithm, targeting the biased posterior density $\tilde\pi$, and then perform an importance sampling step from $\tilde\pi$ to $\pi$, based on the importance sampling weights:

$$w(\theta)=\exp\{-\hat A\circ\xi(\theta)\}\propto\frac{\pi(\theta)}{\tilde\pi(\theta)}.\tag{14}$$

From the MCMC chain targeting $\tilde\pi$, the expectation with respect to $\pi$ of a test function $h$ can thus be estimated as (see (7)):

$$E_\pi(h)=\frac{E_{\tilde\pi}(hw)}{E_{\tilde\pi}(w)}\simeq\frac{\sum_{t=1}^{t_{\max}}h(\theta_t)\,w(\theta_t)}{\sum_{t=1}^{t_{\max}}w(\theta_t)},\tag{15}$$

where $t_{\max}$ is the number of iterations of the MCMC chain.
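Estimator (15) is a few lines of code once the chain and the converged bias are available; a sketch, where `A_hat` is assumed to be any callable returning the (binned) bias value:

```python
import numpy as np

def reweighted_mean(h_vals, xi_vals, A_hat):
    """Self-normalized importance sampling estimate (15) of E_pi[h], computed
    from a chain targeting the biased density; weights are w = exp(-A_hat(xi))."""
    w = np.exp(-np.array([A_hat(z) for z in xi_vals]))
    return float(np.sum(np.asarray(h_vals) * w) / np.sum(w))
```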

### 3.4 Reaction coordinates with unbounded values

There are many cases where the reaction coordinate takes values in an unbounded interval $I$. Here $I$ should be understood as the support of the distribution of the random variable $\xi(\theta)$ when $\theta\sim\pi$. One may think of $\xi(\theta)=\beta$ as an example for the mixture posterior distribution case (in which case $I=(0,+\infty)$); see Section 4.2 for more examples.

It is not possible to apply the above procedure on the whole interval $I$. Some truncation is required, for at least two reasons. First, numerically, it would be difficult to discretize in space a function defined on an unbounded domain. Second (and more importantly), the use of the full free energy over $I$ would lead to a biased density which is not integrable (since the uniform law over an unbounded interval is not well defined as a probability distribution).

We therefore resort to the following strategy. First, we choose some truncation interval $[z_{\min},z_{\max}]\subset I$. Then, in the adaptive MCMC algorithm (which calculates the free energy), we reject any proposed point $\theta$ such that $\xi(\theta)$ falls outside of this interval. This is tantamount to restricting the sampling space with the constraint $\xi(\theta)\in[z_{\min},z_{\max}]$. In this way, one obtains an estimate $\hat A(z)$ of the free energy, but only for $z\in[z_{\min},z_{\max}]$.

Once $\hat A$ is obtained, we simply extend its definition outside $[z_{\min},z_{\max}]$ as follows: $\hat A(z)=\hat A(z_{\min})$ for $z<z_{\min}$, and $\hat A(z)=\hat A(z_{\max})$ for $z>z_{\max}$. Finally, we run a standard MCMC sampler targeting the distribution $\tilde\pi\propto\pi\exp\{\hat A\circ\xi\}$, as described in the previous section. Note that this biased distribution is defined over the whole parameter space (in particular, no additional rejection step is needed in the sampling of this distribution).
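This constant continuation of the bias outside the truncation interval is conveniently wrapped as a closure (a sketch; names are ours):

```python
def extend_free_energy(A_hat, zmin, zmax):
    """Extend a free-energy estimate defined on [zmin, zmax] to the whole line
    by its boundary values: A(z) = A(zmin) for z < zmin, A(z) = A(zmax) for z > zmax."""
    def A_ext(z):
        # clamp z to the truncation interval before evaluating the estimate
        return A_hat(min(max(z, zmin), zmax))
    return A_ext
```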

In practice, one should choose an interval $[z_{\min},z_{\max}]$ which is not too large, but at the same time such that the probability (with respect to $\pi$) of the event $\{\xi(\theta)\in[z_{\min},z_{\max}]\}$ is close to one:

$$\frac{\int_{z_{\min}}^{z_{\max}}\exp(-A(z))\,dz}{\int_{I}\exp(-A(z))\,dz}\simeq 1,\tag{16}$$

so that the region outside $[z_{\min},z_{\max}]$ is barely visited (see also Section 4.1.2). This is one of the practical difficulties that we shall discuss in the next section.

## 4 Bayesian inference from free energy biased dynamics

In this section, we explain how to perform Bayesian inference for the univariate Gaussian mixture model described in Section 2.1, that is, how to compute quantities such as posterior expectations and ratios of marginal likelihoods (equal to ratios of the normalizing constants $Z_K$ defined in (4)), using the free energy associated to a given reaction coordinate to build an importance function. The Gaussian mixture model corresponds, in the notation of Section 3, to $\pi(\theta)=p(\theta\mid y)$, hence $V(\theta)=-\log\big(p(\theta)\,p(y\mid\theta)\big)$. For a given reaction coordinate $\xi$, and a given estimate $\hat A$ of the free energy, the free-energy biased probability distribution is

$$\tilde p(\theta\mid y)\propto p(\theta\mid y)/w(\theta)\propto p(\theta\mid y)\exp\{\hat A\circ\xi(\theta)\},$$

where $w$ is defined by (14).

We start by listing the criteria we use to assess the quality of the importance sampling procedure in Section 4.1. As mentioned in the introduction, the strategy to sample the posterior distribution (2) consists of three steps: choosing a reaction coordinate, computing (an approximation of) the free energy associated to this reaction coordinate, and using the free energy to build an importance sampling proposal distribution according to (13). The previous section was devoted to the second and third steps. We discuss the first step in Section 4.2 for the mixture model at hand. Section 4.3 presents an extension of the method to the computation of the ratio of normalizing constants associated to different values of the number of components $K$, in order to perform model choice between models corresponding to different numbers of components. Note that we discuss in this section the theoretical efficiency of the whole approach; these discussions are supported by numerical experiments in Section 5.

### 4.1 Criteria for choosing the reaction coordinate

#### 4.1.1 General criteria

We consider the following criteria for evaluating the practical efficiency of the whole procedure, for a given choice of the reaction coordinate $\xi$:

(a) In the execution of the (either ABF or ABP) adaptive algorithm, how fast does the approximate free energy $A_t$ converge to its limit $A$?

(b) How efficient is the importance sampling step from the biased distribution $\tilde\pi$ to the originally targeted posterior distribution $\pi$? Actually, this criterion is twofold:

(b1) How efficient is the MCMC sampling of the biased density $\tilde\pi$?

(b2) How representative are the points simulated from the biased distribution with respect to the target posterior distribution? (i.e. how many of these points are assigned non-negligible importance weights?)

(c) A more practical criterion (in the case of a reaction coordinate with values in an unbounded domain) is: how difficult is it to determine, a priori, an interval $[z_{\min},z_{\max}]$ of reaction coordinate values which ensures good performance with respect to Criteria (a) and (b) and which satisfies (16)?

Criterion (b2) is discussed in the next section. Criteria (a) and (b1) can actually be shown to be closely related, at least for some families of adaptive methods; see Lelièvre et al. (2008). Roughly speaking, an adaptive algorithm quickly yields an estimate of the free energy if and only if the free energy is indeed a good biasing potential, in the sense that the dynamics driven by the biased potential converges quickly to a limiting distribution. Theoretically, and as mentioned above, a sufficient condition for an efficient sampling is that the conditional probability distributions $\pi(\cdot\mid\xi(\theta)=z)$ are easy to sample, at least for some values of $z$ (namely, they do not have very separated modes). We refer to Lelièvre et al. (2008); Lelièvre and Minoukadeh (2011) for precise mathematical results.

Numerically, to assess the convergence of adaptive methods, we recommend the following two basic checks: (i) that the output $\xi(\theta_t)$ of the adaptive algorithm has explored the full range $[z_{\min}, z_{\max}]$ and has a distribution which is close to uniform; and (ii) using the criterion mentioned in the introduction, specific to mixture posterior distributions, that the algorithm has visited the symmetric replicates of any significant local mode. The same convergence checks can be applied to the MCMC dynamics targeting the biased posterior (13).
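Check (i) is easy to automate once the reaction-coordinate values visited by the chain have been recorded. The following sketch (function name and binning choices are ours, not from the paper) bins those values and reports the maximal relative deviation of the bin counts from uniform occupancy; values near zero indicate a nearly flat histogram over $[z_{\min}, z_{\max}]$.

```python
import numpy as np

def occupancy_deviation(z_samples, z_min, z_max, n_bins=20):
    """Maximal relative deviation of bin counts from uniform occupancy.

    `z_samples` holds the reaction-coordinate values xi(theta_t) recorded
    along the chain; a result near 0 means the values are close to
    uniformly distributed over [z_min, z_max] (check (i)).
    """
    counts, _ = np.histogram(z_samples, bins=n_bins, range=(z_min, z_max))
    expected = len(z_samples) / n_bins
    return float(np.max(np.abs(counts - expected)) / expected)

# A sample putting 50 points in each of the 20 bins gives deviation 0.
z = np.repeat(np.linspace(0.05, 1.95, 20), 50)
print(occupancy_deviation(z, 0.0, 2.0))  # 0.0
```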

#### 4.1.2 Efficiency of the importance sampling step

We give here a way to quantify Criterion (b2). To evaluate the performance of the importance sampling step, we compute the following efficiency factor:

$$\mathrm{EF}=\frac{\left(\sum_{t=1}^{T}w(\theta_t)\right)^{2}}{T\sum_{t=1}^{T}w(\theta_t)^{2}},$$

where $w$ is defined in (14), and where $(\theta_t)_{1\le t\le T}$ denotes the MCMC sample targeting the biased posterior, as described in Section 3.3. The efficiency factor is the effective sample size of Kong et al. (1994) divided by the number of sampled values. This quantity lies in $(0,1]$. It is close to one (resp. to zero) when the random variable $w(\theta)$ has a small (resp. a large) variance. Indeed, it is easy to check that

$$\mathrm{EF}=\left(\frac{\operatorname{Var}_T(w)}{\left(\mathbb{E}_T(w)\right)^{2}}+1\right)^{-1},\qquad \operatorname{Var}_T(w)=\frac{1}{T}\sum_{t=1}^{T}w(\theta_t)^{2}-\left[\mathbb{E}_T(w)\right]^{2},$$

where $\operatorname{Var}_T(w)$ is the empirical variance of the sample $(w(\theta_t))_{1\le t\le T}$, and $\mathbb{E}_T(w)=T^{-1}\sum_{t=1}^{T}w(\theta_t)$ its empirical average.
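In practice, the efficiency factor is a one-line computation on the vector of importance weights. A minimal sketch (the function name is ours):

```python
import numpy as np

def efficiency_factor(w):
    """EF = (sum_t w_t)^2 / (T * sum_t w_t^2), i.e. the effective sample
    size divided by T; equals 1 for constant weights and approaches 1/T
    when a single weight dominates."""
    w = np.asarray(w, dtype=float)
    return float(w.sum() ** 2 / (w.size * np.sum(w ** 2)))

print(efficiency_factor(np.ones(100)))        # 1.0
print(efficiency_factor([1.0] + [0.0] * 99))  # 0.01
```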

We now propose an estimate of the efficiency factor in terms of the converged bias $\hat A$ only, which may therefore be computed before the MCMC algorithm targeting the biased posterior is run and the importance sampling step is performed. This estimate is based on the fact that, with respect to the biased posterior, the marginal distribution of $z=\xi(\theta)$ is approximately uniform over $[z_{\min}, z_{\max}]$. For well-chosen $z_{\min}$ and $z_{\max}$, $\xi(\theta)$ hardly visits values outside the interval $[z_{\min}, z_{\max}]$ (see (16) above), and thus

$$\frac{\operatorname{Var}_T(w)}{\left(\mathbb{E}_T(w)\right)^{2}}\simeq\frac{\dfrac{1}{z_{\max}-z_{\min}}\displaystyle\int_{z_{\min}}^{z_{\max}}\left(\exp\{-\hat A(z)\}-\frac{1}{z_{\max}-z_{\min}}\int_{z_{\min}}^{z_{\max}}\exp\{-\hat A\}\right)^{2}\mathrm{d}z}{\left(\dfrac{1}{z_{\max}-z_{\min}}\displaystyle\int_{z_{\min}}^{z_{\max}}\exp\{-\hat A(z)\}\,\mathrm{d}z\right)^{2}},$$

which provides a justification for the following “theoretical” efficiency factor:

$$\mathrm{EF}_{\mathrm{theoretical}}=\frac{\left(\int_{z_{\min}}^{z_{\max}}\exp\{-\hat A(z)\}\,\mathrm{d}z\right)^{2}}{(z_{\max}-z_{\min})\int_{z_{\min}}^{z_{\max}}\exp\{-2\hat A(z)\}\,\mathrm{d}z}. \tag{17}$$

The agreement between theoretical and numerically computed efficiency factors in our simulations is very good; see Tables 1, 2 and 4 in Section 5.1.3. The theoretical efficiency factor thus allows for a quick a priori check that the subsequent importance sampling is reasonably efficient.

From the expression (17), it is seen that the efficiency factor is close to one when $\exp\{-\hat A\}$ is close to a constant. Thus, Criterion (b2) mentioned in the previous section is likely to be satisfied if the free energy associated to $\xi$ has a small amplitude, i.e. if the variations of $\hat A$ over $[z_{\min}, z_{\max}]$ are as small as possible.
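The theoretical efficiency factor (17) can be evaluated by numerical quadrature as soon as the converged bias is tabulated on the free-energy grid. The sketch below (names are ours) uses the trapezoidal rule; subtracting the minimum of the bias before exponentiating leaves (17) unchanged (additive shifts cancel between numerator and denominator) while avoiding overflow.

```python
import numpy as np

def _trapz(f, dz):
    """Trapezoidal rule on a uniform grid with spacing dz."""
    return dz * (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1])

def theoretical_ef(A_hat, z_min, z_max):
    """Quadrature approximation of the theoretical efficiency factor (17),
    from bias values A_hat tabulated on a uniform grid over [z_min, z_max]."""
    A = np.asarray(A_hat, dtype=float) - np.min(A_hat)
    dz = (z_max - z_min) / (A.size - 1)
    num = _trapz(np.exp(-A), dz) ** 2
    den = (z_max - z_min) * _trapz(np.exp(-2.0 * A), dz)
    return float(num / den)

# A flat bias gives an efficiency factor of one; any variation lowers it.
print(theoretical_ef(np.zeros(50), 0.0, 2.0))  # ~1.0
```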

### 4.2 Practical choice of the reaction coordinate

We now discuss the practical choice of the reaction coordinate in the mixture posterior sampling context, with respect to the criteria listed above. We successively discuss the following four possible choices: $\xi(\theta)=\mu_1$ (a component mean), $\xi(\theta)=-\log p(\theta|y)$ (minus the posterior log-density), $\xi(\theta)=q_1$ (a component weight), and $\xi(\theta)=\beta$ (the hyper-parameter of the prior on the component precisions). This discussion is also illustrated numerically in Section 5.1.2.

The requirement that the multimodality of the target measure conditional on $\xi(\theta)=z$ be much less noticeable than the multimodality of the original target measure rules out $\xi(\theta)=\mu_1$ as a good reaction coordinate since, conditionally on $\mu_1$, the posterior density still has at least $(K-1)!$ modes, as the remaining components stay exchangeable. Numerical tests indeed support these considerations, see below.

A more natural reaction coordinate is minus the posterior log-density, $\xi(\theta)=-\log p(\theta|y)$, in the spirit of the original Wang-Landau algorithm (Wang and Landau, 2001b, a). Indeed, exploring regions of low posterior density should help to escape more easily from local modes. Unfortunately, determining a range $[z_{\min}, z_{\max}]$ of "likely values" (with respect to the posterior distribution) for such a function of $\theta$ is not straightforward; see Criterion (c) above. Moreover, since the posterior density is expected to be multimodal and difficult to explore, there seems to be little point in performing MCMC pilot runs in order to determine this range. A conservative approach is to choose a very wide interval $[z_{\min}, z_{\max}]$, but this makes the subsequent importance sampling step quite inefficient. In our simulations, we report satisfactory results for this reaction coordinate, but with the caveat that our choice of interval was facilitated by our other simulation exercises, based on the remaining reaction coordinates. Another practical difficulty we observed in one case is that the estimated bias is quite inaccurate in the immediate vicinity of the posterior mode, because the free energy tends to increase very sharply in this region; see Section 5.2 for more details.

The choice $\xi(\theta)=q_1$ is satisfactory with respect to Criterion (c): the range on which it can vary, namely $[0,1]$, is clearly known. With respect to (a), this choice looks appealing as well, since forcing $q_1$ to get close to 1 should empty the other components, which may then swap more easily. Unfortunately, we observe in some of our experiments that the dynamics biased by the free energy associated with this reaction coordinate is not very successful in terms of mode switchings; see Figures 6 and 13.

Finally, $\xi(\theta)=\beta$ appears to be a good trade-off with respect to our criteria, at least in the examples we consider below. Concerning the determination of the interval $[z_{\min}, z_{\max}]$ (Criterion (c)), since $\beta$ determines the order of magnitude of the component variances, there should be high posterior probability that $\beta$ is a small fraction of $R^2$, where $R$ is the range of the data. We obtain satisfactory results in all our experiments with an interval chosen according to this rule. Concerning Criterion (a), we observe that this choice performs well (see the numerical results below).

We propose the following explanation. Since the component precisions have a Gamma prior with rate $\beta$, large values of $\beta$ penalize large values of the precisions, or equivalently penalize small values of the component standard deviations. If $\beta$ is large enough, the Gaussian components are forced to cover the complete range of the data, and thus can switch easily. This interesting phenomenon is illustrated by Figure 8; see below for further details. In other words, a "good" reaction coordinate should be such that the conditional probability distributions given $\xi(\theta)=z$ are less multimodal than the posterior itself, at least for some values of $z$. For a theoretical result supporting this interpretation, we refer to Lelièvre and Minoukadeh (2011).

### 4.3 Computing normalizing constants and model choice

In this section, we discuss an extension of the method to perform model choice between models with different numbers of components. The principle is to compute the normalizing constant $Z_K$ of the posterior density for different values of the number of components $K$; see (4) for a definition of $Z_K$. More precisely, it is sufficient to evaluate the ratio $Z_K/Z_{K-1}$ for a given range of $K$ (see (Robert and Casella, 2004, Chap. 7) for a review on Bayesian model choice).

We propose the following strategy. The ratio $Z_K/Z_{K-1}$ can be estimated by first estimating $Z_K/\tilde Z_K$, then estimating $Z_{K-1}/\tilde Z_K$, and finally dividing the two quantities, where $\tilde Z_K$ denotes the normalizing constant of the biased posterior (13). A simple estimator of $Z_K/\tilde Z_K$ is given by

$$\hat I_{K}=\frac{1}{T}\sum_{t=1}^{T}w(\theta_t),$$

where $(\theta_t)_{1\le t\le T}$ is a sample distributed according to the biased posterior (13) (with $K$ normal components). This formula is based on the fact that the expectation of $w(\theta)$ with respect to the biased posterior is $Z_K/\tilde Z_K$.

Let $\theta_{-k}$ denote the parameter vector obtained by removing from $\theta$ the parameters attached to a given component $k$, and renormalizing the weights of the remaining components so that they sum to one. Let $p(y|\theta_{-k})$ denote the likelihood of the model with $K-1$ components and parameter $\theta_{-k}$. Then the following quantity

$$\hat I_{K-1}=\frac{1}{K}\sum_{k=1}^{K}\hat I_{K-1,k},\qquad \hat I_{K-1,k}=\frac{1}{T}\sum_{t=1}^{T}w_{-k}(\theta_t),$$

where $(\theta_t)_{1\le t\le T}$ is the same Markov chain as above, and

$$w_{-k}(\theta)=\frac{p(y|\theta_{-k})}{p(y|\theta)}\exp\{-\hat A\circ\xi(\theta)\}, \tag{18}$$

is an estimator of $Z_{K-1}/\tilde Z_K$.

The estimators $\hat I_K$ and $\hat I_{K-1}$ are reminiscent of the birth and death moves of the reversible jump algorithm of Richardson and Green (1997), where a new model is proposed by adding or removing a component chosen at random. The difference is that the biased posterior acts as an intermediate state between the posterior with $K$ components and the posterior with $K-1$ components (or, more precisely, the posterior with $K-1$ components times the prior of a $K$-th "non-acting" component, in order to match the dimensionality of $\theta$ and $\theta_{-k}$).

In our numerical experiments, the estimator of $Z_K/Z_{K-1}$ obtained from this strategy performs well; see Section 5 (in particular Table 3).
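Given the weights $w(\theta_t)$ and $w_{-k}(\theta_t)$ evaluated along the chain targeting the biased posterior, the strategy above reduces to two empirical averages and a division. A minimal sketch (array and function names are ours):

```python
import numpy as np

def evidence_ratio(w, w_minus):
    """Estimate Z_K / Z_{K-1} as I_K / I_{K-1}.

    `w`       : length-T array of unbiasing weights w(theta_t), cf. (14);
    `w_minus` : K x T array whose k-th row holds w_{-k}(theta_t), cf. (18).
    The overall mean of `w_minus` is exactly (1/K) sum_k (1/T) sum_t.
    """
    I_K = np.mean(w)                # hat I_K
    I_K_minus_1 = np.mean(w_minus)  # hat I_{K-1}, averaged over k and t
    return float(I_K / I_K_minus_1)

# Toy check with constant weights: the estimator returns their ratio.
print(evidence_ratio(np.full(10, 2.0), np.full((3, 10), 0.5)))  # 4.0
```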

## 5 Numerical examples

In our experiments, and as explained above, we use the following approach. First, we run an adaptive algorithm (ABF or ABP, depending on the reaction coordinate), for a given choice of the reaction coordinate $\xi$ and a given interval $[z_{\min}, z_{\max}]$, until a converged bias $\hat A$ is obtained. Second, we run a MCMC algorithm with the biased posterior (13) as invariant density. Third, we perform an importance sampling step from the biased posterior to the unbiased posterior density. See the introduction of Section 4 for the notation.

The quality of the biasing procedure is assessed using the criteria of Section 4.1. This consists in: (i) checking that the reaction coordinate values are uniformly sampled over $[z_{\min}, z_{\max}]$; (ii) checking that the output is symmetric with respect to labellings, and that many switchings between the modes are observed; and (iii) computing the efficiency factor (a good a priori indicator being the estimator (17), defined in terms of $\hat A$ only).

In the first step of the method (approximation of the free energy using an adaptive algorithm), we deliberately use the simplest exploration strategy, namely a Gaussian random walk Hastings-Metropolis update with small scales. This is to illustrate that the ability of adaptive algorithms to approximate the free energy does not crucially depend on a finely tuned updating strategy.

In the second step, we run a Hastings-Metropolis algorithm targeted at the biased posterior, using Cauchy random walk proposals with scales proportional to the range $R$ of the data, which leads to reasonable acceptance rates in all cases.

### 5.1 A first example : the Fishery data

We first consider the Fishery data of Titterington et al. (1986) (see also Frühwirth-Schnatter (2006)), which consist of the lengths of 256 snappers, and a Gaussian mixture model with $K$ components; see Figure 1 for a histogram of the data.

#### 5.1.1 Convergence of the adaptive algorithms

In the adaptive algorithm, we use Gaussian random walk proposals with small scales; the same parameters were used to produce the unbiased trajectory in Figure 1. We illustrate here the convergence process in the case $\xi(\theta)=\beta$, using the ABF algorithm described in Section 3.2.

First, we plot in Figure 3 the trajectories of $\beta_t$ and of the component means $\mu_{k,t}$. With the ABF algorithm, the values visited by $\beta$ cover the whole interval $[z_{\min}, z_{\max}]$, and the applied bias enables frequent switching of the modes (observed here on the means $\mu_k$). The trajectories of the $\mu_k$ should be compared with those in Figure 1, where no adaptive biasing force is applied (note that the axis scales differ between the two plots).

Second, we monitor the convergence of the biasing potential. To this end, we run a long simulation and store the biasing potential every $N_{\mathrm{cvg}}$ iterations. The distance between the current bias and the bias stored $N_{\mathrm{cvg}}$ iterations earlier is measured by

 δt= ⎷infc∈RNz−1∑i=0(At,i−At−Ncvg,i−c)2, (19)

where $A_{t,i}$ denotes the value of the bias at iteration $t$ in bin $i$. Since the potential is defined only up to an additive constant, we consider the optimal shift constant $c$ which minimizes the mean-squared distance between the two profiles. An elementary computation shows that this constant is equal to the difference between the averages of $A_{t,\cdot}$ and $A_{t-N_{\mathrm{cvg}},\cdot}$. We finally renormalize this distance as

$$\varepsilon_t=\frac{\delta_t}{\sqrt{\sum_{i=0}^{N_z-1}A_{t,i}^{2}}}.$$

The relative distance $\varepsilon_t$ as a function of the iteration index $t$ is plotted in Figure 4. Correct approximations of the bias are obtained after a few multiples of $N_{\mathrm{cvg}}$ iterations (the relative error being already lower than 0.1 at the first convergence check). We emphasize again that we did not optimize the proposal moves in order to reach the fastest convergence of the bias. It is very likely that better convergence could be obtained by carefully tuning the parameters of the proposal function, or by resorting to proposals of a different type.
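The shift-invariant distance (19) and its normalized version can be computed directly: since the infimum over $c$ is attained at the difference of the two profile averages, it suffices to center the difference of the profiles. A minimal sketch (function name is ours):

```python
import numpy as np

def relative_bias_distance(A_t, A_prev):
    """Normalized shift-invariant distance between two tabulated bias
    profiles, following (19): delta_t is the Euclidean norm of the
    mean-centered difference, here divided by the norm of the current bias."""
    A_t = np.asarray(A_t, dtype=float)
    d = A_t - np.asarray(A_prev, dtype=float)
    delta = np.sqrt(np.sum((d - d.mean()) ** 2))  # optimal c = mean(d)
    return float(delta / np.sqrt(np.sum(A_t ** 2)))

# Two profiles differing by a pure additive constant are at distance 0.
print(relative_bias_distance([1.0, 2.0, 3.0], [3.0, 4.0, 5.0]))  # 0.0
```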

In Figure 5, we plot the free energies associated to the four reaction coordinates mentioned above, as estimated by the adaptive algorithms. For $\xi(\theta)=-\log p(\theta|y)$, we used an ABP algorithm; for $\xi(\theta)=\mu_1$ and $\xi(\theta)=q_1$, we used ABF. Recall that the so-obtained bias is minus the marginal posterior log-density of $\xi(\theta)$. This is why the three important modes in the parameter $\mu_1$ can be read from the corresponding bias in Figure 5. Note also that there is a lower bound on the admissible values of minus the log-posterior density, hence the plateau of the corresponding bias at low values of this reaction coordinate, which correspond to unexplored regions.

#### 5.1.2 Efficiency of the biasing procedure

We now discuss the results of the MCMC algorithm targeted at the biased posterior distribution, using the free energies computed above. In Figure 6, we observe that all the biased dynamics are much more successful in terms of mode switchings than the unbiased dynamics (see Figure 1). More precisely, the dynamics biased by the free energy associated with $\xi(\theta)=\beta$ is the most successful in terms of switchings, but the dynamics with $\xi(\theta)=-\log p(\theta|y)$ performs correctly as well. The dynamics with $\xi(\theta)=q_1$ seems less successful. In the case $\xi(\theta)=\mu_1$, one component mean is forced to visit the whole range of values; the lowest mode is not very well visited here.

The efficiency factors are presented in Table 1. They are rather large, which shows that the importance sampling procedure does not yield a degenerate sample. The choice $\xi(\theta)=-\log p(\theta|y)$ is the best one, but $\xi(\theta)=q_1$ and $\xi(\theta)=\beta$ give comparable and satisfactory results as well. The choice $\xi(\theta)=\mu_1$, on the other hand, is a poor one in this case.

In view of these results, it seems that $\xi(\theta)=-\log p(\theta|y)$ is the best choice in terms of efficiency factor, with the problem however that we had to slightly modify the bias for the lowest values of the reaction coordinate, because of the too sharp variations of the bias in this region. (Our numerical experience is indeed that the bias obtained from minus the log-posterior density is sometimes difficult to use directly.) On the other hand, the procedure is more automatic for $\xi(\theta)=q_1$ and $\xi(\theta)=\beta$, the latter reaction coordinate being a much better choice when it comes to mode switchings.

We now focus on $\xi(\theta)=\beta$. As explained in the introduction, a good sampler should visit all the possible labellings of the parameter space. This implies in particular that the marginal posterior distributions of the simulated component parameters should be nearly identical across components. This is clearly the case here; see the scatter plots of a regular sub-sample of the simulated component parameter pairs in Figure 7. The top left picture in Figure 7 also demonstrates that the biased dynamics indeed samples the values of $\beta$ uniformly over the chosen interval $[z_{\min}, z_{\max}]$.

Finally, Figure 8 illustrates why the reaction coordinate $\xi(\theta)=\beta$ allows for escaping from local modes; see the discussion in Section 4.2. Each plot represents a sub-sample of the simulated component parameter pairs (where one subscript denotes the iteration index while the other labels the components), restricted to $\beta$ lying in three successive intervals, from left to right. Since the bias function depends only on $\beta$, these plots are rough approximations of the posterior distribution conditional on $\beta$ lying in each of these intervals.