
An Adaptive Interacting Wang-Landau Algorithm for Automatic Density Exploration

Abstract

While statisticians are well-accustomed to performing exploratory analysis in the modeling stage of an analysis, the notion of conducting preliminary general-purpose exploratory analysis in the Monte Carlo stage (or more generally, the model-fitting stage) of an analysis is an area which we feel deserves much further attention. Towards this aim, this paper proposes a general-purpose algorithm for automatic density exploration. The proposed exploration algorithm combines and expands upon components from various adaptive Markov chain Monte Carlo methods, with the Wang-Landau algorithm at its heart. Additionally, the algorithm is run on interacting parallel chains – a feature which both decreases computational cost and stabilizes the algorithm, improving its ability to explore the density. Performance of this new parallel adaptive Wang-Landau (PAWL) algorithm is studied in several applications. Through a Bayesian variable selection example, the authors demonstrate the convergence gains obtained with interacting chains. The ability of the algorithm’s adaptive proposal to induce mode-jumping is illustrated through a Bayesian mixture modeling application. Lastly, through a 2D Ising model, the authors demonstrate the ability of the algorithm to overcome the high correlations encountered in spatial models. The appendices contain the full algorithmic description in pseudo-code, a tri-modal toy example and remarks on the convergence of the proposed algorithm.


1 Introduction

As improvements in technology introduce measuring devices capable of capturing ever more complex real-world phenomena, the accompanying models used to understand such phenomena grow accordingly. While linear models under the assumption of Gaussian noise were the hallmark of early 20th century statistics, the past several decades have seen an explosion in statistical models which produce complex and high-dimensional density functions for which simple, analytical integration is impossible. This growth was largely fueled by renewed interest in Bayesian statistics accompanying the Markov chain Monte Carlo (MCMC) revolution in the 1990s. With the computational power to explore the posterior distributions arising from Bayesian models, MCMC allowed practitioners to build models of increasing size and nonlinearity.

As a core component of many of the MCMC algorithms discussed later, we briefly recall the Metropolis-Hastings algorithm. With the goal of sampling from a density $\pi$, the algorithm generates a Markov chain $(X_t)_{t \ge 0}$ with invariant distribution $\pi$. From a current state $X_{t-1}$, a candidate state $X^\star$ is sampled using a proposal density $q_\sigma(\cdot \mid X_{t-1})$ parametrized by $\sigma$. The proposed state is accepted as the new state $X_t$ of the chain with probability

$$\alpha(X_{t-1}, X^\star) = \min\left( 1,\ \frac{\pi(X^\star)\, q_\sigma(X_{t-1} \mid X^\star)}{\pi(X_{t-1})\, q_\sigma(X^\star \mid X_{t-1})} \right),$$

and if it is rejected, the new state $X_t$ is set to the previous state $X_{t-1}$. From this simple algorithmic description, it is straightforward to see that if $X_{t-1}$ is in a local mode and the proposal density has not been carefully chosen to propose samples from distant regions, the chain will become stuck in the current mode. This is due to the rejection of samples proposed outside the mode, underscoring the importance of ensuring $q_\sigma$ is intelligently designed.
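To make the notation concrete, here is a minimal sketch of this update in R, assuming a symmetric Gaussian random-walk proposal so that the $q_\sigma$ terms cancel in the acceptance ratio; the function and variable names are ours, not those of any package.

# One random-walk Metropolis-Hastings update (symmetric proposal).
# 'log_target' is any function returning log pi(x) up to an additive constant.
mh_step <- function(x, log_target, sigma) {
  xp <- x + rnorm(length(x), 0, sigma)   # propose a local move
  if (log(runif(1)) < log_target(xp) - log_target(x)) xp else x
}

# Example: with a well-separated bimodal target, a small-step chain started
# in one mode essentially never visits the other, illustrating mode-sticking.
bimodal <- function(x) log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))
chain <- numeric(5000); chain[1] <- -5
for (t in 2:5000) chain[t] <- mh_step(chain[t - 1], bimodal, sigma = 1)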

Though standard MCMC algorithms such as the Metropolis-Hastings algorithm and the Gibbs sampler have been studied thoroughly and their convergence to the target distribution is ensured under weak assumptions, many applications introduce distributions which cannot be sampled easily by these algorithms. Multiple reasons can lead to failure in practice, even if long-run convergence is guaranteed; the question then becomes whether or not the required number of iterations to accurately approximate the density is reasonable given the currently available computational power. Among these reasons, let us cite a few that will be illustrated in later examples: the probability density function might be highly multimodal, in which case the chain can get stuck in local modes. Alternatively or additionally, it might be defined on a high-dimensional state space with strong correlations between the components, in which case the proposal distributions (and in particular their large covariance matrices) are very difficult to tune manually. These issues lead to error and bias in the resulting inference, and may be detected through convergence monitoring techniques (see, e.g., Robert and Casella (2004)). However, even when convergence is monitored, it is possible that entire modes of the posterior are missed. To address these issues, we turn to a burgeoning class of Monte Carlo methods which we refer to as “exploratory algorithms.”

In the following section, we discuss the traits that allow exploratory MCMC algorithms to perform inference in multimodal, high-dimensional distributions, connecting these traits to existing exploratory algorithms in the process. In Section 3, we detail one of these, the Wang-Landau algorithm, and propose several novel improvements that make it more adaptive, hence easier to use, and also improve convergence. Section 4 applies the proposed algorithm to variable selection, mixture modeling and spatial imaging, before Section 5 concludes.

2 Exploratory Algorithms

As emphasized by Schmidler (2011), there are two distinct goals of existing adaptive algorithms. Firstly, algorithms which adapt the proposal according to past samples are largely exploitative, in that they improve sampling of features already seen. However, modes or features not yet seen by the sampler might be quite different from the previously explored region, and as such adaptation might prevent adequate exploration of alternate regions of the state space. As an attempted solution to this problem, Craiu et al. (2009) suggest adapting regionally, with parallel chains used to perform the adaptation. Secondly, there exists a set of adaptive algorithms whose goal is to adapt in such a way as to encourage density exploration. These include, for instance, the equi-energy sampler (Kou et al., 2006), parallel tempering (Swendsen and Wang, 1986; Geyer, 1991), and the Wang-Landau (Wang and Landau, 2001a, b; Liang, 2005; Atchadé and Liu, 2010) algorithms among others. The algorithm developed here fits into the latter suite of tools, whose goal is to explore the target density, particularly distant and potentially unknown modes.

Although the aforementioned algorithms have proven efficient for specific challenging inference problems, they are not necessarily designed to be generic, and it is often difficult and time-consuming for practitioners to learn and code these algorithms merely to test a model. As such, while statisticians are accustomed to exploratory data analysis, we believe that there is room for generic exploratory Monte Carlo algorithms to learn the basic features of the distribution or model of interest, particularly the locations of modes and regions of high correlation. These generic algorithms would ideally be able to deal with discrete and continuous state spaces and any associated distribution of interest, and would require as few parameters to tune as possible, such that users can use them before embarking on time-consuming, tailor-made solutions designed to estimate expectations with high precision. In this way one may perform inference and compare between a wide range of models without building custom-purpose Monte Carlo methods for each.

We first describe various ideas that have been used to explore highly-multimodal densities, and then describe recent works aimed at automatically tuning algorithmic parameters of MCMC methods, making them able to handle various situations without requiring much case-specific work from the user.

2.1 Ability to Cross Low-Density Barriers

The fundamental problem of density exploration is settling into local modes, with an inability to cross low-density regions to find alternative modes. For densities which are highly multi-modal, or “rugged,” one can employ tempering strategies, sampling instead from a distribution proportional to $\pi^{1/T}$ with temperature $T \ge 1$. Through tempering, the peaks and valleys of $\pi$ are smoothed, allowing easier exploration. This is the fundamental idea behind parallel tempering, which employs multiple chains at different temperatures; samples are then swapped between chains, using highly tempered chains to assist in the exploration of the untempered chain (Geyer, 1991). Marinari and Parisi (1992) subsequently proposed simulated tempering, which dynamically moves a single chain up or down the temperature ladder. One may also fit tempering within a sequential Monte Carlo approach, whereby samples are first obtained from a highly tempered distribution; these samples are transitioned through a sequence of distributions converging to $\pi$ using importance sampling and moves with a Markov kernel (Neal, 2001; Del Moral et al., 2006). However, using tempering strategies with complex densities, one must be careful of phase transitions, where the density transforms considerably across a given temperature value.

A related class of algorithms works by partitioning the state space along the energy function $E(x) = -\log \pi(x)$. The idea of slicing, or partitioning, along the energy function is the hallmark of several auxiliary variable sampling methods, which iteratively sample an auxiliary variable $u$ given the current state $x$, then $x$ given $u$. This is the fundamental idea behind the Swendsen–Wang algorithm (Swendsen and Wang, 1987; Edwards and Sokal, 1988) and related algorithms (e.g. Besag and Green (1993); Higdon (1998); Neal (2003)). The equi-energy sampler (Kou et al., 2006; Baragatti et al., 2012), in contrast to the above auxiliary variable methods, begins by sampling from a highly tempered distribution; once convergence is reached, a new reduced-temperature chain is run with updates from a mixture of Metropolis moves and exchanges of the current state with the value of a previous chain in the same energy band. The process is continued until the temperature reaches $1$ and the invariant distribution of the chain is the target of interest. As such, this algorithm works through a sequence of tempered distributions, using previous distributions to create intelligent mode-jumping moves along an equal-energy set.

In a similar vein, the Wang-Landau algorithm (Wang and Landau, 2001a, b) also partitions the state space $\mathcal{X}$ along a reaction coordinate $\xi : \mathcal{X} \to \mathbb{R}$, typically the energy function $\xi(x) = -\log \pi(x)$, resulting in a partition $\mathcal{X} = \bigcup_{i=1}^{d} \mathcal{X}_i$. The algorithm generates a time-inhomogeneous Markov chain that admits an invariant distribution $\pi_{\theta_t}$ at iteration $t$, instead of the target distribution $\pi$ itself as e.g. in a standard Metropolis-Hastings algorithm. The distribution $\pi_{\theta_t}$ is designed such that the generated chain equally visits the various regions $\mathcal{X}_i$ as $t \to \infty$. Because the Wang-Landau algorithm lies at the heart of our proposed algorithm, it is extensively described in Section 3.

It is worth discussing a similar, recently proposed algorithm which combines Markov chain Monte Carlo and free energy biasing (Chopin et al., 2012) and its sequential Monte Carlo counterpart (Chopin and Jacob, 2010). The central idea of the latter is to explore a sequence of distributions, successively biased according to a reaction coordinate in a similar manner. However, we have found the method to be largely dependent on selecting a well-chosen initial distribution $\pi_0$, as is usually the case with sequential Monte Carlo methods for static inference. If the initial distribution is not chosen to be flatter than the target distribution, which is possibly the case since the regions of interest with respect to the target distribution are a priori unknown, the efficiency of the SMC methods relies mostly on the move steps within the particle filter, which are themselves Metropolis–Hastings or Gibbs moves.

2.2 Adaptive Proposal Mechanism

Concurrent with the increasing popularity of exploratory methods, the issue of adaptively fine-tuning MCMC algorithms has also seen considerable growth since the foundational paper of Haario et al. (2001), including a series of conferences dedicated solely to the problem (namely, Adap’ski 1 through 3, among others); see the reviews of Andrieu and Thoms (2008) and Atchadé et al. (2011) for more details. While the de facto standard has historically been hand-tuning of MCMC algorithms, this newer work focuses on automated tuning, resulting in a new class of methods called adaptive MCMC.

The majority of the existing literature focuses on creating intelligent proposal distributions for an MCMC sampler. The principal idea is to exploit past samples to induce better moves across the state space by matching moments of the proposal and past samples, or by encouraging a particular acceptance rate of the sampler. The raison d’être of these algorithms is that tuning MCMC algorithms by hand is both time-consuming and prone to inaccuracies. By automating the selection of the algorithm’s parameters, practitioners might save considerable time in their analyses. This feature is pivotal in an automated density exploration algorithm. Due to its exploratory nature, it is likely that the practitioner might not have complete knowledge of even the scale of the density support; as a result, having a proposal distribution which adapts to the density at hand is a crucial step in the automation process.

One must be careful in selecting the type of adaptation mechanism employed to encourage exploration, rather than simply exploiting previously explored modes. For instance, tuning a proposal covariance to a previously visited mode might prevent the algorithm from reaching as yet unexplored modes in the direction of the current mode’s minor axis. Additionally, when combined with a progressively biased distribution as in the Wang-Landau algorithm, it is desirable to have a proposal which first samples what it sees well, then later grows in step size to better explore the flattened (biased) distribution.

3 Proposed Algorithm

We now develop our proposed algorithm. After recalling the Wang-Landau algorithm, which constitutes the core of our method, we describe three improvements: an adaptive binning strategy to automate the difficult task of partitioning the state space, the use of interacting parallel chains to improve the convergence speed and the use of computational resources, and finally the use of adaptive proposal distributions to encourage exploration as well as to reduce the number of algorithmic parameters. We detail at the end of the section how to use the output of the algorithm, which we term parallel adaptive Wang-Landau (PAWL), to answer the statistical problem at hand.

3.1 The Wang-Landau Algorithm

As previously mentioned, the Wang-Landau algorithm generates a time-inhomogeneous Markov chain that admits a distribution $\pi_{\theta_t}$ as the invariant distribution at iteration $t$. The biased distribution targeted by the algorithm at iteration $t$ is based on the target distribution $\pi$, and modified such that a) the generated chain visits all the sets $\mathcal{X}_i$ equally, that is, the proportion of visits to each set converges to $1/d$ when $t$ goes to infinity; and b) the restriction of the modified distribution to each set coincides with the restriction of the target distribution to this set, up to a multiplicative constant. The modification (a) is crucial, as inducing uniform exploration of the sets is the biasing mechanism which improves exploration; in fact similar strategies are used in other fields, including combinatorial optimization (Wei et al., 2004). Ideally the biased distribution would not depend on $t$, and would be available analytically as:

$$\pi_\psi(x) \propto \sum_{i=1}^{d} \frac{\pi(x)}{\psi(i)}\, \mathbb{1}_{\mathcal{X}_i}(x) \qquad (1)$$

where $\psi(i) = \int_{\mathcal{X}_i} \pi(x)\, dx$ and $\mathbb{1}_{\mathcal{X}_i}(x)$ is equal to $1$ if $x \in \mathcal{X}_i$ and $0$ otherwise. Checking that using $\pi_\psi$ as the invariant distribution of an MCMC algorithm would validate points a) and b) is straightforward. Figure 1 illustrates a univariate target distribution and its corresponding biased distribution under two different partitions of the state space.

Figure 1: Probability density functions for a univariate distribution $\pi$ and its biased version $\pi_\psi$ when partitioning the state space along the $x$-axis (middle) and along the log density $\log \pi(x)$ (right). The left-most plot also shows the partitioning of the state space. The biasing is done such that the integral $\int_{\mathcal{X}_i} \pi_\psi(x)\, dx$ is the same for all $i$ (areas under the curve for each set) and such that $\pi_\psi$ and $\pi$ coincide on each set $\mathcal{X}_i$, up to a multiplicative constant.

In practical situations, however, the integrals $\psi(i)$ are not available, hence we wish to plug estimates $\theta(i)$ of $\psi(i)$ into Equation (1). The Wang-Landau algorithm is an iterative algorithm which jointly generates a sequence of estimates $(\theta_t(i))_{t \ge 1}$ for all $i \in \{1, \dots, d\}$ and a Markov chain $(X_t)_{t \ge 1}$, such that when $t$ goes to infinity, $\theta_t(i)$ converges to $\psi(i)$ and consequently, the distribution of $X_t$ converges to $\pi_\psi$. We denote by $\pi_{\theta_t}$ the biased distribution obtained by replacing $\psi(i)$ by its estimate $\theta_t(i)$ in Equation (1). Note that the normalizing constant of $\pi_{\theta_t}$ is now unknown. A simplified version of the Wang-Landau algorithm is given in Algorithm 1.

1:  Partition the state space $\mathcal{X}$ into $d$ regions $\{\mathcal{X}_1, \dots, \mathcal{X}_d\}$ along a reaction coordinate $\xi$.
2:  First, $\forall i \in \{1, \dots, d\}$ set $\theta_0(i) \leftarrow 1/d$.
3:  Choose a decreasing sequence $(\gamma_t)$, typically $\gamma_t = 1/t$.
4:  Sample $X_0$ from an initial distribution $\pi_0$.
5:  for $t = 1$ to $T$ do
6:     Sample $X_t$ from $P_{\theta_{t-1}}(X_{t-1}, \cdot)$, a transition kernel with invariant distribution $\pi_{\theta_{t-1}}$.
7:     Update the bias: $\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_t\, \mathbb{1}_{\mathcal{X}_i}(X_t)$.
8:     Normalize the bias: $\theta_t(i) \leftarrow \theta_t(i) / \sum_{j=1}^{d} \theta_t(j)$.
9:  end for
Algorithm 1 Simplified Wang-Landau Algorithm

The rationale behind the update of the bias is that if the chain is in the set $\mathcal{X}_i$, the probability of remaining in $\mathcal{X}_i$ should be reduced compared to the other sets through an increase in the associated bias $\theta_t(i)$. Therefore the chain is pushed towards the sets that have been visited less during the previous iterations, improving the exploration of the state space so long as the partition is well chosen. While this biasing mechanism adds cost to each iteration of the algorithm, the tradeoff is improved exploration. In step 6, the transition kernel is typically a Metropolis-Hastings kernel, due to the lack of conjugacy brought about by biasing.
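For concreteness, the following R sketch implements Algorithm 1 for a univariate target, with bins along the energy $-\log \pi(x)$; all names, the stepsize $\gamma_t = 1/t$ and the Metropolis kernel are illustrative choices, not a prescription.

# Simplified Wang-Landau (Algorithm 1): one chain, deterministic stepsize.
# 'log_target' returns log pi(x); 'bin_breaks' partitions the energy -log pi(x).
wang_landau <- function(log_target, bin_breaks, n_iter = 1e5, sigma = 1, x0 = 0) {
  d <- length(bin_breaks) - 1
  log_theta <- rep(log(1 / d), d)                      # log bias estimates
  bin_of <- function(lp) min(d, max(1, findInterval(-lp, bin_breaks)))
  x <- x0; lp <- log_target(x)
  for (t in 1:n_iter) {
    # Metropolis step targeting the biased density pi(x) / theta(bin(x))
    xp <- x + rnorm(1, 0, sigma); lpp <- log_target(xp)
    if (log(runif(1)) < (lpp - log_theta[bin_of(lpp)]) -
                        (lp  - log_theta[bin_of(lp)])) { x <- xp; lp <- lpp }
    i <- bin_of(lp)
    log_theta[i] <- log_theta[i] + 1 / t               # penalize current bin
    log_theta <- log_theta - log(sum(exp(log_theta)))  # normalize
  }
  log_theta                                            # log estimates of psi(i)
}

# Example on the bimodal target used earlier: theta(i) estimates the mass
# of each energy bin, and the biased chain visits both modes.
bimodal <- function(x) log(0.5 * dnorm(x, -5) + 0.5 * dnorm(x, 5))
log_psi_hat <- wang_landau(bimodal, bin_breaks = seq(1, 30, length.out = 11))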

In this simplified form the Wang-Landau algorithm reduces to standard stochastic approximation, where the term $\gamma_t$ decreases at each iteration. The algorithm as given in Wang and Landau (2001a, b) uses a more sophisticated learning rate which does not decrease deterministically, but instead only when a certain criterion is met. This criterion, referred to as the “flat histogram” criterion, is met when, for all $i \in \{1, \dots, d\}$, $\nu(i)$ is close enough to $1/d$, where we denote by $\nu(i)$ the proportion of visits of the chain to the set $\mathcal{X}_i$ since the last time the criterion was met. Hence we introduce a real number $c > 0$ to control the distance between $\nu(i)$ and $1/d$, and an integer $\kappa$ to count the number of criteria already met. We describe the generalized Wang-Landau in Algorithm 2.

1:  Partition the state space $\mathcal{X}$ into $d$ regions $\{\mathcal{X}_1, \dots, \mathcal{X}_d\}$ along a reaction coordinate $\xi$.
2:  First, $\forall i \in \{1, \dots, d\}$ set $\theta_0(i) \leftarrow 1/d$ and $\nu(i) \leftarrow 0$; set $\kappa \leftarrow 0$.
3:  Choose a decreasing sequence $(\gamma_\kappa)$, typically $\gamma_\kappa = 2^{-\kappa}$.
4:  Sample $X_0$ from an initial distribution $\pi_0$.
5:  for $t = 1$ to $T$ do
6:     Sample $X_t$ from $P_{\theta_{t-1}}(X_{t-1}, \cdot)$, a transition kernel with invariant distribution $\pi_{\theta_{t-1}}$.
7:     Update the proportions: $\nu(i) \leftarrow \nu(i) + \mathbb{1}_{\mathcal{X}_i}(X_t)$.
8:     if “flat histogram”: $\max_{i \in \{1, \dots, d\}} \left| \nu(i) / \sum_{j} \nu(j) - 1/d \right| < c/d$ then
9:        Set $\kappa \leftarrow \kappa + 1$.
10:        Reset $\nu(i) \leftarrow 0$ for all $i$.
11:     end if
12:     Update the bias: $\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_\kappa\, \mathbb{1}_{\mathcal{X}_i}(X_t)$.
13:     Normalize the bias: $\theta_t(i) \leftarrow \theta_t(i) / \sum_{j=1}^{d} \theta_t(j)$.
14:  end for
Algorithm 2 Wang-Landau Algorithm

When $c$ is set to a low value, the algorithm must explore the various regions such that the frequency of visits to each region is approximately $1/d$ before the learning rate is decreased. Also, the algorithm may be further generalized to target a desired frequency $\phi(i)$ for each set instead of the same frequency for every set; while such strategies may be useful, as demonstrated in the following section, for notational simplicity we focus on the case $\phi(i) = 1/d$. As already mentioned, to answer the general question of exploring the support of a target density $\pi$, the default choice for the reaction coordinate is the energy function $\xi(x) = -\log \pi(x)$, which has the benefit of being one-dimensional regardless of the dimension of the state space $\mathcal{X}$. However, for specific models other reaction coordinates have been used, such as one (or more) of the components of $x$ or a linear combination of components of $x$. In the applications in Section 4 we discuss the use of alternative reaction coordinates further.
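In code, the flat histogram criterion is a one-line check; a sketch in R, where the argument c plays the role of the user-chosen constant introduced above:

# Flat-histogram check: visit proportions nu must all be close to 1/d.
flat_histogram <- function(nu, c = 0.5) {
  d <- length(nu)
  all(abs(nu / sum(nu) - 1 / d) < c / d)
}
# When TRUE: increment kappa (hence decrease gamma_kappa) and reset nu to zero.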

We now propose improvements to the Wang-Landau algorithm to increase its flexibility and efficiency.

3.2 A Novel Adaptive Binning Strategy

The Wang-Landau and equi-energy sampler algorithms are known to perform well if the bins, or partitions of the one-dimensional reaction coordinate $\xi$, are well chosen. However, depending on the problem it might be difficult to choose the bins to optimize sampler performance. A typical empirical approach to deal with this issue is to first run, for example, an adaptive MCMC algorithm to find at least one mode of the target distribution. The generated sample and the associated target density evaluations determine a first range of the target density values which can be used to initialize the bins. At this point the user can choose a wider range of target density values (e.g. by multiplying the range by 2), in order to allow for a wider exploration of the space. Within this initial range, one must still decide the number of bins.

Due to difficulties with selecting the bins, it has been suggested that one should adaptively compare adjacent bins, splitting a bin if the corresponding estimate $\theta(i)$ is significantly larger than a neighboring value (Schmidler, 2011). Because each $\theta(i)$ is a given bin’s normalizing constant, we feel it is more important to maintain uniformity within a bin to allow easy within-bin movement. Our proposed approach to achieve this “flatness” is to look at the distribution of the realized reaction coordinate values within each bin. Figure 2 illustrates this distribution on an artificial histogram. The plot of Figure 2a shows a situation where, within one bin, the distribution might be strongly skewed towards one side. In this artificial example, very few points have visited the left side of the bin, which suggests that moving from this bin to the left neighboring bin might be difficult.

We propose to consider the ratio of the number of points on the left side of the middle of the bin (dashed line) over the total number of points within the bin, as a very simple proxy for the discrepancy of the chains within one bin (see e.g. Niederreiter (1992) for much more sophisticated discrepancy measures). In broad outline, if this ratio were around $1/2$, the within-bin histogram would be roughly uniform. On the contrary, the ratio corresponding to Figure 2a is far below $1/2$. Our strategy is to split the bin if this ratio goes below a given threshold; two new bins are created, corresponding to the left side and the right side of the former bin, and each bin is assigned a weight of $\theta(i)/2$ where $\theta(i)$ is the weight of the former bin. These provide starting values for the estimation of the weights of the new bins during the following iterations of the algorithm. Note also that the desired frequency of visits to each of the new bins, which was for instance equal to $1/d$ before the split, has to be specified as well. In the numerical experiments, we set the desired frequency of each new bin to one half of the desired frequency of the former bin. Figure 2b shows the distribution of samples within the two new bins. The resulting histogram is not uniform, yet exhibits a more even distribution within each bin – a feature which is expected to help the chain move from this bin to the left neighboring bin. The threshold could be set closer to $1/2$, which would result in more splits and therefore more bins.
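A sketch of this splitting rule in R; the default threshold here is purely illustrative:

# Should a bin [lo, hi) be split? 'xi_vals' holds the reaction coordinate
# values observed in that bin since the last check.
should_split <- function(xi_vals, lo, hi, threshold = 0.25) {
  if (length(xi_vals) == 0) return(FALSE)
  left_ratio <- mean(xi_vals < (lo + hi) / 2)  # mass left of the middle
  left_ratio < threshold                       # strongly skewed: split in two
}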

In practice it is not necessary to check whether the bins have to be split at every iteration. Our strategy is to check every $k$-th iteration, for a user-chosen $k$, until the flat histogram criterion is met for the first time. When it is met, it means that the chains can move easily between the bins, and hence the bins can be kept unchanged for the remaining iterations. Finally, when implementing the automatic binning strategy for discrete distributions, one must ensure that a new bin corresponds to actual points in the state space. For example, if the bins are along the energy values and the state space is finite, there are certainly intervals of energy to which no state corresponds, and which would therefore never be reached. Section 4 demonstrates the proposed adaptive binning strategy in practice.

(a) Unique bin before the split
(b) Two bins after the split
Figure 2: Artificial histograms of the log target density values associated with the chains generated by the algorithm, within a single bin (left) and within two bins created by splitting the former bin (right).

In addition to allowing for splitting of bins, it is also important to allow the range of bins to extend if new states are found outside the current range. That said, one must differentiate between the two extremes of the reaction coordinate. For example, if $\xi(x) = -\log \pi(x)$, then one might not wish to add more low-density (high-energy) bins, which would induce the sampler to explore further and further into the tails. However, if one finds a new high-density (low-energy) mode beyond the energy range previously seen, then the sampler might become stuck in this new mode. In this case, we propose to extend the first bin, corresponding to the lowest level of energy, to always include the lowest observed values. The adaptive partition of the state space takes the following form at time $t$:

$$\mathcal{X}_1(t) = \{x : \xi(x) < e_1(t)\}, \quad \mathcal{X}_i(t) = \{x : e_{i-1}(t) \le \xi(x) < e_i(t)\} \ (1 < i < d_t), \quad \mathcal{X}_{d_t}(t) = \{x : \xi(x) \ge e_{d_t-1}(t)\},$$

where $e_1(t) < \dots < e_{d_t-1}(t)$ defines the limits of the inner bins at time $t$, which is the result of the initial bin limits and possible splits between time $0$ and time $t$. As such, since the first bin is unbounded below, if new low-energy values are found the bin is automatically widened. If this results in unequal exploration across the reaction coordinate, then the adaptive bin-splitting mechanism will automatically split this newly widened bin.

3.3 Parallel Interacting Chains

We propose to generate multiple chains instead of a single one to improve computational scalability through parallelization as well as particle diversity. The use of interacting chains has attracted much interest in recent years, with the multiple chains used to create diverse proposals (Casarin et al., 2011), to induce long-range equi-energy jumps (Kou et al., 2006), and to generally improve sampler performance; see Atchadé et al. (2011), Brockwell et al. (2010), and Byrd (2010) for recent developments. The use of parallelization is not constrained to multiple chains, however, and has also been employed to speed up the generation of a single chain through pre-fetching (Brockwell, 2006).

Let $N$ be the desired number of chains. We follow Algorithm 2, with the following modifications. First we generate $N$ starting points $X_0^{(1)}, \dots, X_0^{(N)}$ independently from an initial distribution $\pi_0$ (Algorithm 2 line 4). Then at iteration $t$, instead of moving one chain using the transition kernel $P_{\theta_{t-1}}$, we move the $N$ chains using the same transition kernel, associated with the same bias $\theta_{t-1}$ (Algorithm 2 line 6). We emphasize that the bias is common to all chains, which makes the proposed method different from running $N$ Wang-Landau chains entirely in parallel. The proportions $\nu(i)$ are updated using all the chains, simply by replacing the indicator function $\mathbb{1}_{\mathcal{X}_i}(X_t)$ by the mean $N^{-1} \sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}(X_t^{(j)})$, that is, the proportion of chains currently in set $\mathcal{X}_i$ (Algorithm 2 line 7). Likewise the update of the bias uses all the chains, again replacing the indicator function by the proportion of chains currently in a given set (Algorithm 2 line 12). We have therefore replaced indicator functions in the Wang-Landau algorithm by the law of the MCMC chain associated with the current parameter. Since this law is not accessible, we perform a mean field approximation at each time step. A similar expression has recently been employed by Liang and Wu (2011) for use in parallelizing the stochastic approximation Monte Carlo algorithm (Liang et al., 2007; Liang, 2009). Note that while we have designed the chains to communicate at each iteration, such frequent message passing can be costly, particularly on graphics processing units. In such situations, one could alter the algorithm such that the chains only communicate periodically.
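In code, the only change relative to the single-chain update is that the indicator function becomes a vector of bin-occupation proportions across the $N$ chains; a sketch in R with our variable names:

# Shared bias update from all N chains at once.
# 'bins' holds the current bin index of each of the N chains.
update_bias <- function(log_theta, bins, gamma) {
  d <- length(log_theta)
  prop <- tabulate(bins, nbins = d) / length(bins)  # proportion of chains per bin
  log_theta <- log_theta + gamma * prop             # one common update
  log_theta - log(sum(exp(log_theta)))              # renormalize
}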

Our results (see Section 4) show that $N$ interacting chains run for $T$ iterations can strongly outperform a single chain run for $N \times T$ iterations, in terms of variance of the resulting estimates. Specifically, having a sample $X_t^{(1)}, \dots, X_t^{(N)}$ approximately distributed according to $\pi_{\theta_{t-1}}$ instead of a single point at iteration $t$ improves and stabilizes the subsequent estimate $\theta_t$. We explore the tradeoff between $N$ and $T$ in more detail in Section 4.

Note that, while the original single-chain Wang-Landau algorithm was not straightforward to parallelize due to its iterative nature, the proposed algorithm can strongly benefit from multiple processing units: at a given iteration the move steps can be done in parallel, as long as the results are subsequently collected to update the bias before the next iteration. Therefore if multiple processors are available, as e.g. in recent central processing units and in graphics processing units (see e.g. Lee et al. (2010); Suchard and Rambaut (2009)), the computational cost can be reduced much more than what was possible with the single-chain Wang-Landau algorithm. To summarize, the proposed use of interacting chains can both improve the convergence of the estimates, regardless of the number of available processors, and additionally benefit from multiple processors.

Finally, an additional benefit of using parallel chains is that they can start from various points, drawn from the initial distribution $\pi_0$; hence if $\pi_0$ is flat enough, the chains can start from different local modes, which improves de facto the exploration. However, we show in Section 4 that the chains still explore the space even if they start within the same mode, and hence the efficiency of the method does not rely on the choice of $\pi_0$, contrary to what we observed with sequential Monte Carlo methods. Additionally, because the sampler is attempting to explore both the state-space as well as the range of the reaction coordinate simultaneously, our parallel formulation allows the sampler to borrow strength between chains, providing for exploration of the reaction coordinate without having to move a single chain across potentially large and high-dimensional state-spaces to traverse the reaction coordinate values.

3.4 Adaptive Proposal Mechanism

As discussed earlier, it is important to automate the proposal mechanism to improve movement across the state space. A well-studied proxy for optimal movement is the algorithm’s Metropolis-Hastings acceptance rate. Too low an acceptance rate signifies the algorithm is attempting to make moves that are too large, and are therefore rejected. Too high an acceptance rate signifies the algorithm is barely moving. As such, we suggest adaptively tuning the proposal variance to encourage an acceptance rate of $0.234$ as recommended in Roberts et al. (1997), although we have found a range of target rates around this value to work well in all examples tested. The Robbins-Monro stochastic approximation update of the proposal standard deviation $\sigma_t$ is as follows:

$$\log \sigma_{t+1} = \log \sigma_t + \rho_t \left( A_t - 0.234 \right) \qquad (2)$$

where $t$ is the current iteration of the algorithm, $(\rho_t)$ is a decreasing sequence (typically $\rho_t = 1/t$), and $A_t$ is the acceptance rate (proportion of accepted moves) of the $N$ particles at iteration $t$. Through this update, the proposal variance grows when samples are accepted more often than the target rate, and shrinks when samples are rejected too often, encouraging exploration of the state space.
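A sketch of update (2) in R (names ours; the target rate defaults to $0.234$ as above):

# Robbins-Monro adaptation of the proposal standard deviation.
# 'accept_rate' is the proportion of accepted moves among the N chains.
adapt_sigma <- function(sigma, accept_rate, t, target = 0.234) {
  exp(log(sigma) + (1 / t) * (accept_rate - target))
}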

Another approach to adaptively tuning the proposal distribution is to use the following mixture of Gaussian random-walks:

$$q_t(\cdot \mid x) = (1 - \omega)\, \mathcal{N}\!\left(x,\ \frac{2.38^2}{d_x} \hat\Sigma_t\right) + \omega\, \mathcal{N}\!\left(x,\ s\, I_{d_x}\right) \qquad (3)$$

with $\hat\Sigma_t$ the empirical covariance of the chain history – an estimator of the covariance structure of the target – and $I_{d_x}$ the identity matrix, where $d_x$ is the dimension of the target space. The first component of this mixture makes the proposal adaptive and able to learn from the past, while the second component helps to explore the space. For instance, if the chain is stuck in a mode, the first component’s variance might become small, yet the second component guarantees a chance to eventually escape the mode. Hence the second component acts as a “safety net” and therefore its weight $\omega$ is small, and its standard deviation $s$ may be set large to improve mixing (Guan and Krone, 2007).

In our context where $N$ parallel chains are run together, we use all the chains to estimate the empirical covariance $\hat\Sigma_t$ at each iteration. Note that the computation of this covariance does not require the storage of the whole history of the chains and can be done at constant cost per iteration, since recurrence formulae exist to update the empirical covariance, as explained for instance in Welford (1962). The scaling $2.38^2 / d_x$ is justified by asymptotic optimality results on certain classes of models (see, e.g., Roberts et al. (1997) and Roberts and Rosenthal (2009)).
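A sketch of such a recurrence in R, updating a running mean and covariance one state at a time in the spirit of Welford (1962); names are ours, and the update assumes at least two states have already been absorbed:

# Online update of the running mean 'mu' and covariance 'Sigma' of the
# n states seen so far, after observing a new state x (no history stored).
update_moments <- function(mu, Sigma, n, x) {
  delta <- x - mu
  mu_new <- mu + delta / (n + 1)
  Sigma_new <- (Sigma * (n - 1) + tcrossprod(delta, x - mu_new)) / n
  list(mu = mu_new, Sigma = Sigma_new, n = n + 1)
}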

3.5 Using the Output to Perform Inference

While the resulting samples from the proposed algorithm PAWL are not from $\pi$, but rather from an approximation of the biased version (1), one can use importance sampling or advanced sequential Monte Carlo ideas to transform the samples into samples targeting $\pi$ (see Chopin and Jacob (2010) for details). Alternatively, the samples from the exploratory algorithm can be used to seed a more traditional MCMC algorithm, as advocated by Schmidler (2011).

The pseudo-code for PAWL, combining the parallel Wang-Landau algorithm with adaptive binning and proposal mechanisms, is given in the Appendix. Before proceeding to examples, it is important to reiterate the importance of the values $\theta(i)$. Specifically, certain choices of the reaction coordinate result in the $\theta(i)$ having inherent value. For example, it is possible in a model selection application to use the model order as $\xi$, in which case the values $\theta(i)$ could be employed to calculate Bayes factors and other quantities of interest.

4 Applications

We now demonstrate PAWL applied to three examples including variable selection, mixture modeling, and spatial imaging. A fourth pedagogical example is available in the appendices. In each application we walk through our proposed algorithm (described explicitly as Algorithm 3 in the Appendix), first running preliminary (adaptive) Metropolis-Hastings MCMC to determine the initial range for the reaction coordinate and initial values for the proposal parameters and starting state of the interacting Wang-Landau chains. This range is then increased to encourage exploration of low-density regions of the space, and an initial number of bins is specified. Once this groundwork is set, the same Metropolis-Hastings algorithm run in the preliminary stage is embedded within the PAWL algorithm.

4.1 $g$-Prior Variable Selection

We proceed by conducting variable selection on the pollution data set of McDonald and Schwing (1973), wherein mortality is related to pollution levels through 15 independent variables including mean annual precipitation, population per household, and average annual relative humidity. Measured across 60 metropolitan areas, the response variable is the age-adjusted mortality rate in the given metropolitan area. Our goal is to identify the pollution-related independent variables which best predict the response. With 15 variables, calculating the posterior probabilities of the $2^{15}$ models exactly is possible but time-consuming. We have chosen this size of data set to provide for difficult posterior exploration, yet allow a study of the convergence of $\theta_t(i)$ towards $\psi(i)$.

With an eye towards model selection, we introduce the binary indicator vector $\gamma = (\gamma_1, \dots, \gamma_{15}) \in \{0,1\}^{15}$, where $\gamma_i = 1$ means the $i$-th variable is included in the model. As such, $\gamma$ can describe all $2^{15}$ of the possible models. Consider the normal likelihood

$$y \mid \beta, \sigma^2, \gamma \sim \mathcal{N}\left( X_\gamma \beta_\gamma,\ \sigma^2 I_n \right). \qquad (5)$$

If $X_\gamma$ is the model matrix which excludes the $i$-th column if $\gamma_i = 0$, we can employ the following prior distributions for $\beta_\gamma$ and $\sigma^2$ (Zellner, 1986; Marin and Robert, 2007):

$$\beta_\gamma \mid \sigma^2, \gamma \sim \mathcal{N}_{q_\gamma + 1}\left( 0,\ g\, \sigma^2 \left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1} \right), \qquad p(\sigma^2 \mid \gamma) \propto \sigma^{-2},$$

where $q_\gamma$ represents the number of variables in the model. While selecting $g$ can be a difficult problem, we have chosen it to be very large to induce a sparse model, which is difficult to explore due to the small marginal probabilities of most variables. After integrating over the regression coefficients $\beta_\gamma$ and $\sigma^2$, the posterior density for $\gamma$ is thus

$$\pi(\gamma \mid y) \propto (g+1)^{-(q_\gamma + 1)/2} \left[ y^{\mathsf T} y - \frac{g}{g+1}\, y^{\mathsf T} X_\gamma \left(X_\gamma^{\mathsf T} X_\gamma\right)^{-1} X_\gamma^{\mathsf T} y \right]^{-n/2}. \qquad (6)$$

While we select the energy $\xi(\gamma) = -\log \pi(\gamma \mid y)$ as the reaction coordinate for our analysis, it is worth noting that many other options exist. For instance, it would be natural to consider the model saturation $q_\gamma$, which would ensure exploration across the different model sizes. However, we select the energy to emphasize the universality of using the energy function as the reaction coordinate.
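A minimal sketch of the log posterior (6) in R, up to an additive constant; the function and argument names are ours, and the intercept is kept in every model:

# Log posterior of inclusion vector 'gam' under Zellner's g-prior, eq. (6).
# X: n x 15 predictor matrix, y: response vector, g: prior hyperparameter.
log_post_gamma <- function(gam, X, y, g) {
  n <- length(y)
  Xg <- cbind(1, X[, gam == 1, drop = FALSE])  # intercept plus q_gamma columns
  yhat <- qr.fitted(qr(Xg), y)                 # projection of y onto col(Xg)
  quad <- sum(y * y) - (g / (g + 1)) * sum(y * yhat)
  -((sum(gam) + 1) / 2) * log(g + 1) - (n / 2) * log(quad)
}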

We first run a preliminary Metropolis-Hastings algorithm which flips a variable on/off at random, accepting or rejecting the flip based on the resulting posterior densities. Due to high correlation between variables, a better strategy might be to flip multiple variables at once; however, we refrain from exploring this to demonstrate PAWL’s ability to make viable even poorly designed Metropolis-Hastings algorithms. The preliminary run determines a range of energy values, which we extend slightly to create equally spaced bins. It is worth reiterating that the resulting samples generated from PAWL are from a biased version of $\pi(\gamma \mid y)$; as such, importance sampling techniques could be used to recover $\pi(\gamma \mid y)$, or the samples obtained could be used to seed a more traditional MCMC algorithm.

Due to the size of the problem, we are able to enumerate all $2^{15}$ posterior values, and hence may calculate $\psi(i)$ exactly. As such, we begin by examining the effect of the number of chains $N$ on the parallel Wang-Landau algorithm. To further focus on this aspect, we suppress adaptive binning and proposals for this example. Figure 3 shows the convergence of $\theta_t(i)$ to $\psi(i)$ for increasing values of $N$. We see that the algorithm’s convergence improves with more chains. We now examine PAWL compared to the Metropolis-Hastings algorithm (run on parallel chains) mentioned above, the latter applied to the unnormalized tempered targets $\pi^{1/T}$. Consider Figure 4; on the target distribution itself, the Metropolis-Hastings algorithm becomes stuck in high-probability regions. However, on the tempered distributions, the algorithm explores the space more thoroughly, although not to the same level as PAWL. Specifically, PAWL explores a much wider range of models, including the highest probability models, whereas the tempered distributions do not. Here the Wang-Landau algorithm as well as the Metropolis-Hastings algorithm both use the same number of chains and iterations, and their run times across 10 runs were nearly identical, indicating that the additional cost for PAWL is negligible.

Figure 3: Variable selection example: convergence of Wang-Landau for increasing numbers of chains $N$. Iterations are set such that each algorithm has the same total run time. Estimates $\log \theta_t(i)$ for each bin are shown as solid lines; true values $\log \psi(i)$ are shown as dotted lines.
Figure 4: Variable selection example: exploration of model space by PAWL and Metropolis-Hastings at several temperature settings. (a) Model saturation (proportion of non-zero variables in the model) as a function of algorithm iterations. The solid lines are the means across chains, while the shaded regions are the central range across the chains.

4.2 Mixture Modeling

Mixture models provide a challenging case of multimodality, due partly to the complexity of the model and partly to a phenomenon called “label switching” (see e.g. Frühwirth-Schnatter (2006) for a book covering Bayesian inference for these models, Diebolt and Robert (1994) and Richardson and Green (1997) for seminal papers using MCMC for mixture models, and Stephens (2000) and Jasra et al. (2005) on the label switching problem). Articles describing explorative MCMC algorithms often take these models as benchmarks, as e.g. population MCMC and SMC methods in Jasra et al. (2007), the Wang-Landau algorithm in Atchadé and Liu (2010), free energy methods in Chopin et al. (2012); Chopin and Jacob (2010), and parallel tempering with equi-energy moves in Baragatti et al. (2012).

Consider a Bayesian Gaussian mixture model, i.e. for $i = 1, \dots, n$,

$$y_i \sim \sum_{k=1}^{K} w_k\, \varphi\left(y_i;\ \mu_k,\ \lambda_k^{-1}\right),$$

where $\varphi(\cdot\,;\mu, \lambda^{-1})$ is the probability density function of the Gaussian distribution with mean $\mu$ and variance $\lambda^{-1}$, $K$ is the number of components, and $w_k$, $\mu_k$ and $\lambda_k$ are respectively the weight, the mean and the precision of component $k$. The component index $k$ is also called its label. Following Richardson and Green (1997), the prior is taken as, for $k = 1, \dots, K$,

$$\mu_k \sim \mathcal{N}(\zeta, \kappa^{-1}), \quad \lambda_k \sim \operatorname{Gamma}(\alpha, \beta), \quad \beta \sim \operatorname{Gamma}(g, h), \quad (w_1, \dots, w_K) \sim \operatorname{Dirichlet}(\delta, \dots, \delta),$$

with, e.g., $\zeta$ the midpoint of the observed data range $R$, $\kappa = R^{-2}$, $\alpha = 2$, $g = 0.2$, $h = 10/R^2$ and $\delta = 1$.

The invariance of the likelihood to permutations in the labelling of the components leads to the “label switching” problem: since there are $K!$ possible permutations of the labels, each mode necessarily has $K!$ replicates. We emphasize that this model has been thoroughly studied and is hence well-understood from a modeling point of view, but it still induces a computationally challenging sampling problem whose difficulty can be artificially increased through the number of components $K$.

Note that in this parametrization $\beta$, the rate of the Gamma prior distribution of the precisions $\lambda_k$, is estimated along with the parameters of interest $w_k$, $\mu_k$ and $\lambda_k$. Chopin et al. (2012); Chopin and Jacob (2010) suggest that $\beta$ can be used as a reaction coordinate, since a large value of $\beta$ results in a small precision and hence in a flatter posterior distribution of the other parameters, which is easier to explore than the distribution associated with smaller values of $\beta$; we refer to these articles for further exploration of this choice of reaction coordinate, and instead default to the energy $\xi(x) = -\log \pi(x)$.

We create a synthetic sample of $100$ observations from a Gaussian mixture with $K = 4$ components, equal weights, means $-3$, $0$, $3$ and $6$, and common standard deviation $0.55$, as in Jasra et al. (2005). The goal is to explore the highly multimodal posterior distribution of the $13$-dimensional parameter $(\tilde w_{1:4}, \mu_{1:4}, \lambda_{1:4}, \beta)$, where $\tilde w_k$ is the unnormalized weight: $w_k = \tilde w_k / \sum_{j=1}^{K} \tilde w_j$. Unnormalized weights may be handled straightforwardly in MCMC algorithms since they are defined on $(0, \infty)$ and not on the $K$-simplex as with the $w_k$.
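As a concrete target for such an experiment, a minimal R sketch of the unnormalized log posterior in this parametrization, under the priors stated above (all function and argument names are ours):

# Unnormalized log posterior of the K-component Gaussian mixture, in the
# parametrization (unnormalized weights wt, means mu, precisions lambda, beta).
log_post_mix <- function(wt, mu, lambda, beta, y,
                         zeta, kappa, alpha = 2, g = 0.2, h, delta = 1) {
  if (any(wt <= 0) || any(lambda <= 0) || beta <= 0) return(-Inf)
  w <- wt / sum(wt)                                      # normalized weights
  comp <- sapply(seq_along(w),
                 function(k) w[k] * dnorm(y, mu[k], 1 / sqrt(lambda[k])))
  sum(log(rowSums(comp))) +                              # log likelihood
    sum(dnorm(mu, zeta, 1 / sqrt(kappa), log = TRUE)) +  # prior on means
    sum(dgamma(lambda, alpha, rate = beta, log = TRUE)) +
    dgamma(beta, g, rate = h, log = TRUE) +
    sum(dgamma(wt, delta, rate = 1, log = TRUE))  # Gamma(delta,1) on wt gives
}                                                 # Dirichlet(delta) weights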

The proposed algorithm is compared to a Sequential Monte Carlo sampler (SMC) and a parallel adaptive Metropolis–Hastings (PAMH) algorithm, which we detail below. We admittedly use naive versions of these competitors, arguing that most improvements of these could be carried over to PAWL. For instance, a mixture of Markov kernels as suggested for the SMC algorithm in Section 3.2 of Jasra et al. (2007) can be used in the proposal distribution of PAWL; and since PAWL is a population MCMC algorithm, exchange and crossover moves could be used as well, as suggested for the Population SAMC algorithm in Liang and Wu (2011). To get a plausible range of values for the reaction coordinate of the proposed algorithm without user input, an initial adaptive MCMC algorithm is run with parallel chains for a small number of iterations. The initial points of these chains are drawn from the prior distribution of the parameters. This provides a range of log density values, from which we compute lower and upper empirical quantiles, denoted by $q_\ell$ and $q_u$ respectively. In a conservative spirit, the bins are chosen to equally divide an interval approximately twice as wide as $[q_\ell, q_u]$ into subsets. Hence the algorithm is going to explore log density values in a range that is approximately twice as large as the values initially explored. Note that we use quantiles instead of minimum and maximum values to make the method more robust.

Next, PAWL itself is run for a fixed number of iterations, starting from the terminal points of the preliminary chains. In this situation, even with only $100$ data points, most of the computational cost goes into the evaluation of the target density. This confirms that algorithmic parameters such as the number of bins do not significantly affect the overall computational cost, at least as long as the target density is not extremely cheap to evaluate. The adaptive proposal is such that it targets an acceptance rate of $0.234$. Meanwhile the PAMH algorithm using the same adaptive proposal is run with the same number of chains for more iterations, hence relying on more target density evaluations for a comparable computational cost.

Finally, the SMC algorithm is run on a sequence of tempered distributions $(\pi_t)_{t = 0, \dots, T}$, each density being defined by:

$$\pi_t(x) \propto \pi(x)^{\lambda_t}\, \pi_0(x)^{1 - \lambda_t},$$

where $\pi_0$ is an initial distribution (here taken to be the prior distribution), and $0 = \lambda_0 < \lambda_1 < \dots < \lambda_T = 1$. The number of steps $T$ and the number of particles are fixed in advance. When the Effective Sample Size (ESS) goes below a fixed proportion of the number of particles, we perform a systematic resampling and consecutive Metropolis–Hastings moves. We use a random walk proposal distribution, whose variance is taken to be $c\, \hat\Sigma$, where $\hat\Sigma$ is the empirical covariance of the particles and $c$ is a fixed constant; see Jasra et al. (2007) for more details. The parameters are chosen to induce a computational cost comparable to the other methods. However, for the SMC sampler the number of target density evaluations is a random number, since it depends on the random number of resampling steps: the computational cost is in general less predictable than using MCMC.

First we look at graphical representations of the generated samples. Figure 5 shows the resulting points projected on the $(\mu_1, \mu_2)$-plane. In this plane there are $12$ replicates of each mode, indicated by target symbols in Figure 5a. These projections do not allow one to check that all the modes were visited, since they project the $13$-dimensional target space on a $2$-dimensional space. Figure 5b shows that the adaptive MCMC method clearly misses some of the modes, while visiting many others. Figure 5c shows how the chains generated by the modified Wang-Landau algorithm easily explore the space of interest, visiting both global and local modes. To recover the main modes from these chains, we use the final values of the bias, $\theta_T(i)$, as importance weights to correct for the bias induced by the algorithm; in Figure 5c the importance weights define the transparency of each point: the darker the point, the more weight it has. Finally Figure 5d shows how the SMC sampler also puts particles in each mode; again the transparency of the points is proportional to their weights.

We now turn to more quantitative measures of the error made on marginal quantities. Since the component means admit identical modes around $-3$, $0$, $3$ and $6$, we know that their overall mean is approximately equal to $1.5$. We then compute the following error measurement:

$$\text{error} = \sum_{k=1}^{4} \left| \hat\mu_k - 1.5 \right|,$$

where $\hat\mu_k$ is the mean of the generated sample of $\mu_k$ (taking the weights into account for PAWL and SMC). Table 1 shows the results averaged over independent runs: the means of each component, the error defined above and the (wall-clock) run times obtained using the same CPU, along with their standard deviations. The results highlight that in this context PAWL gives more precise results than SMC, for the same or less computational cost; the comparison between parallel MCMC and SMC confirms the results obtained in Jasra et al. (2007). The small benefit of PAWL over PAMH can be explained by considering the symmetry of the posterior distribution: even if some modes are missed by PAMH as shown in Figure 5b, the approximation of the posterior expectation might be accurate, though the corresponding variance will be higher.

Method | $\hat\mu_1$ | $\hat\mu_2$ | $\hat\mu_3$ | $\hat\mu_4$ | Error | Time (s)
PAWL | 1.42 (0.99) | 1.42 (0.58) | 1.39 (0.90) | 1.75 (0.78) | 1.50 (0.59) | 209 (1)
PAMH | 1.58 (0.81) | 1.25 (0.72) | 1.04 (1.07) | 2.09 (1.00) | 1.75 (0.80) | 233 (1)
SMC | 1.00 (1.96) | 2.99 (1.38) | 0.92 (2.27) | 1.10 (2.11) | 3.89 (1.34) | 269 (7)
Table 1: Estimation of the means of the mixture components, for the proposed method (PAWL), Parallel Adaptive Metropolis–Hastings (PAMH) and Sequential Monte Carlo (SMC), using the prior as initial distribution. Quantities averaged over independent runs for each method; standard deviations across runs in parentheses.

Next we consider a more realistic setting, where the initial distribution is not well spread over the parameter values: instead of taking the prior distribution itself, we use a similar distribution but with a larger precision hyperparameter $\kappa$ than the value computed from our simulated data set. This higher precision makes the initial distribution concentrated on a few modes, instead of being fairly flat over the whole region of interest. We keep the prior unchanged, so that the posterior is left unchanged. For PAWL and PAMH, this means that the initial points of the chains are all close to one another; and likewise for the initial particles in the SMC sampler. The results are shown in Table 2, and illustrate the degeneracy of SMC when the initial distribution is not well-chosen; though this is not surprising, it is important in terms of exploratory algorithms when one does not have prior knowledge of the region of interest. Both parallel MCMC methods give similar results as with the previous, flatter initial distribution.

Method | $\hat\mu_1$ | $\hat\mu_2$ | $\hat\mu_3$ | $\hat\mu_4$ | Error | Time (s)
PAWL | 1.16 (0.75) | 2.04 (0.50) | 1.72 (0.80) | 1.07 (1.22) | 1.48 (1.10) | 210 (1)
PAMH | 1.37 (0.73) | 1.48 (1.39) | 1.71 (0.81) | 1.44 (1.11) | 1.75 (1.01) | 234 (1)
SMC | 0.35 (2.13) | 0.82 (1.55) | 3.19 (2.41) | 1.62 (1.85) | 4.17 (1.41) | 337 (8)
Table 2: Estimation of the means of the mixture components, for the proposed method (PAWL), Parallel Adaptive Metropolis–Hastings (PAMH) and Sequential Monte Carlo (SMC), using a concentrated initial distribution. Quantities averaged over independent runs for each method; standard deviations across runs in parentheses.

Finally, we compare different algorithmic settings for the PAWL algorithm, changing the number of chains $N$ and the number of iterations $T$. The results are shown in Table 3. First we see that, even on a single CPU, the computing time is not exactly proportional to $N \times T$, the number of target density evaluations. Indeed the computations are vectorized by iteration, and hence it is typically cheaper to compute one iteration of $N$ chains than $N$ iterations of a single chain; although this would not hold for every model. We also see that the algorithm using only one chain failed to explore the modes, resulting in a huge final error. Finally we see that with many chains and correspondingly fewer iterations, the algorithm provides results of approximately the same precision as with fewer chains and more iterations. This suggests that the algorithm might be particularly interesting if parallel processing units are available, since the computational cost would then be much reduced.

Parameters | $\hat\mu_1$ | $\hat\mu_2$ | $\hat\mu_3$ | $\hat\mu_4$ | Error | Time (s)
single chain | 0.37 (3.46) | 2.01 (3.27) | 2.53 (3.04) | 0.95 (3.46) | 6.39 (1.30) | 265 (40)
intermediate $N$, $T$ | 1.42 (0.99) | 1.42 (0.58) | 1.39 (0.90) | 1.75 (0.78) | 1.50 (0.59) | 209 (1)
largest $N$, smallest $T$ | 1.51 (0.88) | 1.50 (0.90) | 1.65 (0.64) | 1.31 (0.31) | 1.22 (0.69) | 178 (2)
Table 3: Estimation of the means of the mixture components, for the proposed method (PAWL), for different values of $N$, the number of chains, and $T$, the number of iterations. Quantities averaged over independent runs for each set of parameters; standard deviations across runs in parentheses.
(a) Locations of the global modes of the posterior distribution projected on the $(\mu_1, \mu_2)$-plane
(b) Projection of the chains generated by the parallel adaptive MH algorithm on the $(\mu_1, \mu_2)$-plane
(c) Projection of the chains generated by PAWL on the $(\mu_1, \mu_2)$-plane
(d) Projection of the particles generated by the SMC algorithm on the $(\mu_1, \mu_2)$-plane
Figure 5: Mixture model example: exploration of the posterior distribution projected on the $(\mu_1, \mu_2)$-plane, using different algorithms.

4.3 Spatial Imaging

We finish our examples by identifying ice floes from polar satellite images as described in Banfield and Raftery (1992). Here the image under consideration is a gray-scale satellite image, with focus on a particular square region ($y$, Figure 6); the goal is to identify the presence and position of polar ice floes ($z$). Towards this goal, Higdon (1998) employs a Bayesian model. Basing the likelihood on similarity to the image and employing an Ising model prior, the resulting posterior distribution is

$$\pi(z \mid y) \propto \exp\left( \alpha \sum_{i} \mathbb{1}(z_i = y_i) + \beta \sum_{i \sim j} \mathbb{1}(z_i = z_j) \right).$$

The first term, the likelihood, encourages states $z$ which are similar to the original image $y$. The second term, the prior, favors states for which neighbouring pixels are equal. Here the neighbourhood relation $i \sim j$ is defined as the vertical, horizontal, and diagonal adjacencies of each (interior) pixel.

Because the prior strongly prefers large blocks of identical pixels, an MCMC method which proposes to flip one pixel at a time will fail to explore the posterior, and hence Higdon (1998) suggests a partial decoupling technique specific to these types of models. However, to illustrate PAWL’s power and universality, we demonstrate its ability to make simple one-at-a-time Metropolis-Hastings feasible in these models without more advanced decoupling methods.
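Under the posterior form displayed above, a one-pixel flip only changes the log density locally, which makes each Metropolis-Hastings step cheap; a sketch in R (variable names ours; boundary pixels handled by truncating the neighbourhood):

# One-pixel-flip Metropolis-Hastings step for the Ising-type posterior.
# z, y: binary (0/1) matrices; alpha, beta: likelihood and prior weights.
flip_pixel <- function(z, y, alpha, beta) {
  n <- nrow(z); m <- ncol(z)
  i <- sample(n, 1); j <- sample(m, 1)
  z_new <- 1 - z[i, j]
  # Neighbours: horizontal, vertical and diagonal adjacencies
  ni <- max(1, i - 1):min(n, i + 1); nj <- max(1, j - 1):min(m, j + 1)
  nbrs <- z[ni, nj]
  agree_old <- sum(nbrs == z[i, j]) - 1   # exclude the pixel itself
  agree_new <- sum(nbrs == z_new)         # old self value contributes zero
  d_loglik <- alpha * ((z_new == y[i, j]) - (z[i, j] == y[i, j]))
  d_logprior <- beta * (agree_new - agree_old)
  if (log(runif(1)) < d_loglik + d_logprior) z[i, j] <- z_new
  z
}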

(a) Original Image
(b) Focused Region of Image
Figure 6: Spatial model example: (a) original ice floe image with highlighted region. (b) close-up of focused region.

First running a preliminary Metropolis-Hastings algorithm, we use the range of explored energy values divided evenly across an initial set of bins. The algorithm subsequently splits bins 6 times (with splitting stopped once the algorithm reaches the extremes of the reaction coordinate values), resulting in a finer partition at the algorithm’s conclusion. For both algorithms, we run parallel chains with the same model parameters $\alpha$ and $\beta$. Due to the flip-one-pixel approach, we suppress adaptive proposals for this example. In contrast to the mixture modeling example, in this example the target density is fairly straightforward to calculate, so it is a good worst-case comparison to demonstrate the additional time taken by the proposed algorithm. For this example, across 10 runs the Wang-Landau bias update added a 23% price to each iteration on average relative to plain Metropolis-Hastings. However, as we will show, the exploration is significantly better, justifying the slight additional cost. Figure 7 shows a subset of the last posterior realizations from one chain of each algorithm. We see that the proposed Wang-Landau algorithm encourages much more exploration of alternate modes. The corresponding average state explored over all chains (after burn-in) is shown in Figure 8. From this we see that Wang-Landau induces exploration of the mode in the top-left of the region in question, as well as a bridge between the central ice floes. In conclusion, while flip-one-pixel Metropolis-Hastings is incapable of exploring the modes in the posterior caused by the presence/absence of large ice floes, the proposed algorithm encourages exploration of these modes, even in the presence of high between-pixel correlation. While Higdon (1998) develops a custom-tailored MCMC solution to overcome the inability of Metropolis-Hastings to adequately explore the posterior density in Ising models, we employ PAWL – a general-purpose automatic density exploration algorithm – to achieve similar results.

Figure 7: Spatial model example: states explored over 400,000 iterations for Metropolis-Hastings (top) and proposed algorithm (bottom).
Figure 8: Spatial model example: average state explored with Metropolis-Hastings (left) and PAWL after importance sampling (right).

5 Discussion and Conclusion

The proposed algorithm, PAWL, has at its core the Wang-Landau algorithm which, despite wide-spread use in the physics community, has only recently been introduced into the statistics literature. A well-known obstacle in implementing the Wang-Landau algorithm is selecting the bins through which to discretize the state space; in response, we have developed a novel adaptive binning strategy. Additionally, we employ an adaptive proposal mechanism to further reduce the number of user-defined parameters. Finally, to improve the convergence speed of the algorithm and to exploit modern computational power, we have developed a parallel interacting chain version of the algorithm which proves efficient in stabilizing the algorithm. Through a host of examples, we have demonstrated the algorithm’s ability to conduct density exploration over a wide range of distributions, both continuous and discrete. While a suite of custom-purposed MCMC tools exists in the literature for each of these models, the proposed algorithm handles each within the same unified framework.

As practitioners in fields ranging from image-processing to astronomy turn to increasingly complex models to represent intricate real-world phenomena, the computational tools to approximate these models must grow accordingly. In this paper, we have proposed a general-purpose algorithm for automatic density exploration. Due to its fully adaptive nature, we foresee its application as a black-box exploratory MCMC method aimed at practitioners of Bayesian methods. While statisticians are well-accustomed to performing exploratory analysis in the modeling stage of an analysis, the notion of conducting preliminary general-purpose exploratory analysis in the Monte Carlo stage (or more generally, the model-fitting stage) of an analysis is an area which we feel deserves much further attention. As models grow in complexity, and endless model-specific Monte Carlo methods are proposed, it is valuable for the practitioner to have a universally applicable tool to throw at their problem before embarking on custom-tuned, hand-built Monte Carlo methods. Towards this aim, the authors have published an R package ("PAWL") to minimize user effort in applying the proposed algorithm to their specific problem.

Acknowledgements

Luke Bornn is supported by grants from NSERC and The Michael Smith Foundation for Health Research. Pierre E. Jacob is supported by a PhD fellowship from the AXA Research Fund. The authors also acknowledge the use of R and ggplot2 in their analyses (R Development Core Team, 2010; Wickham, 2009). Finally, the authors are thankful for helpful conversations with Yves Atchadé, François Caron, Nicolas Chopin, Faming Liang, Eric Moulines, and Christian Robert.


Appendix A Details of Proposed Algorithm

In Algorithm 3 we detail PAWL, fusing together a Wang-Landau base with adaptive binning, interacting parallel chains, and an adaptive proposal mechanism. In comparison to the generalized Wang-Landau algorithm (Algorithm 2), when a flat histogram is reached, the distribution of particles within bins is tested to determine whether a given bin should be split. In addition, a suite of interacting chains is employed; hence the former single chain is replaced by chains , each defined on the state space . All the chains are used to update the bias , as described in Section 3.3.

The chains are moved using an adaptive mechanism driven by the Metropolis-Hastings acceptance rate, as explained in Section 3.4. While we present Algorithm 3 with an adaptive proposal variance, it may also be implemented with an adaptive mixture proposal as described in Section 3.4. Note that when a bin is split, the desired frequency of the new bins can be set to some reduced value, say half the desired frequency of the original bin for each; in fact, in the numerical experiments we do exactly that. However, for notational and pedagogical simplicity, we present here the algorithm where the desired frequency of each bin is equal to at iteration .


1:  Run a preliminary exploration of the target, e.g. using adaptive MCMC, and determine an energy range.
2:  Partition the state space into regions along a reaction coordinate , the default choice being .
3:   set .
4:  Choose an initial proposal standard deviation .
5:  Choose the frequency with which to check for a flat histogram.
6:  Choose a decreasing sequence , typically , to update the proposal standard deviation.
7:  Choose a decreasing sequence , typically , to update the bias.
8:  Sample , an initial distribution.
9:  for  to  do
10:     Sample from , a transition kernel with invariant distribution , parametrized by the proposal standard deviation .
11:     Update the proposal standard deviation: , where is the last acceptance rate.
12:     Set .
13:     Update the proportions: .
14:     Every -th iteration, check the distribution of samples within each bin, extending the range if necessary. For example, if and a new minimum value of is found, extend the first bin to include this value.
15:     for  do
16:        if bin should be split then
17:           Create two sub-bins covering bin , assign to each a weight equal to .
18:           Set , extend .
19:        end if
20:     end for
21:     if “flat histogram”:  then
22:        Set .
23:        Reset .
24:     end if
25:     Update the bias: .
26:     Normalize the bias: .
27:  end for
Algorithm 3 Proposed Density Exploration Algorithm
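To complement the pseudo-code, the following is a compact single-chain sketch, in R, of the Wang-Landau core of Algorithm 3. The toy target, bin boundaries, flat-histogram tolerance, and all other constants are illustrative assumptions; the adaptive bin splitting and the interacting parallel chains of the full algorithm are omitted for brevity, and the proposal adaptation here reacts to each single accept/reject decision rather than to a windowed acceptance rate.

```r
## Single-chain Wang-Landau core (sketch, illustrative constants).
set.seed(1)
log_target <- function(x) dnorm(x, 0, 1, log = TRUE)  # assumed toy target
bins <- c(-Inf, -6, -4, -2, Inf)  # assumed bins along the log density
d <- length(bins) - 1
find_bin <- function(x) findInterval(log_target(x), bins)

log_theta <- rep(0, d)   # log bias, one entry per bin
nu <- rep(0, d)          # visit counts since the last flat histogram
gamma <- 1               # stepsize, halved at each flat histogram
sigma <- 1               # proposal standard deviation
x <- 0                   # initial state
target_rate <- 0.234     # assumed desired acceptance rate

for (iter in 1:50000) {
  ## Metropolis-Hastings step targeting the biased density pi(x)/theta(J(x))
  y <- x + sigma * rnorm(1)
  log_alpha <- (log_target(y) - log_theta[find_bin(y)]) -
    (log_target(x) - log_theta[find_bin(x)])
  accepted <- log(runif(1)) < log_alpha
  if (accepted) x <- y

  ## Robbins-Monro update of the proposal standard deviation
  sigma <- sigma * exp(iter^(-0.6) * (as.numeric(accepted) - target_rate))

  ## Bias and visit-count updates for the current bin
  k <- find_bin(x)
  nu[k] <- nu[k] + 1
  log_theta[k] <- log_theta[k] + gamma
  log_theta <- log_theta - max(log_theta)  # normalize for numerical stability

  ## Flat-histogram check: all bin frequencies close to the uniform 1/d
  if (iter %% 1000 == 0 && max(abs(nu / sum(nu) - 1 / d)) < 0.25 / d) {
    gamma <- gamma / 2   # decrease the stepsize
    nu <- rep(0, d)      # reset the visit counts
  }
}
```

In the full algorithm, the chains share a single bias vector and all contribute their bin visits at every iteration, which is what stabilizes the bias estimate; a single chain is used above purely to keep the sketch short.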

Appendix B Algorithm Convergence

The convergence of the Wang-Landau algorithm using a deterministic stepsize, also called the Stochastic Approximation Monte Carlo (SAMC) algorithm, and of stochastic approximation algorithms in general, has been well studied in Atchadé and Liu (2010); Andrieu and Moulines (2006); Andrieu et al. (2005); Liang and Wu (2011); see Chapter 7 of Liang et al. (2010) for a recent introduction. Since writing this manuscript, we have also learned of recent convergence and ergodicity results for a parallel implementation of the SAMC algorithm (Liang and Wu, 2011). However, as noted in Liang and Wu (2011), these results fail to explain why the parallel version is more efficient than the single-chain algorithm in practice; instead, they prove the consistency of the algorithm as the number of iterations goes to infinity, and the asymptotic normality of the bias , for any fixed number of chains. We believe that precise statements on the impact of the number of chains on the stabilization of the bias would require an analysis of the Feynman–Kac semigroup associated with the algorithm, similar to what is commonly used to study the impact of the number of particles in Sequential Monte Carlo methods (Del Moral, 2004).

Each of our proposed improvements adds a level of complexity to the proof of the algorithm's consistency. First and foremost, we use the flat histogram criterion, and thus the usual assumptions on the stepsize of the stochastic approximation are not easily verified (e.g., the assumptions of Theorem 2.3 in Andrieu et al. (2005) and conditions (A4) in Liang and Wu (2011)). Indeed, if the flat histogram criterion were never met, the stepsize would stay constant. We rely on a result in Jacob and Ryder (2011) proving that the criterion is met in finite time, for any precision threshold ; therefore the results of Andrieu et al. (2005), and thus those of Liang and Wu (2011), apply even when one uses the flat histogram criterion.

Finally, with our inclusion of an adaptive proposal as a Robbins-Monro-style update, the algorithm still remains in the class of stochastic approximation algorithms. One could pragmatically stop the adaptation of the proposal distribution after some iteration and fall back on the study of a homogeneous Metropolis–Hastings algorithm. However, we believe that the algorithm could be studied in the same framework as Andrieu and Moulines (2006); Liang and Wu (2011), where the stochastic approximation would now control both the bias and the standard deviation of the proposal .

Appendix C Trimodal Target Example

We introduce a toy target distribution to aid in demonstrating some of the concepts discussed earlier, especially the bin splitting strategy. Consider the 2-dimensional trimodal target described in Liang et al. (2007):

(4)

a mixture of three bivariate Gaussian distributions. The corresponding log density is shown in Figure 9. This density, while low-dimensional and with only three modes, is known to be difficult to sample from with Metropolis-Hastings (e.g., Gilks et al. 1998). Firstly, with a different correlation structure in each mode, an adaptive algorithm might conform to one mode, missing (or poorly sampling from) the other two. Secondly, a low-density region separates the modes; as such, low-variance proposals might be incapable of jumping between them. Figure 10 displays the biased target (Equation (1), using the log density as reaction coordinate) and one of its marginals, emphasizing the effect of biasing in improving the ability of the algorithm to explore the density. Here the plot is created using computationally expensive fine-scale numerical integration.
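As a concrete illustration, the following R sketch evaluates a trimodal Gaussian-mixture log density of the type used in this example, together with its Wang-Landau biased counterpart. The mode locations, covariances, and weights below are illustrative placeholders, not the exact constants of target (4).

```r
## Trimodal mixture log density (sketch with placeholder constants).
library(mvtnorm)  # for the multivariate normal density

mu <- list(c(-8, -8), c(6, 6), c(0, 0))       # assumed mode locations
Sigma <- list(diag(2),                         # assumed, distinct correlations
              matrix(c(1, 0.9, 0.9, 1), 2, 2),
              matrix(c(1, -0.9, -0.9, 1), 2, 2))
w <- rep(1 / 3, 3)                             # equal mixture weights

log_target <- function(x) {
  dens <- vapply(1:3, function(k)
    w[k] * mvtnorm::dmvnorm(x, mean = mu[[k]], sigma = Sigma[[k]]),
    numeric(1))
  log(sum(dens))
}

## Biased density of Equation (1): divide by the weight of the bin that
## the reaction coordinate (here, the log density itself) falls into.
log_biased <- function(x, bins, log_theta) {
  e <- log_target(x)
  e - log_theta[findInterval(e, bins)]
}
```

Plotted over a grid, log_biased with well-estimated bin weights should reproduce the flattened surface of Figure 10: the three former modes sit in a nearly level region, so a random-walk proposal can drift between them.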

Figure 9: Trimodal example: log density function of the target distribution (4). The modes are separated by areas where the log density is very low, making exploration difficult.
(a) Biased target density
(b) Marginal of biased and unbiased target density
Figure 10: Trimodal example: (left) log density of the biased target distribution (Equation (1)). The former modes are now all located in a fairly flat region, allowing for straightforward exploration. (Right) marginal log density of one component of the (symmetric) trimodal target. The solid line shows the target probability density function and the dashed line shows an approximation of the marginal of the biased distribution of Equation (1). The biased marginal is flatter, hence easier to explore than the original target distribution.

Initial exploration is performed by an adaptive MCMC algorithm, run with parallel chains and iterations. The proposal of this first run targets a specific acceptance rate of , as described in Section 3.4. The explored energy range is expanded and divided into initial bins. In all examples, we use . The proposed algorithm is run with parallel chains for iterations. Figure 11(a) shows the regions recovered by the chains; the chains have moved easily between the modes, even though the distribution of the starting points was not well spread over the support of the target density. In this case, to reflect a possible lack of information about the target support, we draw the starting points of all the chains from a distribution, hence exclusively in one of the three modes. In this setting the free energy SMC method described in Chopin and Jacob (2010) fails to recover the target distribution accurately; specifically, the central mode is over-sampled because many particles never reach the outer modes. However, if the initial distribution is well spread over the target support, the SMC algorithm recovers the modes. Figure 11(b) shows the points generated by iterations of the adaptive Metropolis–Hastings algorithm (already used in the initial exploration), also using chains. We see that the exploration was less successful, with the bottom-left mode hardly visited at all, although the same number of point-wise density evaluations was performed as for the proposed algorithm.

(a) Scatter plot of the chains generated by the proposed algorithm
(b) Scatter plot of the chains generated by an adaptive Metropolis–Hastings algorithm
Figure 11: Trimodal example: results of the proposed algorithm: (a) Scatter plot of all samples, before normalization/importance sampling, (b) Scatter plot of samples generated by an adaptive Metropolis–Hastings algorithm using the same number of chains and iterations.

Over the course of the iterations, the bins were split three times. Here the chosen strategy was to split a bin if less than of its points were situated in one half of the bin. Figure 12 illustrates the effects of the binning strategy. Figure 12(a) shows the trace plot of the estimators , with the iterations at which bins are split marked by vertical lines. After each split, the dimension of increases, but the figure shows that the new estimators quickly stabilize. After the last split, around iteration , the number of bins stays constant. Figure 12(b) shows the histogram of the log density evaluations of the chain points, with solid vertical lines showing the initial bins and dashed vertical lines showing the bins that have been added during the run. We see that the bin splits induce more uniformity within bins, and hence across the entire reaction coordinate range, aiding movement and exploration.
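The following R sketch shows one way such a splitting test could be implemented; the imbalance threshold is an illustrative assumption standing in for the elided value above.

```r
## Decide whether a bin should be split, based on the reaction coordinate
## values of the recent samples that fell into it. Bins with an infinite
## boundary (the extremes) are never split, matching the stopping rule.
should_split <- function(energies, lower, upper, threshold = 0.2) {
  if (!is.finite(lower) || !is.finite(upper)) return(FALSE)
  mid <- (lower + upper) / 2
  frac_left <- mean(energies < mid)    # share of points in the left half
  ## Split when either half holds less than the threshold share of points,
  ## i.e. when the samples are far from uniform within the bin.
  min(frac_left, 1 - frac_left) < threshold
}

## Splitting inserts the midpoint of bin k as a new bin boundary.
split_bin <- function(bin_boundaries, k)
  sort(c(bin_boundaries, (bin_boundaries[k] + bin_boundaries[k + 1]) / 2))
```

Under a rule of this form, uneven bins such as those visible in Figure 12(b) keep triggering splits until the samples spread roughly evenly within every bin.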

(a) Trace plot of , the log penalties, with vertical lines indicating the bin splits.
(b) Histogram of the energy values computed during the algorithm. Vertical full lines show the initial bins and dashed lines show the cuts that have been added dynamically.
Figure 12: Trimodal example: histograms of the log density values of all the chain points just before the iterations at which the splitting mechanism is triggered. The number of bins increases automatically along the iterations.

References

  1. Andrieu, C. and Moulines, E. (2006). On the ergodicity properties of some adaptive MCMC algorithms. The Annals of Applied Probability, 16(3):1462–1505.
  2. Andrieu, C., Moulines, E., and Priouret, P. (2005). Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization, 44(1):283–312.
  3. Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373.
  4. Atchadé, Y., Fort, G., Moulines, E., and Priouret, P. (2011). Adaptive Markov Chain Monte Carlo: Theory and Methods, chapter 2, pages 33–53. Cambridge University Press, Cambridge, UK.
  5. Atchadé, Y. and Liu, J. (2010). The Wang-Landau algorithm in general state spaces: applications and convergence analysis. Statistica Sinica, 20:209–233.
  6. Banfield, J. and Raftery, A. (1992). Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87(417):7–16.
  7. Baragatti, M., Grimaud, A., and Pommeret, D. (2012). Parallel tempering with equi-energy moves. Statistics and Computing, pages 1–17.
  8. Besag, J. and Green, P. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society. Series B (Methodological), 55(1):25–37.
  9. Brockwell, A. (2006). Parallel Markov chain Monte Carlo simulation by pre-fetching. Journal of Computational and Graphical Statistics, 15(1):246–261.
  10. Brockwell, A., Del Moral, P., and Doucet, A. (2010). Sequentially interacting Markov chain Monte Carlo methods. The Annals of Statistics, 38(6):3387–3411.
  11. Byrd, J. (2010). Parallel Markov Chain Monte Carlo. PhD thesis, University of Warwick.
  12. Casarin, R., Craiu, R., and Leisen, F. (2011). Interacting multiple try algorithms with different proposal distributions. Statistics and Computing, pages 1–16.
  13. Chopin, N. and Jacob, P. (2010). Free energy Sequential Monte Carlo, application to mixture modelling. Bayesian Statistics 9.
  14. Chopin, N., Lelievre, T., and Stoltz, G. (2012). Free energy methods for Bayesian statistics: Efficient exploration of univariate Gaussian mixture posteriors. Statistics and Computing, 22(4):897–916.
  15. Craiu, R. V., Rosenthal, J., and Yang, C. (2009). Learn from thy neighbor: Parallel-chain and regional adaptive MCMC. Journal of the American Statistical Association, 104(488):1454–1466.
  16. Del Moral, P. (2004). Feynman-Kac Formulae. Springer.
  17. Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436.
  18. Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society. Series B (Methodological), 56(2):363–375.
  19. Edwards, R. and Sokal, A. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical Review D, 38(6):2009–2012.
  20. Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models. Springer.
  21. Geyer, C. (1991). Markov chain Monte Carlo maximum likelihood. In Keramigas, E., editor, Proceedings of the 23rd Symposium on the Interface, pages 156–163, Fairfax. Interface Foundations.
  22. Gilks, W., Roberts, G., and Sahu, S. (1998). Adaptive Markov chain Monte Carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054.
  23. Guan, Y. and Krone, S. (2007). Small-world MCMC and convergence to multi-modal distributions: From slow mixing to fast mixing. The Annals of Applied Probability, 17(1):284–304.
  24. Haario, H., Saksman, E., and Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242.
  25. Higdon, D. (1998). Auxiliary variable methods for Markov chain Monte Carlo with applications. Journal of the American Statistical Association, 93(442):585–595.
  26. Jacob, P. E. and Ryder, R. J. (2011). The Wang-Landau algorithm reaches the Flat Histogram criterion in finite time. ArXiv e-prints.
  27. Jasra, A., Holmes, C. C., and Stephens, D. A. (2005). MCMC and the label switching problem in Bayesian mixture models. Statistical Science, 20:50–67.
  28. Jasra, A., Stephens, D., and Holmes, C. (2007). On population-based simulation for static inference. Statistics and Computing, 17(3):263–279.
  29. Kou, S., Zhou, Q., and Wong, W. (2006). Equi-energy sampler with applications in statistical inference and statistical mechanics. The Annals of Statistics, 34(4):1581–1619.
  30. Lee, A., Yau, C., Giles, M., Doucet, A., and Holmes, C. (2010). On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Journal of Computational and Graphical Statistics, 19(4):769–789.
  31. Liang, F. (2005). A generalized Wang-Landau algorithm for Monte Carlo computation. Journal of the American Statistical Association, 100(472):1311–1327.
  32. Liang, F. (2009). Improving SAMC using smoothing methods: Theory and applications to Bayesian model selection problems. The Annals of Statistics, 37(5B):2626–2654.
  33. Liang, F., Liu, C., and Carrol, R. (2010). Advanced Markov chain Monte Carlo methods. Wiley Online Library.
  34. Liang, F., Liu, C., and Carroll, R. (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association, 102(477):305–320.
  35. Liang, F. and Wu, M. (2011). Population Stochastic Approximation MCMC algorithm and its weak convergence. Technical Report, Texas A&M University.
  36. Marin, J. and Robert, C. (2007). Bayesian core: a practical approach to computational Bayesian statistics. Springer Verlag.
  37. Marinari, E. and Parisi, G. (1992). Simulated tempering: a new Monte Carlo scheme. EPL (Europhysics Letters), 19:451.
  38. McDonald, G. and Schwing, R. (1973). Instabilities of regression estimates relating air pollution to mortality. Technometrics, 15(3):463–481.
  39. Neal, R. (2001). Annealed importance sampling. Statistics and Computing, 11(2):125–139.
  40. Neal, R. (2003). Slice sampling. Annals of Statistics, 31(3):705–741.
  41. Niederreiter, H. (1992). Random number generation and quasi-Monte Carlo methods. Society for Industrial Mathematics.
  42. R Development Core Team (2010). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
  43. Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4):731–792.
  44. Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods. Springer.
  45. Roberts, G., Gelman, A., and Gilks, W. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120.
  46. Roberts, G. and Rosenthal, J. (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18:349–367.
  47. Schmidler, S. (2011). Exploration vs. exploitation in adaptive MCMC. Adap’ski Invited Presentation.
  48. Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B, 62(4):795–809.
  49. Suchard, M. and Rambaut, A. (2009). Many-core algorithms for statistical phylogenetics. Bioinformatics, 25(11):1370–1376.
  50. Swendsen, R. and Wang, J. (1986). Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 57(21):2607–2609.
  51. Swendsen, R. and Wang, J. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86–88.
  52. Wang, F. and Landau, D. (2001a). Determining the density of states for classical statistical models: A random walk algorithm to produce a flat histogram. Physical Review E, 64(5):56101.
  53. Wang, F. and Landau, D. (2001b). Efficient, multiple-range random walk algorithm to calculate the density of states. Physical Review Letters, 86(10):2050–2053.
  54. Wei, W., Erenrich, J., and Selman, B. (2004). Towards efficient sampling: Exploiting random walk strategies. In Proceedings of the National Conference on Artificial Intelligence, pages 670–676. AAAI Press.
  55. Welford, B. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3):419–420.
  56. Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. Springer New York.
  57. Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Goel, P. and Zellner, A., editors, Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, pages 233–243. North-Holland.