Adaptive Metropolis-Hastings Sampling using Reversible Dependent Mixture Proposals

# Adaptive Metropolis-Hastings Sampling using Reversible Dependent Mixture Proposals

Minh-Ngoc Tran
University of New South Wales, Australia
minh-ngoc.tran@unsw.edu.au
Michael K. Pitt
Department of Economics
University of Warwick, United Kingdom
M.Pitt@warwick.ac.uk
Robert Kohn
University of New South Wales, Australia
r.kohn@unsw.edu.au
###### Abstract

This article develops a general-purpose adaptive sampler that approximates the target density by a mixture of multivariate densities. The adaptive sampler is based on reversible proposal distributions each of which has the mixture of multivariate densities as its invariant density. The reversible proposals consist of a combination of independent and correlated steps that allow the sampler to traverse the parameter space efficiently as well as allowing the sampler to keep moving and locally exploring the parameter space. We employ a two-chain approach, in which a trial chain is used to adapt the proposal densities used in the main chain. Convergence of the main chain and a strong law of large numbers are proved under reasonable conditions, and without imposing a Diminishing Adaptation condition. The mixtures of multivariate densities are fitted by an efficient Variational Approximation algorithm in which the number of components is determined automatically. The performance of the sampler is evaluated using simulated and real examples. Our autocorrelated framework is quite general and can handle mixtures other than multivariate .

Keywords. Ergodic convergence; Markov Chain Monte Carlo; Metropolis-within Gibbs composite sampling; Multivariate mixtures; Simulated annealing; Variational Approximation.

## 1 Introduction

Suppose that we wish to sample from a target distribution using a Metropolis-Hastings sampling method. For a traditional Metropolis-Hastings algorithm, where the proposal distribution is fixed in advance, it is well known that the success of the sampling method depends heavily on how the proposal distribution is selected. It is challenging to develop non-adaptive proposals in several types of problems. One example is when the target density is highly non-standard and/or multimodal. A second example is when the parameters are structural and deeply embedded in the likelihood so that it is difficult to differentiate the likelihood with respect to the likelihood; see for example Schmidl et al. (2013) who consider dynamic model with a regression function that is obtained as a solution to a differential equation. In such cases adaptive sampling, which sequentially updates the proposal distribution based on the previous iterates, has been shown useful. See, e.g., Haario et al. (1999, 2001); Roberts and Rosenthal (2007, 2009); Holden et al. (2009) and Giordani and Kohn (2010).

Our article constructs a general-purpose adaptive sampler that we call the Adaptive Correlated Metropolis-Hastings (ACMH) sampler. The sampler is described in Section 4. The first contribution of the article is to propose a two-chain approach to construct the proposal densities, with the iterates of the first chain used to construct the proposal densities used in the second (main) chain. The ACMH sampler approximates the target density by a sequence of mixtures of multivariate densities. The heavy tails of and mixture of distributions is a desirable property that a proposal distribution should have. Each mixture of distribution is fitted by a Variational Approximation algorithm which automatically selects the number of components. Variational Approximation is now well known as a computationally efficient method for estimating complex density functions; see, e.g., McGrory and Titterington (2007) and Giordani et al. (2012). An attractive property of Variational Approximation that makes it suitable for constructing proposal distributions is that it can locate the modes quickly and efficiently.

The second, and main, contribution of the article is to introduce in Section 3 a method to construct reversible proposal densities, each of which has the mixture of approximation as its invariant density. The proposal densities consist of both autocorrelated and independent Metropolis-Hastings steps. Independent steps allow the sampler to traverse the parameter space efficiently, while correlated steps allow the sampler to keep moving, (i.e., avoid getting stuck), while also exploring the parameter space locally. If the approximating mixture is close to the target, then the reversible proposals introduced in our article will allow the sampler to move easily and overcome the low acceptance rates often encountered by purely independent proposals. Note that the reversible correlated proposals we introduce are quite general and it is only necessary to be able to generate from them but it is unnecessary to be able to evaluate them. This is an important property for the correlated mixtures of that we use.

The third contribution of the paper is to show in Section 2 that the ACMH sampler converges uniformly to the target density and to obtain a strong law of large numbers under reasonable conditions, without requiring that Diminishing Adaptation holds. As pointed out above, it is difficult to impose Diminishing Adaptation in a natural way for general proposals such as those in our paper.

Adaptive sampling algorithms can be categorized into two groups: exploitative and exploratory algorithms (Schmidler, 2011). Exploitative algorithms attempt to improve on features of the target distribution that have already been seen by the sampler, i.e. based on the past iterations to improve on what have been discovered by the past iterations. The adaptive samplers of Haario et al. (2001) and Giordani and Kohn (2010) belong to this group. The second group encourages exploring the whole support of the target, including tempering (Geyer and Thompson, 1995), simulated annealing (Neal, 2001) and the Wang-Landau algorithm (Wang and Landau, 2001a, b). It is therefore useful to develop a general-purpose adaptive sampler that can be both exploratory and exploitative. An important feature of the ACMH sampler is the use, in Section 5, of an exploratory stage to initialize the adaptive chain. In particular, we describe in this paper how to use simulated annealing (Neal, 2001) to initialize the chain. Giordani and Kohn (2010) suggest initializing the chain using either random walk steps or by using a Laplace approximation, neither of which work well for targets that are multimodal and/or have a non-standard support. Initializing by an exploratory algorithm helps the sampler initially explore efficiently the features of the target, and these features will be improved in the subsequent exploitative stage. Section 6 shows that such a combination makes the ACMH sampler work well for challenging targets where many other samplers may fail.

A second feature of the ACMH sampler is that it uses a small proportion of adaptive random walk steps in order to explore tail regions around local modes more effectively. A third feature of the ACMH sampler is that it uses Metropolis-within-Gibbs component-wise sampling to make the sampler move more efficiently in high dimensions, where it is often difficult to efficiently move the whole state vector as a single component because of the large differences in the values of the target and the proposal at the current and proposed states. See Johnson et al. (2011) and its references for some convergence results on such composite MCMC sampling.

Section 6 presents simulation studies and Section 7 applies the adaptive sampling scheme to estimate the covariance matrix for a financial data set and analyze a spam email data set.

There are two important and immediate extensions of our work, which are discussed in Section 8. The first is to more general reversible mixture proposals. The second is to problems where the likelihood cannot be evaluated explicitly, but can be estimated unbiasedly.

Giordani and Kohn (2010) construct a general-purpose adaptive independent Metropolis-Hastings sampler that uses a mixture of normals as the proposal distribution. Their adaptive sampler works well in many cases because it is flexible and so helps the proposal approximate the target distribution better. They use the -means algorithm to estimate the mixtures of normals. Although this method is fast, using independent Metropolis-Hastings steps only may result in a low acceptance rate that may not explore the local features of the state space as effectively. In addition, the automatic selection of components used in our article works appreciably better than using BIC, as done in Giordani and Kohn (2010).

de Freitas et al. (2001) use Variational Approximation to first estimate the target density and then use this approximation to form a fixed proposal density within an MCMC scheme. There are two problems with this approach, which are discussed in Section 4.1.

Holden et al. (2009) provide a framework for constructing adaptive samplers that ensures ergodicity without assuming Diminishing Adaptation. Not imposing Diminishing Adaptation is attractive because it means that the adaptation can continue indefinitely if new features of the target are learned. However, we believe that the Holden et al. (2009) framework is unnecessarily limited for two reasons. First, it does not use the information about the target obtained from dependent steps. Second, it augments the history on which the adaptation is based by using proposals that are rejected by the Metropolis-Hastings method; such inclusions typically lead to suboptimal adaptive proposals in our experience.

Hoogerheide et al. (2012) also use a multivariate mixture of proposal densities which they fit using the EM algorithm. However, they stop adapting after a preliminary stage and do not have a principled way of choosing the number of components. In addition, their approach is harder to use when the likelihood cannot be computed, but can be estimated unbiasedly. Schmidl et al. (2013) propose an adaptive approach based on a vine copula but they stop adapting after a fixed number of adaptive steps. We note that it is straightforward to extend the multivariate mixture of approach in Hoogerheide et al. (2012) and the copula approach in Schmidl et al. (2013) to reversible proposals and to a two chain adaptive solution as in our article.

Craiu et al. (2009) also emphasize the importance of initial exploration and combine exploratory and exploitative stages using parallel an inter-chain adaptation algorithm to initially explore the sample space. They run many chains in parallel and let them interact in order to explore the modes, while we use annealed sampling with an SMC sampler. Their main contribution is the regional adaption algorithm, but they only discuss the case with two regions/two modes. It does not seem straightforward to extend the algorithm to general multimodal cases and it seems difficult to determine the number of regions/modes.

## 2 The adaptive sampling framework

Let be the target distribution with corresponding density . We consider using a two-chain approach to adapt the proposal densities. The idea is to run simultaneously two chains, a trial chain and a main chain , where the proposal densities used by the main chain are estimated from the iterates of the trial chain , and are not based on the past iterates of chain . We refer to the past iterates of used to estimate the proposal as the history vector, and denote by the history vector obtained after iteration , which is used to compute the proposal density at iteration . We consider the following general adaptive sampling algorithm.

Two-chain sampling algorithm.

1. Initialize the history , the proposal and initialize , of the trial chain and main chain , respectively.

2. For

• Update the trial chain:

• Generate a proposal .

• Compute

 α′n(z′,x′n−1,Hn−1)=min(1,π(z′)qHn−1(x′n−1|z′)π(x′n−1)qHn(z′|x′n−1)).
• Accept with probability and set , otherwise set .

• Set if is accepted, otherwise set .

• Update the main chain:

• Generate a proposal .

• Compute

 αn(z,xn−1)=min(1,π(z)qHn−1(xn−1|z)π(xn−1)qHn−1(z|xn−1)).
• Set with probability , otherwise set .

The ACMH sampler is based on this two-chain sampling framework and its convergence is justified by Corollary 3.

### 2.2 Convergence results

This section presents some general convergence results for adaptive MCMC. Suppose that is a sample space with . is a -field on . Suppose that is the proposal density used at the th iteration of the Metropolis-Hastings algorithm. In the two-chain algorithm above, is the density which is estimated based on the history . Let be the Markov chain generated by the Metropolis-Hastings algorithm and the distribution of the state with the initial state . Denote by the Markov transition distribution at the th iteration. We have the following convergence results whose proofs are in the Appendix.

###### Theorem 1 (Ergodicity).

Suppose that

 pi(xi−1,dxi)≥βΠ(dxi),for alli≥1, (1)

with . Then

 ∥Pn(x0,⋅)−Π(⋅)∥TV≤2(1−β)n→0asn→∞, (2)

for any initial , where denotes the total variation distance.

###### Theorem 2 (Strong law of large numbers).

Suppose that is a bounded function on and that (1) holds. Let . Then,

 Snn→EΠ(h)almost surely. (3)
###### Corollary 1.

Suppose that is bounded. Theorems 1 and 2 hold for each of the following cases.

1. with for all and .

2. The proposal density is a mixture of the form

 qi(z|x)=ωq1,i(z|x)+(1−ω)q2,i(z|x),0<ω<1,

and with for all .

3. Let and be transition distributions, each has stationary distribution . Suppose that is based on the proposal density , where for all and . The transition at the th iterate is a mixture of the form

 pi(xi−1,dxi)=ωp1,i(xi−1,dxi)+(1−ω)p2,i(xi−1,dxi),0<ω<1.
4. Let and be the transition distributions as in (iii). The transition at the th iterate is a composition of the form

 pi(xi−1,dxi)=p1,ip2,i(xi−1,dxi)=∫zp1,i(xi−1,dz)p2,i(z,dxi),

or

 pi(xi−1,dxi)=p2,ip1,i(xi−1,dxi)=∫zp2,i(xi−1,dz)p1,i(z,dxi),
5. Let and be the transition distributions as in (iii). The transition at the th iterate is a composition of repetitions of and repetitions of , i.e. or .

## 3 Reversible proposals

In the Metropolis-Hastings algorithm, it is desirable to have a proposal that depends on the current state, is reversible, and marginally has an invariant distribution of choice. We refer to such a proposal as a reversible proposal. The dependence between the current and proposed states helps in moving locally and helps the chain mix more rapidly and converge. As will be seen later, reversibility simplifies the acceptance probability in the MH algorithm and makes it close to one if the marginal distribution is a good approximation to the target. Reversibility also means that it is only necessary to be able to generate from the proposal distribution, and it is unnecessary to be able to evaluate it. This is important in our case because the proposal densities are mixtures of conditional densities with dependence parameters that are integrated out. Section 3.1 provides the theory for reversible proposals that we use in our article. Section 3.2 introduces a reversible multivariate density and a reversible mixture of multivariate densities. The proofs of all results in this section are in the Appendix.

### 3.1 Some theory for reversible proposals

###### Definition 1 (Reversible transition density).

Suppose that is a density in and is a transition density from to such that

 ζ(x)T(z|x)=ζ(z)T(x|z) \ for any x and z.

Then,

 ζ(z)=∫ζ(x)T(z|x)dx,

and we say that is a reversible Markov transition density with invariant density .

The following two lemmas provide some properties of reversible transition densities that are used in our work.

###### Lemma 1 (Properties of reversible transition densities).

Suppose that is a density in . Then, in each of the cases described below, is a reversible Markov transition density with invariant density .

• .

• Let be a partition of and define where is an indicator variable that is 1 if and is zero otherwise.

• Suppose that for each parameter value , is a reversible Markov transition density with invariant density . Let , where is a probability measure in .

The next lemma gives a result on a mixture of transition densities each having its own invariant density.

###### Lemma 2 (Mixture of reversible transition densities).
• Suppose that for each , is a reversible Markov transition kernel with invariant density . Define the mixture density and the mixture of transition densities as

 ζ(z)=G∑k=1ωkζk(z) and T(z|x)=G∑k=1ω(k|x)Tk(z|x),

where , and for all . Then, is a reversible Markov transition density with invariant density .

• If the invariant densities are all the same, then for all and .

• Suppose that is a reversible Markov transition density with invariant density . Then , , is a reversible Markov transition density with invariant density .

###### Corollary 2 (Mixture of conditional densities).

Let be a partition of and define for . Then, each is a reversible density with invariant density , and Lemma 2 holds.

The next lemma shows the usefulness of using reversible Markov transition densities as proposals in a Metropolis-Hastings scheme.

###### Lemma 3 (Acceptance probability for a reversible proposal).

Consider a target density . We propose , given the state from the reversible Markov transition density , which has invariant density . Then, the acceptance probability of the proposal is

 α(z,x)=min{1,π(z)ζ(x)π(x)ζ(z)}.

The lemma shows that the acceptance probability has the same form as for an independent proposal from the invariant density , even though the proposal may depend on the previous state and on parameters that are in the transition density but not in . This means the following: (i) To compute the acceptance probability it is only necessary to be able to simulate from and it is unnecessary to be able to compute it. This is useful for our work where we cannot evaluate analytically because it is a mixture over a parameter . (ii) The acceptance probability will be high if the invariant density is close to the target density . In fact, if , then the acceptance probability is 1.

### 3.2 Constructing reversible t distributions

Pitt and Walker (2006) construct a univariate Markov transition which has a univariate distribution as the invariant distribution. We now extend this approach to construct a Markov transition density with a multivariate density as its invariant distribution. This reversible multivariate process is new to the literature. We then generalize it to the case in which the invariant distribution is a mixture of multivariate distributions. We denote by the -variate density with location vector , scale matrix and degrees of freedom .

###### Lemma 4 (Reversible t transition density).

Let and , where is the set of parameters , is a correlation coefficient,

 ˜μ(x) =(1−ρ)μ+ρx,˜Σ(x)=νν+d(1−ρ2)(1+1ν(x−μ)′Σ−1(x−μ))Σ,˜ν=ν+d. (4)

Then,

• For each fixed , is a reversible Markov transition density with invariant density .

• Let where is a probability measure. Then, is a reversible Markov transition density with invariant density .

We now follow Lemma 2 and define a reversible transition density that is a mixture of reversible transition densities. Suppose and , where and are defined in terms of as in (4). Let and .

###### Lemma 5 (Mixture of t transition densities).

Let

 gM(z;ψ) =G∑k=1ωkζk(z;ψk) and TCMHgM(z|x;ψ)=G∑k=1ω(k|x)Tk(z|x;ψk), (5)

where . Then,

• is a mixture of densities and is a mixture of transition densities with a reversible transition density with invariant ;

• if the proposal density is and the target density is , then the Metropolis-Hastings acceptance probability is

 α(z,x) =min{1,π(z)π(x)gM(x;ψ)gM(z;ψ)}. (6)

We note that it is straightforward to generate from , given , because it is a mixture of transition densities, each of which is a mixture. However, it is difficult to compute because it is difficult to compute each of as it is a density mixed over . However, by Part (ii) of Lemma 5, it is straightforward to compute the acceptance probability .

### 3.3 Constructing reversible mixtures of conditional t densities

Suppose that the vector has density which is a mixture of multivariate densities as in equation (5), with . We partition as , where is and we partition the and conformally, as

 μk =(μk,Aμk,B)andΣk=(Σk,AAΣk,ABΣk,BAΣk,BB),k=1,…,G.

Then, , where , and .

###### Lemma 6 (Reversible mixture of conditional densities).

Define the transition kernel

 TBSgM(z|x;ψ)=G∑k=1ωkζk(x;ψk)gM(x;ψ)ζk(zA|zB;ψk)I(zB=xB). (7)

Then is reversible with invariant density .

## 4 The adaptive correlated Metropolis-Hastings (ACMH) sampler

The way we implemented the ACMH sampler is now described, although Sections 2.2 and 3 alow us to construct the sampling scheme in a number of ways. Section 4.1 outlines the Variational Approximation method for estimating mixtures of multivariate distributions. Sections 4.2 and 4.3 discuss component-wise sampling and adaptive random walk sampling. Section 4.4 summarizes the ACMH sampler.

### 4.1 Estimating mixtures of multivariate t densities

Given a mixture of multivariate densities, Section 3.2 describes a method to construct reversible mixtures of . This section outlines a fast Variational Approximation method for estimating such a mixture of .

Suppose that is the likelihood computed under the assumption that the data generating process of is a mixture of density , with its parameters. Let be the prior, then Bayesian inference is based on the posterior , which is often difficult to handle. Variational Approximation approximates this posterior by a more tractable distribution by minimizing the Kullback-Leibler divergence

 \rm KL(qva) =∫logqva(θ)p(θ|D)qva(θ)dθ ,

among some restricted class of densities . Because,

 logp(D) = ∫logp(θ)p(D|θ)qva(θ)qva(θ)dθ+∫logqva(θ)p(D|θ)qva(θ)dθ, (8)

minimizing is equivalent to maximizing

 L(λ) = ∫logp(θ)p(D|θ)qva(θ|λ)qva(θ|λ)dθ. (9)

Because of the non-negativity of the Kullback-Leibler divergence term in (8), (9) is a lower bound on . We refer the reader to Tran et al. (2012) who describe in detail how to fit mixtures of using Variational Approximation, in which the number of components is automatically selected using the split and merge algorithm by maximizing the lower bound. The accuracy of Variational Approximation is experimentally studied in Nott et al. (2012). See also Corduneanu and Bishop (2001) and McGrory and Titterington (2007) who use Variational Approximation for estimating mixtures of normals.

Denote by the maximizer of (9), the posterior is approximated by . From our experience, the estimate often has a small tail, but it can quickly locate the mode of the true posterior . In our context, with the mode of is used as the mixture of in the ACMH sampler.

We now explain more fully the main difference between the Variational Approximation approach to constructing proposal densities of de Freitas et al. (2001) and our approach. de Freitas et al. (2001) estimate directly as using Variational Approximation, i.e. minimizes

 \rm KL(πva)=∫logπva(x)π(x)πva(x)dx

among some restricted class of densities, such as normal densities. de Freitas et al. (2001) then use to form the fixed proposal density. The estimate often has much lighter tails than (see, e.g., de Freitas et al., 2001), therefore such a direct use of Variational Approximation estimates for the proposal density in MCMC can be problematic. Another problem with their approach is that needs to be derived afresh for each separate target density and this may be difficult for some targets. In our approach we act as if the target density is a mixture with parameters , and obtain a point estimate of using Variational Approximation. The approach is general because it is the same for all targets, and does not suffer from the problem of light tails.

### 4.2 Metropolis within Gibbs component-wise sampling

In high dimensions, generating the whole proposal vector at the one time may lead to a large difference between the values of the target , and the proposal , at the proposed and current states. This may result in high rejection rates in the Metropolis-Hastings algorithm, making it hard for the sampling scheme to move. To overcome this problem, we use Metropolis within Gibbs component-wise sampling in which the coordinates of are divided into two or more components at each iterate. Without loss of generality, it is only necessary to consider two components, a component that remains unchanged and a complementary component generated conditional on . We will refer to and as the index vectors of and respectively. Let and be the dimensions of and , respectively. We note that and can change at random or systematically from iteration to iteration. See Johnson et al. (2011) for a further discussion of Metropolis within Gibbs sampling.

We can carry out a Metropolis within Gibbs sampling step based on reversible mixtures of conditional distributions as in Section 3.3.

In some applications, there are natural groupings of the parameters, such as the group of mean parameters and the group of variance parameters. Otherwise, the coordinates can be selected randomly. For example, each coordinate is independently included in with probability . The number of coordinates in should be kept small in order for the chain to move easily. We find that it is useful to set so that the expected value of is about 10, i.e., .

### 4.3 The adaptive random walk Metropolis-Hastings proposal

Using mixtures of for the proposal distribution helps to quickly and efficiently locate the modes of the target distribution. In addition, it is useful to add some random walk Metropolis-Hastings steps to explore more effectively the tail regions around the local modes; see Section 6.1.

We use the following version of an adaptive random walk step, which takes into account the potential multimodality of the target. Let be the current state and the latest mixture of as in (5). Let , i.e. is the index of the component of the mixture that is most likely to belong to. Let be a -variate normal density with mean vector and covariate matrix . The random walk proposal density is , where if and is equal to otherwise. The scaling factor (see Roberts and Rosenthal, 2009).

### 4.4 Description of the ACMH sampler

This section gives the details of the ACMH sampler, which follows the framework in Section 2 to ensure convergence. The sampler consists of a reversible proposal density together with a random walk proposal. We shall first describe the reversible proposal density. Let be the heavy tailed component and the mixture of densities described in (5), where is the history vector obtained after iteration based on the trial chain and is the estimate of based on . Let be the reversible transition density whose invariant density is (see Part (i) of Lemma 1 ). Let be the correlated reversible transition density defined in equation (5) and let be the component-wise mixture reversible transition density defined in (7). We now define the mixtures,

 TgM(z|x;ˆψ(Hn−1)) =(1−γ)TCMHgM(z|x;ˆψ(Hn−1))+γTBSgM(z|x;ˆψ(Hn−1)) q∗(z;ˆψ(Hn−1)) =β0g0(z)+(1−β0)gM(z;ˆψ(Hn−1)) Tq∗(z|x;ˆψ(Hn−1)) =β0g0(x)q∗(x;ˆψ(Hn−1))Tg0(z|x)+(1−β0)gM(x;ˆψ(Hn−1))q∗(x;ˆψ(Hn−1))TgM(z|x,;ˆψ(Hn−1)),

and

 q(z|x;ˆψ(Hn−1)) =δq∗(z;ˆψ(Hn−1))+(1−δ)Tq∗(z|x;ˆψ(Hn−1)) =δβ0g0(z)+δ(1−β0)gM(z;ˆψ(Hn−1))+(1−δ)β0g0(x)q∗(x;ˆψ(Hn−1))Tg0(z|x)+ (10) +(1−δ)(1−β0)gM(x;ˆψ(Hn−1)q∗(x;ˆψ(Hn−1))TgM(z|x;ˆψ(Hn−1)), (11)

with . Note that is the probability of generating an independent proposal and is related to the probability of doing component-wise sampling. Then,

###### Lemma 7.
• is a reversible Markov transition density with invariant density .

• is a reversible Markov transition density with invariant density .

• is a reversible Markov transition density with invariant density .

• If is a proposal density with target density , then the acceptance probability is

 α(z,x;ˆψ(Hn−1)) =min{1,π(z)π(x)q∗(x;ˆψ(Hn−1))q∗(z;ˆψ(Hn−1))}.

Description of the ACMH proposal density. The ACMH sampler consists of a reversible proposal density together with a random walk proposal. Let be the transition kernel at iteration of the main chain . Denote by , the transition kernel with respect to the reversible proposal and the random walk proposal respectively.

• at (see Corollary 1 (iv)). That is, a composition of a correlated Metropolis-Hastings step with reversible proposal and a random walk step is performed after every iterations. In our implementation we take .

• In all the other steps, we take .

Convergence of the ACMH sampler. If we choose such that for some , then . By Corollary 1, we have a formal justification of the convergence of the ACMH sampler.

###### Corollary 3.

Suppose that

 g0(z)≥β0π(z)for allz∈E, (12)

for some . Then Theorems 1 and 2 hold for the ACMH sampler for any history .

For a general target , can be informally selected such that it is sufficiently heavy-tailed to make (12) hold. In Bayesian inference, is a posterior density that is proportional to with the prior and the likelihood. Suppose that the likelihood is bounded; this is the case if the maximum likelihood estimator exists. If is a proper density and we can generate from it, then we can set and it is straightforward to check that the condition (12) holds. The boundedness condition (12) is satisfied in all the examples in this paper.

We now briefly discuss the cost of running the two chain algorithm as in our article, compared to running a single chain adaptive algorithm. If the target density is inexpensive to evaluate, then the cost of running the two chain sampler is very similar to the cost of running just one chain because the major cost is incurred in updating the proposal distribution. If it is expensive to evaluate the target, then we can run the two chains in parallel on two (or more) processors. This is straightforward to do in programs such as Matlab because multiple processors are becoming increasingly common on modern computers.

Section 5 discusses the initial proposal , the history vector and .

We run the adaptive sampling scheme in two stages. Adaptation in the first stage is carried out more intensively by re-estimating the mixture of distributions after every 2000 iterations, and then every 4000 iterations in the second stage. When estimating the mixtures of in the first stage, we let the Variational Approximation algorithm determine the number of components. While in the second stage, we fix the number of components at that number in the mixture obtained after the first stage. This makes the procedure faster and helps to stabilize the moves. In addition, it is likely that the number of components is unchanged in this second stage.

#### 4.4.2 Selecting the control parameters

When the mixture of approximation becomes closer to the target, we expect the proposal to be close to . We can do so by setting as increases and setting a small value to . In our implementation we take , and a sequence as follows. Let be the length of the chain we wish to generate and suppose that . We set for and . In our implementation we take . For the correlation parameter , we simply select the probability measure as the distribution. These values were set after some experimentation. However, it is likely that we can further improve the efficiency of the ACMH sampler with a more careful (and possibly adaptive) choice of these control parameters.

## 5 Initial exploration

The purpose of the ACMH sampler is to deal with non-standard and multimodal target distributions. The sampler works more efficiently if the adaptive chain starts from an initial mixture distribution that is able to roughly locate the modes. We therefore attempt to initialize the history vector by a few draws generated approximately from by an algorithm that can explore efficiently the whole support of the target, and then estimate the initial mixture of based on these draws. Our paper uses simulated annealing (Neal, 2001) to initialize the sampler. An alternative is to use the Wang-Landau algorithm (Wang and Landau, 2001a, b). However, this algorithm requires the user to partition the parameter space appropriately which is difficult to do in many applications.

Simulated annealing. Simulated annealing works by moving from an easily-generated distribution to the distribution of interest through a sequence of bridging distributions. Annealed sampling has proved useful in terms of efficiently exploring the support of the target distribution (Neal, 2001). Let be some easily-generated distribution, such as a distribution, and a sequence of real numbers such that . A convenient choice is . Let

 ηt(x)=π0(x)1−ψtπ(x)ψt.

Note that is the initial distribution and is the target . We sample from this sequence of distributions using the sequential Monte Carlo method (see, e.g. Del Moral et al., 2006; Chopin, 2004), as follows.

1. Generate , where is the number of particles.

2. For

• Reweighting: compute the weights

 ˜wi=ηt(xi)ηt−1(xi),   wi=˜wi∑Npj=1˜wj.
• Resampling: sample from using stratified sampling. Let be the resampled particles.

• Markov move: for , generate , , where is a Markov kernel with invariant distribution , . is the burnin number.

• Set .

The above sequential Monte Carlo algorithm produces particles that are approximately generated from the target (Del Moral et al., 2006). We can now initialize the history vector using these particles and by the mixture of estimated from . Typically, should take a large value for multimodal and high-dimensional targets. In the default setting of the ACMH sampler, we set , and . The initial distribution is a multivariate distribution with location , scale matrix and 3 degrees of freedom. However, it is useful to estimate and from a short run of an adaptive random walk sampler, and we follow this approach in the real data examples.

In the default setting of the ACMH sampler, we select the heavy-tailed component as except that all the degrees of freedom of the component of are set to 1, so that the boundedness condition (12) is likely to be satisfied. However, in all the examples below, is context-specified to make sure that (12) holds.

## 6 Simulations

A common performance measure for an MCMC sampler is the integrated autocorrelation time (IACT). For simplicity, consider first the univariate case and let be the generated iterates from the Markov chain. Then the IACT is defined as

 IACT=1+2∞∑t=1ρt,

where is the autocorrelation of the chain at lag . Provided that the chain has converged, the mean of the target distribution is estimated by whose variance is

 Var(¯x)=σ2M(1+2M−1∑t=1(1−tM)ρt)≈σ2M(1+2∞∑t=1ρt)=IACT⋅σ2M,

where is the variance of the target distribution. This shows that the IACT can be used as a measure of performance and that the smaller the IACT, the better the sampler. Following Pitt et al. (2012), we estimate the IACT by

 ˆIACT=1+2L∗∑t=1ˆρt,

where are the sample autocorrelations, and , with the first index such that where is the sample size used to estimate . That is, is the lowest index after which the estimated autocorrelations are randomly scattered about 0. When we take, for simplicity, the average IACT over the coordinates, or the maximum IACT.

Another performance measure is the squared jumping distance, (see, e.g., Pasarica and Gelman, 2010, and the references in that paper). For the univariate case,

 Sq distance =1N−1