Principle of Detailed Balance and Convergence Assessment of Markov Chain Monte Carlo methods and Simulated Annealing

# Principle of Detailed Balance and Convergence Assessment of Markov Chain Monte Carlo methods and Simulated Annealing

Ioana A. Cosma and Masoud Asgharian111Ioana A. Cosma is a doctoral student in the Department of Statistics, University of Oxford, 1 South Parks Road, Oxford, OX1 3TG, United Kingdom (email: cosma@stats.ox.ac.uk); Masoud Asgharian is Associate Professor, Department of Mathematics and Statistics, McGill University, Burnside Hall, 805 Sherbrooke W., Montreal, Quebec, Canada, H3A 2K6 (email: masoud@math.mcgill.ca). This research was partially supported by research grants from NSERC and FQRNT. The authors thank Russell Steele for insightful discussions on the topic.
###### Abstract

Markov Chain Monte Carlo (MCMC) methods are employed to sample from a given distribution of interest, , whenever either does not exist in closed form, or, if it does, no efficient method to simulate an independent sample from it is available. Although a wealth of diagnostic tools for convergence assessment of MCMC methods have been proposed in the last two decades, the search for a dependable and easy to implement tool is ongoing. We present in this article a criterion based on the principle of detailed balance which provides a qualitative assessment of the convergence of a given chain. The criterion is based on the behaviour of a one-dimensional statistic, whose asymptotic distribution under the assumption of stationarity is derived; our results apply under weak conditions and have the advantage of being completely intuitive. We implement this criterion as a stopping rule for simulated annealing in the problem of finding maximum likelihood estimators for parameters of a 20-component mixture model. We also apply it to the problem of sampling from a 10-dimensional funnel distribution via slice sampling and the Metropolis-Hastings algorithm. Furthermore, based on this convergence criterion we define a measure of efficiency of one algorithm versus another.

KEY WORDS: Metropolis-Hastings; slice sampling; Markov chain Central Limit Theorem; detailed balance; ergodic Markov chain; equilibrium; stationary distribution.

1. INTRODUCTION

Let be a given distribution such that either does not exist in closed form or no efficient method to simulate an independent sample from it is available. Suppose that interest lies in the expected value of a random variable , denoted by , where has distribution . Monte Carlo sampling methods (Hammersley and Handscomb 1964) such as rejection sampling, importance sampling or sampling-importance resampling (SIR) approximate the value of by sampling from a distribution that closely resembles (Smith and Gelfand 1992). Although for low dimensional distributions it is oftentimes possible to find sampling distributions that provide estimates to within given accuracy with low computational cost, these sampling methods suffer greatly from the curse of dimensionality.

The need to approximate the value of high dimensional integrals arising in statistical mechanics led to the development of MCMC sampling methods. The first MCMC method, known today as the Metropolis Monte Carlo algorithm, was proposed by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) as a general method for studying the equilibrium properties of systems consisting of many interacting particles. The algorithm simulates the behaviour of the system under equilibrium, and the expected value of a given property is approximated by ergodic averages based on these simulations. In statistical terms, the Metropolis Monte Carlo algorithm constructs an ergodic Markov chain with stationary distribution , i.e. as the number of iterations tends to , the conditional distribution of given the value of converges to regardless of the starting distribution , where has distirubtion (in notation: ).

Hastings (1970) generalized the procedure of proposing the next move given . His algorithm, known as the Metropolis-Hastings algorithm, transforms an arbitrary stochastic matrix into a -reversible one, and only requires that be known up to a normalizing constant. An equally popular MCMC algorithm is the Gibbs sampler, introduced by Geman and Geman (1984) with an application to image restoration. This algorithm proposes the next move by sampling from the full conditional distributions and, unlike the Metropolis-Hastings algorithm, accepts each proposal with probability 1. Two well-known variants on Gibbs sampling are the data-augmentation algorithm of Tanner and Wong (1987) and the substitution sampling algorithm of Gelfand and Smith (1990).

The goal of MCMC methods is to produce an approximate i.i.d. sample
from , where , , and is known as the number of ‘burn-in’ iterations to be removed from the beginning of the chain. Analysing the output of an MCMC method consists of assessing convergence to sampling from , convergence to i.i.d. sampling, and convergence of empirical averages of the form to as . Robert and Casella (2004) argue that while convergence to is not of major concern since it can only be achieved asymptotically, the issues of convergence to i.i.d. sampling and of convergence of empirical averages are strongly interrelated and depend on the mixing speed of the chain. By definition, a chain whose elements converge rapidly to weakly correlated draws from the stationary distribution is said to possess good mixing speed. Therefore, the mixing speed of a chain is determined by the degree to which the chain escapes the influence of the starting distribution and by the extent to which it explores the high density regions of the support of .

Recent research in MCMC methodology has focused on developing, on one hand, samplers that escape quickly the attraction of the starting distribution as well as that of local modes, and, on the other hand, convergence assessment criteria for analysing the mixing speed of a given chain. A recent sampling algorithm which exploits the idea of jumping between states of similar energy to facilitate efficient sampling is the equi-energy sampler of Kou et al.(2006). Robert (1995,1998), Cowles and Carlin (1996), and Brooks and Roberts (1998) present a comprehensive review of the practical implementation of convergence criteria and the mathematics underlying them. Liu (2001), Neal (1993), Brooks (1998), and Kass, Carlin, Gelman, and Neal (1998) offer an in-depth introduction to MCMC methodology and its applications, as well as discussions on the issues surrounding it.

The common view among researchers and practitioners is that developing a good sampler or a reliable convergence criterion is problem-specific. A sampler with good mixing speed when sampling from a relatively smooth, low-dimensional distribution might become trapped in a well of low probability when sampling from a distribution having many local modes. Similarly, a convergence criterion which proves reliable for analysing a given MCMC output might incorrectly assess the convergence of a chain that has only explored a subset of the entire support space. Our interest lies in convergence assessment, in particular, in identifying lack of convergence. We define a one-dimensional statistic and derive an intuitive criterion based on the principle of detailed balance that provides a qualitative assessment on the convergence of a given MCMC chain.

In Section 2 we recall basic notions and results from the theory of Markov chains, which we subsequently use in Section 3 to derive the asymptotic distribution of our proposed statistic under the assumption of stationarity. In the same section, we discuss two possible implementations of our criterion, one using the asymptotic distribution, the other experimental as a qualitative tool. Section 4 discusses two applications: one as a stopping rule for simulated annealing, an algorithm for function maximization applied to the problem of finding maximum likelihood estimators (Azencott 1992), the second as a graphical tool for comparing the performances of Metropolis-Hastings versus slice sampling for the problem of sampling from a 10-dimensional funnel distribution. All computations were performed using code written in C++. We conclude in Section 5 with general remarks, comparisons, and criticisms.

2. PRELIMINARIES

Let be a Markov chain with state space and transition probability matrix . We refer the reader to Medhi (1994), Norris (1997), and Jones (2004) for details and proofs. For the purpose of the convergence criterion we present in this article, we restrict our attention to finite Markov chains.

Let be the transition probability from state to state in steps. The Ergodic Theorem states that if is irreducible and aperiodic, then the limits exist and are independent of the initial state for all and is the stationary distribution of . The chain is called ergodic.

###### Definition 1

(Principle of detailed balance) Transition probability matrix and probability distribution are said to be in detailed balance, or, equivalently, the principle of detailed balance is said to hold, if

###### Definition 2

A Markov chain with irreducible transition probability matrix and initial distribution , i.e. , is reversible if, for all , the chain is a Markov chain with transition probability matrix and initial distribution .

Norris (1997) proves that if is irreducible, then it is reversible if and only if and are in detailed balance, where is the initial distribution of . The following definitions are needed to introduce the Markov chain Central Limit Theorem (Jones 2004).

###### Definition 3

Let be a nonnegative function and a nonnegative decreasing function on the positive integers such that

 ∥Pn(i,⋅)−π(⋅)∥≤M(i)γ(n). (1)

Let be a Markov chain on state space with transition probability and stationary distribution . If (1) holds for all with for some , then is geometrically ergodic. If, moreover, is bounded, then is uniformly ergodic. If (1) holds for all with for some , then is polynomially ergodic of order .

###### Theorem 1

The Central Limit Theorem (finite state space) Let be an ergodic Markov chain on state space with stationary distribution . Let be a Borel function. Assume that one of the following conditions holds:

1. is polynomially ergodic of order , and there exists such that almost surely;

2. is polynomially ergodic of order , and where ;

3. is geometrically ergodic and for some ;

4. is geometrically ergodic and ;

5. is geometrically ergodic, satisfies detailed balance and ;

6. is uniformly ergodic and .

Then for any initial distribution,

 √n(¯hn−Eπ(h(X)))\lx@stackrelD→Normal(0,σ2h) as n→∞,

where and

3. DETAILED BALANCE AND CONVERGENCE DIAGNOSTICS

Let be a discrete distribution with finite state space , . Let be an irreducible, aperiodic Markov chain with transition probability matrix and stationary distribution . We say that a chain has reached equilibrium by step if , and such that . Our convergence assessment criterion is based on the principle of detailed balance from statistical mechanics (Chandler 1987). Statistical mechanics is concerned with the study of physical properties of systems consisting of very large number of particles, for example liquids or gases, as these systems approach the equilibrium state, i.e. a uniform, time-independent state. In these terms, the principle of detailed balance states that a physical system in equilibrium satisfies

 πiπj=pjipij=exp(−Ei−EjkT), ∀i,j∈S,

where is the energy of the system in state , is Boltzmann’s constant, is the temperature, and and have the usual interpretation.

We assume that the Markov chain is constructed to satisfy detailed balance. This is oftentimes the case since the principle of detailed balance implies that is the stationary distribution of the chain, and it is easier to check the former than the latter, see for example the discussions on the Metropolis-Hastings (Hastings 1970) and slice sampling algorithms (Neal 2003). We introduce the notion of an energy function , . When implementing simulated annealing, the stationary distribution at temperature is , so the energy function becomes , where is a sequence of decreasing temperatures. Therefore, the equilibrium probability of being in state equals where the normalizing constant is defined as . Define the following approximation to based on a Markov chain of iterations

 ^πi=1nn∑j=1I(Xj=i), ∀i∈S.

The idea of working with indicator functions is similar to that of Raftery and Lewis (1992) who develop a convergence assessment method based on the sequence , for fixed . We point out that, for fixed , the sequence forms a Markov chain, whereas the sequence defined by Raftery and Lewis does not. Brooks et al.(2003) use a similar approach of estimating the stationary distribution by the empirical distribution function obtained from the MCMC output; they derive nonparametric convergence assessment criteria for MCMC model selection by monitoring the distance, as the number of simulations increases, between the empirical mass functions obtained from multiple independent chains.

Our criterion assesses the convergence of the chain by comparing the behaviour of the functions via the statistic

3.1 Theoretical approach

We proceed to derive the distribution of the statistic under the hypothesis that the chain has reached stationarity, i.e. that .

 Vn=nm∑i∈S{fi−1m∑j∈Sfj}2=nm∑i∈S{fi−1mfi−1m∑\lx@stackrelj∈Sj≠ifj}2=nm∑i∈S{ai′f}2,

where and is an -dimensional column vector with th entry equal to and the remaining entries equal to . Define the following dimensional matrix

 A=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝a1′a2′⋮⋮am′⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1−1m−1m−1m…−1m−1m1−1m−1m…⋮⋮⋮⋱⋮⋮−1m……1−1m−1m−1m……−1m1−1m⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,

so .

First, we observe that ,

 (2)

since . Second, we notice that

 fi−Eπfi=^πie−Ei−1Z=^πi−πiZπi, ∀i∈S. (3)

Define , and the -dimensional column vector . From (2) and (3), we obtain that where

 C=A⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1√mZπ10…001√mZπ20⋮⋮⋮⋱⋮0…01√mZπm⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝m−1m3/2e−E10…00⋱0⋮⋮0⋱⋮0…0m−1m3/2e−Em⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

The following result presents the asymptotic distribution of the statistic under the assumption of stationarity.

###### Theorem 2

Under the conditions of Theorem 1, and as , where are the characteristic roots of and are i.i.d.  random variables.

proof: We begin by pointing out that irreducible and aperiodic Markov chains on finite state spaces are uniformly ergodic (Roberts and Rosenthal 2004), so condition (6) of Theorem 1 is satistifed. It follows that for every ,

 Wi,n=√n(^πi−πi)=√n{1nn∑j=1I(Xj=i)−Eπ(I(X1=i))}\lx@stackrelD→Normal(0,σ2i)

as , where

 σ2i=πi(1−πi)+2∞∑j=2[P{I(Xj=i)=1|I(X1=i)=1}πi−π2i]<∞.

By the Cramér-Wold Device (Billingsley 1968, Varadarajan 1958), it follows that as , where is an -dimensional column vector of zeros and is an variance-covariance matrix whose entries are given

 Σ(i,i) = σ2i Σ(i,j) = limn→∞covπ(Wi,n,Wj,n)=limn→∞{1nn∑k=1n∑l=1covπ(I(Xk=i),I(Xl=j))} = −πiπj]+1n∑\lx@stackrelk,l=1l

So, for all ,

 Σ(i,j) = −πiπj+limn→∞πin{n∑\lx@stackrelk,l=1k

The last equality follows from the fact that if a Markov chain satisfies detailed balance, then it is reversible, i.e. for , . Finally, the conditions of the Markov chain Central Limit Theorem guarantee that the infinite summation in the last line is finite.

It then follows that as . Lastly, since , it follows from Lemma 1 in Chernoff and Lehmann (1953) that as , where are the characteristic roots of and are i.i.d.  random variables.

Q.E.D.

###### Example 1

Let the Markov chain be generated by the Metropolis-Hastings algorithm with symmetric proposal probability matrix . The expressions for and can be simplified as follows. Consider the Markov-Bernoulli chain for fixed with transition probability matrix It is shown in Medhi (1994, pp. 101-102) that

 Pj−1i=1a+b(baba)+(1−a−b)j−1a+b(a−a−bb), ∀j≥2.

Now,

 a = b = 1−P{X2=i|X1=i}=1−pii.

Then, provided that ,

 Σ(i,i) = πi(1−πi)+2∞∑j=2πi(1−πi)(pii−πi1−πi)j−1=πi(1−πi)(1+pii−2πi)1−pii, Σ(i,j) = −πiπj+2πi∞∑k=2(Pk−1(i,j)−πj),for i≠j.

3.2 Implementation

Let be an irreducible and aperiodic Markov chain with finite state space and stationary distribution that satisfies detailed balance. A burn-in of draws are discarded, where depends on the rate of convergence of the sampling algorithm on (Brooks 1998). We implement our convergence assessment criterion as a test of hypothesis under the null hypothesis that the chain has reached stationarity by iteration .

For large enough, , and we estimate its distribution using Lyapunov’s Central Limit Theorem (Loève 1963). Since is , is , so and , for . Define ; , and for . Moreover, , so , for . Define . It remains to show that the following condition holds: which is equivalent to showing that

 limk→∞1(2∑ki=1λ2i)3/2k∑i=1|λi|3=0, (4)

since , for . So, provided that condition (4) is satisfied, Lyapunov’s Central Limit Theorem gives the following result for and large enough:

 Vn\lx@stackrelD=k∑i=1λiZ2i∼Normal(k∑i=1λi,2k∑i=1λ2i) approximately. (5)

For the computation of the mean and variance in (5), we resort to the following simplifications

 k∑i=1λi = trace(CΣC′)=m∑i=1[C(i,i)]2Σ(i,i), (6) k∑i=1λ2i = (k∑i=1λi)2−2k∑\lx@stackreli,j=1i

where the first summation in equation (7) is given in (6), and the second is the sum of all the 2-square principal subdeterminants of (Marcus and Ming 1964, p. 22).

We propose a quantitative assessment of convergence via a test of hypothesis at confidence level using the approximate distribution of given in (5) as follows.

1. Obtain an aperiodic, irreducible Markov chain which satisfies the principle of detailed balance: ; discard the first draws.

2. Compute the statistic from the remaining draws and the quantile .

3. If , conclude that the chain has reached stationarity at level and stop; else, continue for an additional iterations and return to step 2, replacing by .

In this article we implement the criterion in the form of a qualitative tool for convergence assessment. We iterate the chain and plot the absolute value of the relative difference, , against the number of iterations , every iterations, . We claim that the chain has reached equilibrium if the relative difference drops below some problem-specific, pre-specified constant . The value of the constant is problem-specific because it depends on the distribution of interest . For a high-dimensional, multi-modal distribution, the value of might need to be very small in order for this analysis to correctly detect lack of convergence to , whereas the same value might be too conservative for a one-dimensional, unimodal distribution.

Based on this implementation of the criterion as a qualitative tool, we can define a measure of efficiency of one algorithm against another. Let be given. Let be the value of the statistic after iterations of algorithm . Let represent the interval, in iterations, at which the statistic is computed for algorithm . The measure of efficiency is defined as

 V(ϵ)1,2=min{kn1:∣∣(V(1)(k−1)n1−V(1)kn1)/V(1)(k−1)n1∣∣<ϵ}min{kn2:∣∣(V(2)(k−1)n2−V(2)kn2)/V(2)(k−1)n2∣∣<ϵ}.

If , we conclude that algorithm 1 is more efficient than algorithm 2 at level ; if , algorithm 2 is more efficient than algorithm 1.

4. APPLICATIONS

4.1 Application 1: multipath changepoint problem

The following application is taken from Asgharian and Wolfson (2001). Let denote the th measurement on patient , where , . To each patient there is associated a possibly distinct changepoint such that measurements are i.i.d.  random variables and measurements are i.i.d. . Let and denote the covariate vector and the regression coefficient vector, respectively, for patient , i.e. . Define parameters and . The goal is to find the maximum likelihood estimators (MLE’s) of and , denoted by and , respectively. We simulate the data with and ; the joint log likelihood is bimodal. We let the parameter space be , assuming zero mass is placed outside this region, and we discretize the space over a grid of width .

We apply the algorithm of simulated annealing, introduced by Kirkpatrick, Gelatt, and Vecchi (1983), which performs function optimization through an iterative improvement approach. The algorithm was developed via an analogy with thermodynamics where a substance is melted by a slow annealing process and equilibrium is attained at each temperature until eventually the substance stabilizes at its lowest-energy state. Similarly, in simulated annealing, a global temperature parameter controls the effects of high probability regions under the distribution of interest . For each in a sequence such that as , an MCMC chain with stationary distribution is generated until equilibrium. As the temperature is lowered following a pre-specified schedule, known as the cooling schedule, the effects become more pronounced and the chain stabilizes at its global maximum value or equivalently, lowest energy state (Neal 1993, Brooks and Morgan 1995). Geman and Geman (1984) show that this convergence is guaranteed under a logarithmic cooling schedule, which unfortunately is too slow to be followed in practice.

We implement the algorithm with a geometric cooling schedule , and and zero burn-in. Simulated annealing with a very fast cooling schedule is known as simulated quenching; refer to Catoni (1992) for a discussion on the design of cooling schedules. For , the function at temperature is given by

The aim is to compare the performance of the Metropolis-Hastings sampler in determining the MLE’s via simulated annealing with two different methods for proposing the next move. In the first method, we draw uniformly from a cube of length centered at the current position, where has the values: for . These values are set retrospectively to obtain an acceptance rate of approximately . In the second method, we propose the next move via univariate slice sampling applied to each variable in turn; this algorithm is described briefly in Subsection 4.2. We use the “stepping-out” procedure with an initial interval size of at each temperature.

At each temperature, we perform 1000 iterations of the Metropolis-Hastings algorithm, computing the value of every iterations. We obtain the following results: , , and , for the first and second methods, respectively, which equal the lowest energy value obtained by a systematic grid search. We conclude that both methods correctly identified the MLE’s. Figures 1 and 2 display the relative difference in variance; sharp drops indicate that the sampler has jumped to previously unexplored regions of the parameter space, i.e. to points for which is significantly different from , thus increasing the value of the variance.

We proceed to simulate 50 datasets; for each, we initialize the two chains from the same randomly chosen point. At each temperature level, we compute the value of every iterations until , with . We remark that this value of is very conservative; ideally, a different value would be employed at each temperature level. We make the following two observations: first, for any given dataset, the lowest energy values reported by the two algorithms differ by at most units in magnitude, and, second, the difference between the lowest energy values found by a systematic search and by simulated annealing is at most . Moreover, we note that the methods required on average 5605 iterations, and 3162 iterations, respectively. Averaged over 50 tests, the measure of efficiency of simulated annealing using Metropolis-Hastings with uniform proposals versus Metropolis-Hastings with slice sampling is approximately , i.e. MCMC with slice sampling is almost twice as efficient as MCMC with uniform proposals.

4.2 Application 2: 10-dimensional funnel

Neal (2003) illustrates the advantage of slice sampling over Metropolis-Hastings in sampling from a 10-dimensional funnel distribution. Slice sampling is an adaptive MCMC method which proceeds in two alternating steps. Given the current position , it samples a value uniformly from the interval . Given , the next position is sampled from an appropriately chosen subset of the horizontal “slice” . Neal (2003) shows that the algorithm produces an ergodic Markov chain with stationary distribution , and that, moreover, due to its adaptive nature, the algorithm sometimes outperforms Metropolis-Hastings and the Gibbs sampler.

Let be a random variable, and let be independent random variables, which, conditional on , have mean 0 and variance . The goal is to obtain an approximate independent sample from the joint distribution of . We initialize the chain as follows: and , for . For each variable, the parameter space is taken to be and it is discretized over a grid of width .

First, we implement the Metropolis-Hastings algorithm with single-variable updates applied to each variable in sequence; one iteration of the chain consists of 1300 updates. For each variable, the proposal distribution is , centered at the current value, with standard deviation of , truncated on the interval . Numbers are rounded to the closest value on the grid. Second, we implement the slice sampling algorithm with single-variable updates; each iteration consists of 120 updates for each variable in sequence. We use the “stepping-out” procedure with an initial interval of size . We compute every iterations until the absolute value of the relative difference is below .

The left column of Figure 3 compares the histograms of the sampled values of with the true probability distribution function; the histograms are based on chains of 4600 and 17200 iterations, respectively. Metropolis-Hastings oversamples negative values of and undersamples positive ones; slice sampling samples correctly in the left tail of the distribution, but undersamples positive values. The right column displays the behaviour of the relative difference in ; the variance function undergoes sharp increases in value under both sampling methods, but stabilizes towards the end of the run. The behaviour of the variance function fails to reflect the incorrect sampling in the tails of the distribution. The plot of the relative difference in variance for the Metropolis-Hastings algorithm indicates that a smaller value of would be more appropriate for assessing convergence. The plots in Figure 4 show that the autocorrelation obtained by slice sampling remains close to zero after 100 iterations, whereas that obtained by Metropolis-Hastings continues to fluctuate even after 1000 iterations. This indicates that the Metropolis-Hastings algorithm converges more slowly than slice sampling. We compute the Raftery and Lewis (1992) convergence diagnostic using the Coda package in R (http://www.r-project.org) obtaining dependence factors of 14 and 18.7 for the Metropolis-Hastings and the slice sampling algorithms, respectively, indicating strong autocorrelation.

Finally, we run eleven parallel chains started from the following quantiles of the marginal distribution of ; we employ the value . We expect the parameter space to be insufficiently explored by both algorithms; however, we are interested in whether this insufficient exploration can be detected from the behaviour of across chains with overdispersed starting points. Pooling the sampled values results in chains of 30800 and 19800 draws, respectively; thus the measure of efficiency of Metropolis-Hastings versus slice sampling is 1.56. Trace plots and histograms indicate that negative values of are oversampled and positive ones are undersampled by both algorithms. Figure 5 is obtained by pooling the sampled values across the eleven chains; the behaviour of under slice sampling poses signs of concern regarding convergence to stationarity (notice the frequent increases in value from iteration 17500 onwards), whereas the value of under Metropolis-Hastings appears stable towards the end of the run. Therefore the behaviour of under slice sampling across eleven chains with overdispersed starting points indicates lack of convergence to stationarity, whereas the behaviour of under Metropolis-Hastings, which is known to allow a more restrictive exploration of the support space, gives misleading results.

5. CONCLUSION

The last fifty years have witnessed the development and rise in popularity, in particular in Bayesian statistical inference, of Markov Chain Monte Carlo methods for simulating from complex probability distributions (Smith and Roberts 1993). For a practitioner who has a finite MCMC output, questions arise regarding how reliable the sample is as a representation of . Although a wealth of convergence diagnostic tools for analysing MCMC output have been proposed over the past decades, their performance, in general, is problem-specific, and developing a dependable, easy to implement tool for convergence assessment continues to be a challenge. This article presents a new convergence assessment method for irreducible, aperiodic Markov chains on discrete spaces obtained by MCMC samplers that satisfy the principle of detailed balance and requirement (4). We introduce a one-dimensional test statistic whose behaviour under the assumption of stationarity is analyzed both theoretically and experimentally, and present a possible implementation of our criterion as a graphical tool for convergence assessment.

In low dimensional problems, the proposed criterion as a qualitative tool assesses convergence satisfactorily; however, in high dimensional problems, the criterion is unreliable for convergence assessment, but can provide useful insight into lack of convergence of the chain to stationarity. In particular, if the variance function experiences sharp increases in value, then it can be concluded that stationarity has not yet been reached; however, if the value of the variance function is stable, then the results are inconclusive. The advantage of our method lies in its attempt to analyse the behaviour of an MCMC chain travelling through a possibly high dimensional space by monitoring the behaviour of a one-dimensional statistic. Lack of convergence to stationarity is correctly assessed by the behaviour of the statistic to the extent to which the sampler explores freely the underlying space. Particularly in high dimensional problems with irregularly shaped distribution functions, we recommend that the MCMC output be analyzed using different values, compared across multiple chains, and that several diagnostic tools be employed.

There exist in the literature at least two convergence assessment criteria based on weighting functions that are very similar to our approach. Ritter and Tanner (1992) propose to detect convergence to the full joint distribution by monitoring convergence of the importance weight , where is the joint distribution of the observations sampled at iteration . They estimate by , where is a sample from . If the chain has converged, the distribution of the weights , based on multiple replications of the chain, will be degenerate about a constant. Zellner and Min (1995) propose a convergence criterion for the Gibbs sampler in the special case that can be partitioned into . They define two criteria based on the weight functions and , where is estimated by , and is the sequence of draws of obtained by Gibbs sampling. They compute the value of these weights at many points in the parameter space and argue that if the chain has converged, then the values of will be close to 0 and those of close to 1. Zellner and Min use asymptotic results from the stationary time series literature to calculate posterior odds for the hypothesis vs. for the -dimensional case, , when the weights are computed at different points in the parameter space.

The main drawback of these methods is the assumption that the transition probability , in the method of Ritter and Tanner, and the conditionals and , in the method of Zellner and Min, exist explicitly. Our method, however, makes no such assumption and estimates , the probability of being in state , by the empirical distribution function. All three methods have the disadvantage of being computationally expensive; the ergodic averages used to approximate various marginal and conditional probabilities (in our method, ) require a large number of summands in order to provide good estimates, so large numbers of iterations, and possibly many replicates of the chain, are needed. Furthermore, since the normalizing constant of is unknown, the functions and the weights of the criterion of Ritter and Tanner might stabilize around an incorrect value if the sampler has failed to explore all the high density regions of the space. For this reason, we recommend to run multiple replicates of the chain started from different regions of the space. The criterion of Zellner and Min also gives misleading results if the space is poorly explored and the weights are computed at points that come from low density regions. Finally, our criterion has an intuitive graphical representation, very similar to that proposed by Ritter and Tanner, and, whereas the criterion of Zellner and Min uses multivariate weight functions, our criterion is based on a one-dimensional statistic regardless of the dimension of the underlying space, thus offering a dimensionality reduction approach to the problem of convergence assessment in high dimensional spaces.

An interesting alternative to approximating a continuous state space by a discrete grid is to sample the continuous state-space Markov chain and to apply the discretization method developed by Guihenneuc-Jouyaux and Robert (1998). Provided that the continous chain is Harris-recurrent, the method defines renewal times based on the visiting times to one of disjoint small sets in the support space. By subsampling the underlying chain at the renewal times, the method builds a homogeneous Markov chain on the finite state space . Our propoposed criterion can then be applied to the finite chain; it would be interesting to explore whether the convergence assessment extends to the continous Markov chain.

## References

• [1] Asgharian, M. and Wolfson, D. B. (2001) “Modeling covariates in multipath changepoint problems: Modeling and consistency of the MLE,” The Canadian Journal of Statistics, 29, 4, 515-528.
• [2] Azencott, R. (ed.) (1992) Simulated annealing: parallelization techniques, New York: Wiley.
• [3] Billingsley, P. (1968) Convergence of Probability Measures, New York: John Wiley & Sons, Inc.
• [4] Brooks, S. P. (1998) “Markov chain Monte Carlo method and its application,” The Statistician, 47, 69-100.
• [5] Brooks, S. P., Giudici, P., and Philippe, A. (2003) “Nonparametric Convergence Assessment for MCMC Model Selection,” Journal of Computational and Graphical Statistics, 12, 1, 1-22.
• [6] Brooks, S. P., and Morgan, B. J. T. (1995) “Optimization using simulated annealing,” The Statistician, 44, 241-257.
• [7] Brooks, S. P., and Roberts, G. O. (1998) “Convergence assessment techniques for Markov chain Monte Carlo,” Statistics and Computing, 8, 319-335.
• [8] Catoni, O. (1992) “Rough large deviation estimates for simulated annealing: application to exponential schedules,” The Annals of Probability, 20, 3, 1109-1146.
• [9] Chandler, D. (1987) Intoduction to Modern Statistical Mechanics, New York: Oxford University Press.
• [10] Chernoff, H. and Lehmann, E. L. (1953) “The use of maximum likelihood estimates in tests for goodness of fit,” The Annals of Mathematical Statistics, 25, 3, 579-586.
• [11] Cowles, M. K., and Carlin, B. P. (1996) “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review,” Journal of the American Statistical Association, 91, 883-904.
• [12] Gelfand, A. E., and Smith, A. F. M. (1990) “Sampling-Based Approaches to Calculating Marginal Densities,” Journal of the American Statistical Association, 85, 398-409.
• [13] Geman, S., and Geman, D. (1984) “Stochastic Relaxation, Gibbs Distribution, and the Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
• [14] Guihenneuc-Jouyaux, C., and Robert, C. P. (1998) “Discretization of Continuous Markov Chains and Markov Chain Monte Carlo Convergence Assessment,” Journal of the American Statistical Association, 93, 443, 1055-1067.
• [15] Hastings, W. K. (1970) “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, 55, 97-109.
• [16] Hammersley, J. M., and Handscomb, D. C. (1964) Monte Carlo methods, London: Methuen.
• [17] Jones, G. (2004) “On the Markov chain central limit theorem”, Probability Surveys, 1, 299-320.
• [18] Kass, R. E., Carlin, B. P., Gelman, A., and Neal, R. M. (1998) “Markov Chain Monte Carlo in Practice: A Roundtable Discussion,” The American Statistician, 52, 93-100.
• [19] Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983) “Optimization by Simulated Annealing,” Science, 220, 671-680.
• [20] Kou, S. C., Zhou, Q., and Wong, W. H. (2006) “Equi-energy Sampler with Applications in Statistical Inference and Statistical Mechanics,” The Annals of Statistics, 34, 4, 1581-1619.
• [21] Loève, M. (1963) Probability Theory, Toronto: D. Van Nostrand Company (Canada), Ltd.
• [22] Liu, J. S. (2001) Monte Carlo strategies in scientific computing, New York: Springer.
• [23] Marcus, M., and Ming, H. (1964) A survey of matrix theory and matrix inequalities, New York: Dover Publications, Inc.
• [24] Medhi, J. (1994) Stochastic Processes, New Delhi: New Age International (P) Ltd., second edition.
• [25] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953) “Equation of State Calculations by Fast Computing Machines,” The Journal of Chemical Physics, 21, 1087-1092.
• [26] Neal, R. M. (1993) Probabilistic Inference Using Markov Chain Monte Carlo Methods, Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
• [27]  (2003) “Slice Sampling,” The Annals of Statistics, 31, 3, 705-767 (with discussion and a rejoinder by the author).
• [28] Norris, J. R. (1997) Markov Chains, New York: Cambridge University Press.
• [29] Raftery, A. E., and Lewis, S. (1992) “How Many Iterations in the Gibbs Sampler?”, in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, 763-773.
• [30] Ritter, C., and Tanner, M. A. (1992) “Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler,” Journal of the American Statistical Association, 87, 861-868.
• [31] Robert, C. P. (1995) “Convergence Control Methods for Markov Chain Monte Carlo Algorithms,” Statistical Science, 10, 3, 231-253.
• [32]  (ed.) (1998) Discretization and MCMC Convergence Assessment, Lecture Notes in Statistics, 135, New York: Springer.
• [33] Robert, C. P., and Casella, G. (2004) Monte Carlo Statistical Methods, New York: Springer-Verlag, second edition.
• [34] Roberts, G. O., and Rosenthal, J. S. (2004) “General state space Markov chains and MCMC algorithms,” Probability Surveys, 1, 20-71.
• [35] Smith, A. F. M., and Gelfand, A. E. (1992) “Bayesian Statistics Without Tears: A Sampling-Resampling Perspective,” The American Statistician, 26, 84-88.
• [36] Smith, A. F. M., and Roberts, G. O. (1993) “Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods,” Journal of the Royal Statistical Society, Ser. B, 55, 3-23.
• [37] Tanner, M. A., and Wong, W. H. (1987) “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, 82, 528-540.
• [38] Varadarajan, V. S. (1958) “A Useful Convergence Theorem”, Sankhya, 20, 221-222.
• [39] Zellner, A., and Min, C. (1995) “Gibbs Sampler Convergence Criteria,” Journal of the American Statistical Association, 90, 921-927.
You are adding the first comment!
How to quickly get a good reply:
• Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
• Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
• Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters