
# On the Geometric Ergodicity of Two-Variable Gibbs Samplers

Aixin Tan (aixin-tan@uiowa.edu), Department of Statistics and Actuarial Science, University of Iowa, Iowa City, IA 52242; Galin L. Jones (galin@umn.edu), School of Statistics, University of Minnesota, Minneapolis, MN 55455; James P. Hobert (jhobert@stat.ufl.edu), Department of Statistics, University of Florida, Gainesville, FL 32611
###### Abstract

A Markov chain is geometrically ergodic if it converges to its invariant distribution at a geometric rate in total variation norm. We study geometric ergodicity of deterministic and random scan versions of the two-variable Gibbs sampler. We give a sufficient condition which simultaneously guarantees both versions are geometrically ergodic. We also develop a method for simultaneously establishing that both versions are subgeometrically ergodic. These general results allow us to characterize the convergence rate of two-variable Gibbs samplers in a particular family of discrete bivariate distributions.

Running title: Geometric Ergodicity

Jones is partially supported by the National Institutes of Health. Hobert is partially supported by the National Science Foundation.

AMS subject classifications: Primary 60J10; secondary 62F15.

Keywords: Geometric ergodicity, Gibbs sampler, Markov chain, Monte Carlo.

## 1 Introduction

Let $\varpi$ be a probability distribution having support $X \times Y$, and let $\varpi_{X|Y}$ and $\varpi_{Y|X}$ denote the associated conditional distributions. We assume it is possible to simulate directly from $\varpi_{X|Y}$ and $\varpi_{Y|X}$. Then there are two Markov chains having $\varpi$ as their invariant distribution, each of which could be called a two-variable Gibbs sampler (TGS). The most common version of a TGS is the deterministic scan Gibbs sampler (DGS), which is now described. Suppose the current state of the chain is $(X_n, Y_n) = (x, y)$; then the next state, $(X_{n+1}, Y_{n+1})$, is obtained as follows.

Iteration of DGS:

1. Draw $X_{n+1} \sim \varpi_{X|Y}(\cdot \mid y)$, and call the observed value $x'$.

2. Draw $Y_{n+1} \sim \varpi_{Y|X}(\cdot \mid x')$.

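The DGS update is easy to express in code. The following Python sketch is our own illustration, not from the paper: `draw_x_given_y` and `draw_y_given_x` stand in for whatever model-specific conditional samplers are available, and the bivariate normal demo (correlation `rho`) is an assumed toy target.

```python
import random

def dgs_step(x, y, draw_x_given_y, draw_y_given_x):
    """One deterministic scan (DGS) iteration: refresh X from its
    conditional given y, then refresh Y given the new x value."""
    x_new = draw_x_given_y(y)      # step 1: x' ~ conditional of X given y
    y_new = draw_y_given_x(x_new)  # step 2: y' ~ conditional of Y given x'
    return x_new, y_new

# Toy target (an assumption for the demo): bivariate normal with
# correlation rho, so X | Y=y ~ N(rho*y, 1 - rho^2), and symmetrically for Y.
rho = 0.5
sd = (1 - rho ** 2) ** 0.5
draw_x = lambda y: random.gauss(rho * y, sd)
draw_y = lambda x: random.gauss(rho * x, sd)

state = (0.0, 0.0)
for _ in range(100):
    state = dgs_step(*state, draw_x, draw_y)
```

Note that the whole state moves on every iteration: both coordinates are refreshed, in a fixed order.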
An alternative TGS is the random scan Gibbs sampler (RGS). Fix $p \in (0, 1)$ and suppose the current state of the RGS chain is $(X_n, Y_n) = (x, y)$. Then the next state, $(X_{n+1}, Y_{n+1})$, is obtained as follows.

Iteration of RGS:

1. Draw $B \sim \mathrm{Bernoulli}(p)$.

2. If $B = 1$, then draw $X_{n+1} \sim \varpi_{X|Y}(\cdot \mid y)$ and set $Y_{n+1} = y$.

3. If $B = 0$, then draw $Y_{n+1} \sim \varpi_{Y|X}(\cdot \mid x)$ and set $X_{n+1} = x$.

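The random scan update differs only in that a coin flip decides which coordinate to refresh. Again a hedged Python sketch with illustrative names; the conditional samplers are assumptions supplied by the user:

```python
import random

def rgs_step(x, y, p, draw_x_given_y, draw_y_given_x, unif=random.random):
    """One random scan (RGS) iteration: with probability p refresh X and
    hold y fixed; otherwise refresh Y and hold x fixed."""
    if unif() < p:                   # B = 1 with probability p
        return draw_x_given_y(y), y
    return x, draw_y_given_x(x)      # B = 0
```

Unlike the DGS, a single RGS iteration changes only one coordinate of the state.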
Despite the simple structure of either TGS, these algorithms are widely applicable in the posterior analysis of complex Bayesian models. A TGS also arises naturally when $\varpi$ is created via data augmentation techniques (Hobert, 2011; Tanner and Wong, 1987).

Inference based on $\varpi$ often requires calculation of an intractable expectation. Let $g: X \times Y \to \mathbb{R}$ and let $E_\varpi g$ denote the expectation of $g$ with respect to $\varpi$. If a TGS Markov chain is ergodic (see Tierney, 1994) and $E_\varpi |g| < \infty$, then

$$\bar{g}_n := \frac{1}{n} \sum_{i=0}^{n-1} g(X_i, Y_i) \stackrel{a.s.}{\longrightarrow} E_\varpi g \quad \text{as } n \to \infty.$$

Thus estimation of $E_\varpi g$ is simple. However, the estimator $\bar{g}_n$ will be more valuable if we can attach an estimate of the unknown Monte Carlo error $\bar{g}_n - E_\varpi g$. An approximation to the sampling distribution of the Monte Carlo error is available when a Markov chain central limit theorem (CLT) holds:

$$\sqrt{n}\,(\bar{g}_n - E_\varpi g) \stackrel{d}{\longrightarrow} N(0, \sigma_g^2) \quad \text{as } n \to \infty,$$

with $\sigma_g^2 \in (0, \infty)$. The variance $\sigma_g^2$ accounts for the serial dependence in a TGS Markov chain, and consistent estimation of it requires specialized techniques such as batch means, spectral methods or regenerative simulation. Let $\hat{\sigma}_n^2$ be an estimator of $\sigma_g^2$. If, with probability 1, $\hat{\sigma}_n^2 \to \sigma_g^2$ as $n \to \infty$, then an asymptotically valid Monte Carlo standard error is $\hat{\sigma}_n / \sqrt{n}$. These tools allow the practitioner to use the results of a TGS simulation with the same level of confidence that one would have if the observations were a random sample from $\varpi$. For more on this approach the interested reader can consult Geyer (1992), Geyer (2011), Flegal, Haran and Jones (2008), Flegal and Jones (2010), Flegal and Jones (2011), Hobert et al. (2002), Jones et al. (2006), and Jones and Hobert (2001).
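As a concrete illustration of the batch means idea mentioned above, here is a minimal Python sketch; it is our own simplification (the batch count of 30 is an arbitrary choice), and the cited references should be consulted for theoretically grounded versions.

```python
import math

def batch_means_se(draws, n_batches=30):
    """Batch means Monte Carlo standard error for the sample mean of a
    (possibly serially dependent) sequence of real-valued draws."""
    n_used = (len(draws) // n_batches) * n_batches  # drop any remainder
    b = n_used // n_batches                         # batch length
    means = [sum(draws[i * b:(i + 1) * b]) / b for i in range(n_batches)]
    gbar = sum(draws[:n_used]) / n_used
    # b times the sample variance of the batch means estimates sigma_g^2
    sigma2_hat = b * sum((m - gbar) ** 2 for m in means) / (n_batches - 1)
    return math.sqrt(sigma2_hat / n_used)
```

The batch length grows with the simulation size in the strongly consistent versions of this estimator; the fixed batch count here is purely for illustration.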

The CLT will obtain if $E_\varpi |g|^{2+\delta} < \infty$ for some $\delta > 0$ and the Markov chain is rapidly mixing (Chan and Geyer, 1994). In particular, we require that the Markov chain be geometrically ergodic; that is, it must converge to the target in total variation norm at a geometric rate. Under these same conditions, methods such as batch means and regenerative simulation provide strongly consistent estimators of $\sigma_g^2$. Thus establishing geometric ergodicity is a key step in ensuring the reliability of a TGS as a method for estimating features of $\varpi$.

The convergence rate of DGS Markov chains has received substantial attention. In particular, sufficient conditions for geometric ergodicity have been developed for several DGS chains for practically relevant statistical models (see e.g. Hobert and Geyer, 1998; Johnson and Jones, 2010; Jones and Hobert, 2004; Marchev and Hobert, 2004; Roberts and Polson, 1994; Roberts and Rosenthal, 1999; Román and Hobert, 2011; Román, 2012; Rosenthal, 1996; Roy and Hobert, 2007; Tan and Hobert, 2009). The convergence rates of RGS chains have received almost no attention, despite RGS sometimes being useful. Liu, Wong and Kong (1995) did investigate geometric convergence of RGS chains, but the required regularity conditions are daunting and, to our knowledge, have not been applied to practically relevant statistical models. Recently, Johnson, Jones and Neath (2011) gave conditions which simultaneously establish geometric ergodicity of both the DGS chain and the corresponding RGS chain. These authors also conjectured that if the RGS chain is geometrically ergodic, then so is the DGS chain; that is, the qualitative convergence properties of TGS chains coincide. We are not able to resolve this conjecture in general, but in our main application (see Section 5) this is indeed the case.

A TGS chain which converges subgeometrically (i.e., slower than geometric) would not be as useful as one which is geometrically ergodic, although with additional moment conditions it is still possible for a CLT to hold (Jones, 2004). Thus it would be useful to have criteria to check for subgeometric convergence. We are unaware of any previous work investigating subgeometric convergence of TGS Markov chains.

In the rest of this paper, we extend the results of Johnson, Jones and Neath (2011) and provide a condition which can be used to simultaneously establish geometric ergodicity of DGS and RGS Markov chains. We then turn our attention to development of a condition which ensures that both the DGS and RGS chains converge subgeometrically. Finally, we apply our results to a class of bivariate distributions where we are able to characterize the convergence properties of the DGS and RGS chains. But we begin with some Markov chain background and a formal definition of the Markov chains we study.

## 2 Background and Notation

Let $Z$ be a topological space and let $\mathcal{B}(Z)$ denote its Borel $\sigma$-algebra. Also, let $\Phi = \{Z_n : n = 0, 1, 2, \ldots\}$ be a Markov chain having Markov transition kernel $P$. That is, $P: Z \times \mathcal{B}(Z) \to [0, 1]$ is such that for each $A \in \mathcal{B}(Z)$, $P(\cdot, A)$ is a nonnegative measurable function, and for each $z \in Z$, $P(z, \cdot)$ is a probability measure. As usual, $P$ acts to the left on measures so that if $\nu$ is a measure on $\mathcal{B}(Z)$ and $A \in \mathcal{B}(Z)$, then

$$\nu P(A) = \int_Z \nu(dz)\, P(z, A).$$

For any $n \in \{2, 3, \ldots\}$, the $n$-step Markov transition kernel is given by $P^n(z, A) = \int_Z P^{n-1}(z', A)\, P(z, dz')$.

Let $w$ be an invariant probability measure for $P$, that is, $wP = w$. Denote the total variation norm by $\|\cdot\|_{TV}$. If $\Phi$ is ergodic, then for all $z \in Z$ we have $\|P^n(z, \cdot) - w(\cdot)\|_{TV} \to 0$ as $n \to \infty$. Our goal is to study the rate of this convergence. Suppose there exist a real-valued function $M$ on $Z$ and $t \in (0, 1)$ such that for all $z \in Z$,

$$\|P^n(z, \cdot) - w(\cdot)\|_{TV} \le M(z)\, t^n. \tag{1}$$

Then $\Phi$ is geometrically ergodic; otherwise it is subgeometrically ergodic.

### 2.1 Two-variable Gibbs samplers

In this section we define the Markov kernels associated with the DGS and RGS chains described in Section 1. We also introduce a third Markov chain which will prove crucial to our study of the other Markov chains.

Recall that $\varpi$ is a probability distribution having support $X \times Y$. Let $\pi(x, y)$ be a density of $\varpi$ with respect to a measure $\mu = \mu_X \times \mu_Y$. Then the marginal densities are given by

$$\pi_X(x) = \int_Y \pi(x, y)\, \mu_Y(dy)$$

and similarly for $\pi_Y(y)$. The conditional densities are $\pi_{X|Y}(x|y) = \pi(x, y) / \pi_Y(y)$ and $\pi_{Y|X}(y|x) = \pi(x, y) / \pi_X(x)$.

Consider the DGS Markov chain $\Phi_{DGS} = \{(X_n, Y_n) : n = 0, 1, 2, \ldots\}$ and let

$$k_{DGS}(x', y' \mid x, y) = \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y'|x').$$

Then the Markov kernel for $\Phi_{DGS}$ is defined by

$$P_{DGS}((x, y), A) = \int_A k_{DGS}(x', y' \mid x, y)\, \mu(d(x', y')), \qquad A \in \mathcal{B}(X) \times \mathcal{B}(Y).$$

It is well known that the two marginal sequences comprising $\Phi_{DGS}$ are themselves Markov chains (Liu, Wong and Kong, 1994). We now consider the marginal sequence $\Phi_X = \{X_n : n = 0, 1, 2, \ldots\}$ and define

$$k_X(x'|x) = \int_Y \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y|x)\, \mu_Y(dy).$$

The Markov kernel for $\Phi_X$ is

$$P_X(x, A) = \int_A k_X(x'|x)\, \mu_X(dx'), \qquad A \in \mathcal{B}(X).$$

Note that $\Phi_{DGS}$ has $\varpi$ as its invariant distribution, while $\Phi_X$ has the marginal $\varpi_X$ as its invariant distribution.

Finally, consider the RGS Markov chain $\Phi_{RGS} = \{(X_n, Y_n) : n = 0, 1, 2, \ldots\}$. Let $p \in (0, 1)$ and let $\delta$ denote Dirac's delta. Define

$$k_{RGS}(x', y' \mid x, y) = p\, \pi_{X|Y}(x'|y)\, \delta(y' - y) + (1 - p)\, \pi_{Y|X}(y'|x)\, \delta(x' - x).$$

Then the Markov kernel for $\Phi_{RGS}$ is

$$P_{RGS}((x, y), A) = \int_A k_{RGS}(x', y' \mid x, y)\, \mu(d(x', y')).$$

It is easy to show via direct computation that $\varpi$ is invariant for $P_{RGS}$.

It is well known that $\Phi_X$ and $\Phi_{DGS}$ converge to their respective invariant distributions at the same rate (Diaconis, Khare and Saloff-Coste, 2008; Liu, Wong and Kong, 1994; Robert, 1995; Roberts and Rosenthal, 2001). Thus if one is geometrically ergodic, then so is the other. This relationship has been routinely exploited in the study of TGS chains for practically relevant statistical models (cf. Hobert and Geyer, 1998; Johnson and Jones, 2010; Jones and Hobert, 2004; Roy and Hobert, 2007; Tan and Hobert, 2009), since one of the two chains may be easier to analyze than the other. Recently, Johnson, Jones and Neath (2011) connected the geometric ergodicity of $\Phi_{RGS}$ to that of $\Phi_{DGS}$. Thus establishing geometric ergodicity of TGS algorithms often comes down to analyzing $\Phi_X$. This is exactly the approach we take in Sections 3 and 5.

## 3 Conditions for Geometric Ergodicity

In this section we develop general conditions which ensure that $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$ are geometrically ergodic. First we need a couple of concepts from Markov chain theory. Recall the notation from Section 2. That is, $P$ is a Markov kernel on $Z$. Then $P$ is Feller if for any open set $O$, $P(\cdot, O)$ is a lower semicontinuous function. The Markov kernel acts to the right on functions so that for measurable $f: Z \to \mathbb{R}$,

$$Pf(z) = \int_Z f(z')\, P(z, dz').$$

A drift condition holds if there exist a function $U: Z \to [0, \infty)$ and constants $\lambda \in (0, 1)$ and $L < \infty$ satisfying

$$PU(z) \le \lambda U(z) + L \qquad \text{for all } z \in Z. \tag{2}$$

Recall that a function $U$ is said to be unbounded off compact sets if the sublevel set $\{z \in Z : U(z) \le d\}$ is compact for every $d > 0$. If $P$ is Feller, $U$ is unbounded off compact sets and $U$ satisfies (2), then $\Phi$ is geometrically ergodic. See Meyn and Tweedie (1993) and Roberts and Rosenthal (2004) for details; Jones and Hobert (2001) give an introduction to the use of drift conditions.

### 3.1 Two-variable Gibbs samplers

Johnson, Jones and Neath (2011) gave a set of conditions which simultaneously prove that $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$ are geometrically ergodic. We build on their work and show how a drift condition for $P_X$ naturally provides a drift condition for $P_{RGS}$. This allows us to develop an alternative set of conditions which are sufficient for the geometric ergodicity of $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$. The application of this method is illustrated in Section 5.

The following result was essentially proved by Johnson, Jones and Neath (2011), but it was not stated in their paper; see Johnson (2009) for related material. Thus we provide a proof for the sake of completeness. First we set some notation. Suppose $V: X \to [0, \infty)$ and let

$$G(y) = \int_X V(x)\, \pi_{X|Y}(x|y)\, \mu_X(dx).$$

Also, for $c > 0$ define

$$W(x, y) = V(x) + cG(y). \tag{3}$$
###### Lemma 1.

Suppose there exist constants $\lambda \in (0, 1)$ and $L < \infty$ such that for all $x \in X$,

$$P_X V(x) \le \lambda V(x) + L.$$

If $p \in (0, 1)$ and $\frac{p}{1-p} < c < \frac{p}{\lambda(1-p)}$, then there exists $\gamma \in (0, 1)$ such that

$$P_{RGS} W(x, y) \le \gamma W(x, y) + (1 - p)cL.$$
###### Proof.

Notice that

$$\begin{aligned}
\int_Y G(y)\, \pi_{Y|X}(y|x)\, \mu_Y(dy) &= \int_Y \int_X V(x')\, \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y|x)\, \mu_X(dx')\, \mu_Y(dy) \\
&= \int_X V(x') \int_Y \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y|x)\, \mu_Y(dy)\, \mu_X(dx') \\
&= \int_X V(x')\, k_X(x'|x)\, \mu_X(dx') \\
&\le \lambda V(x) + L.
\end{aligned}$$

Since

$$\frac{p}{1-p} < c < \frac{p}{\lambda(1-p)}, \tag{4}$$

there exists $\gamma$ such that

$$(1 - p)(c\lambda + 1) \vee \frac{p(1 + c)}{c} \le \gamma < 1. \tag{5}$$

Then

$$\begin{aligned}
P_{RGS} W(x, y) &= \int_X \int_Y W(x', y')\, k_{RGS}(x', y' \mid x, y)\, \mu_X(dx')\, \mu_Y(dy') \\
&= p \int_X W(x', y)\, \pi_{X|Y}(x'|y)\, \mu_X(dx') + (1 - p) \int_Y W(x, y')\, \pi_{Y|X}(y'|x)\, \mu_Y(dy') \\
&= p \int_X \left[ V(x') + cG(y) \right] \pi_{X|Y}(x'|y)\, \mu_X(dx') + (1 - p) \int_Y \left[ V(x) + cG(y') \right] \pi_{Y|X}(y'|x)\, \mu_Y(dy') \\
&= p(1 + c) G(y) + (1 - p) V(x) + (1 - p) c \int_Y G(y')\, \pi_{Y|X}(y'|x)\, \mu_Y(dy') \\
&\le (1 - p)(c\lambda + 1) V(x) + p(1 + c) G(y) + (1 - p) c L \\
&\le \gamma W(x, y) + (1 - p) c L.
\end{aligned}$$

All that remains is to show that $\gamma > \lambda$. Now

$$\begin{aligned}
\gamma &\ge (1 - p)(c\lambda + 1) && \text{by (5)} \\
&> (1 - p)\left( \frac{p}{1 - p}\, \lambda + 1 \right) && \text{by (4)} \\
&= p\lambda + (1 - p) \\
&> \lambda, && \text{since } \lambda, p \in (0, 1). 
\end{aligned}$$

∎

The following is an easy consequence of Lemma 1 and the material stated at the beginning of this section.

###### Proposition 1.

Suppose $P_X$ and $P_{RGS}$ are Feller. If there exists a function $V: X \to [0, \infty)$ such that both $V$ and the corresponding $W$ (as defined at (3)) are unbounded off compact sets, and there exist constants $\lambda \in (0, 1)$ and $L < \infty$ such that for all $x \in X$,

$$P_X V(x) \le \lambda V(x) + L,$$

then $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$ are geometrically ergodic.

## 4 Conditions for Subgeometric Convergence

Our goal in this section is to develop a condition which ensures that $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$ converge subgeometrically, but first we need a few concepts from general Markov chain theory. Recall the notation of Section 2. In particular, $P$ is a Markov kernel on $Z$ having invariant distribution $w$. A Markov kernel defines an operator on the space of measurable functions that are square integrable with respect to the invariant distribution, denoted $L^2(w)$. Also, let

$$L_{0,1}^2(w) = \left\{ f \in L^2(w) : E_w f = 0 \text{ and } E_w f^2 = 1 \right\}.$$

For $f, g \in L^2(w)$, define the inner product as

$$\langle f, g \rangle = \int_Z f(z) g(z)\, w(dz)$$

and $\|f\| = \sqrt{\langle f, f \rangle}$. The norm of the operator $P$ is

$$\|P\| = \sup_{f \in L_{0,1}^2(w)} \|Pf\|.$$

If $P$ is symmetric with respect to $w$, that is, if

$$P(z, dz')\, w(dz) = P(z', dz)\, w(dz'), \tag{6}$$

then the operator $P$ is self-adjoint, so that $\langle Pf, g \rangle = \langle f, Pg \rangle$. If $P$ is $w$-symmetric, then $\Phi$ is geometrically ergodic if and only if $\|P\| < 1$ (Roberts and Rosenthal, 1997). Moreover, if $Z \sim w$ and $Z' \sim P(Z, \cdot)$, then

$$\|P\| = \sup_{f \in L_{0,1}^2(w)} |\langle Pf, f \rangle| = \sup_{f \in L_{0,1}^2(w)} |E[f(Z') f(Z)]|. \tag{7}$$

The first equality is a property of self-adjoint operators, while the second follows directly from the definition of the inner product.

### 4.1 Two-variable Gibbs samplers

It is easy to see that $P_X$ is $\varpi_X$-symmetric and $P_{RGS}$ is $\varpi$-symmetric, but $P_{DGS}$ is not $\varpi$-symmetric. Because $P_X$ and $P_{RGS}$ are symmetric, the operator theory described above applies. In particular, if $X \sim \varpi_X$ and $X' \sim P_X(X, \cdot)$, then

$$\|P_X\| = \sup_{f \in L_{0,1}^2(\varpi_X)} |E[f(X') f(X)]|,$$

while if $(X, Y) \sim \varpi$ and $(X', Y') \sim P_{RGS}((X, Y), \cdot)$, then

$$\|P_{RGS}\| = \sup_{f \in L_{0,1}^2(\varpi)} |E[f(X', Y') f(X, Y)]|.$$

Note that despite our use of $\|\cdot\|$ for both operator norms, these are different since they are based on different spaces.

If we can show that $\|P_X\| = 1$ and $\|P_{RGS}\| = 1$, then we will be able to conclude that $\Phi_X$, $\Phi_{DGS}$ and $\Phi_{RGS}$ are subgeometrically ergodic. First, we need convenient characterizations of the operator norms.

###### Lemma 2.

If $(X, Y) \sim \varpi$, then

$$\|P_X\| = 1 - \inf_{f \in L_{0,1}^2(\varpi_X)} E(\mathrm{Var}(f(X) \mid Y))$$

and

$$\|P_{RGS}\| = 1 - \inf_{f \in L_{0,1}^2(\varpi)} \left\{ p\, E(\mathrm{Var}(f(X, Y) \mid Y)) + (1 - p)\, E(\mathrm{Var}(f(X, Y) \mid X)) \right\}.$$
###### Proof.

Suppose $f \in L_{0,1}^2(\varpi_X)$, $(X, Y) \sim \varpi$ and $X' \sim P_X(X, \cdot)$. Then

$$\begin{aligned}
\|P_X\| &= \sup_{f \in L_{0,1}^2(\varpi_X)} |E[f(X') f(X)]| \\
&= \sup_{f \in L_{0,1}^2(\varpi_X)} \mathrm{Var}(E(f(X) \mid Y)) \\
&= 1 - \inf_{f \in L_{0,1}^2(\varpi_X)} E(\mathrm{Var}(f(X) \mid Y)).
\end{aligned}$$

In the above, the second equality follows from Liu, Wong and Kong (1994, Lemma 3.2), and the last equality holds since for $f \in L_{0,1}^2(\varpi_X)$,

$$1 = E(\mathrm{Var}(f(X) \mid Y)) + \mathrm{Var}(E(f(X) \mid Y)).$$

Now consider $P_{RGS}$. Suppose $h \in L_{0,1}^2(\varpi)$, $(X, Y) \sim \varpi$ and $(X', Y') \sim P_{RGS}((X, Y), \cdot)$. Then

$$\begin{aligned}
E[h(X', Y') h(X, Y)] &= \int h(x', y') h(x, y)\, k_{RGS}(x', y' \mid x, y)\, \pi(x, y)\, \mu_X(dx')\, \mu_Y(dy')\, \mu_X(dx)\, \mu_Y(dy) \\
&= p \int h(x', y) h(x, y)\, \pi_{X|Y}(x'|y)\, \pi(x, y)\, \mu_X(dx')\, \mu_X(dx)\, \mu_Y(dy) \\
&\quad + (1 - p) \int h(x, y') h(x, y)\, \pi_{Y|X}(y'|x)\, \pi(x, y)\, \mu_Y(dy')\, \mu_X(dx)\, \mu_Y(dy) \\
&= p \int h(x, y)\, E[h(X', Y) \mid Y = y]\, \pi_{X|Y}(x|y)\, \pi_Y(y)\, \mu_X(dx)\, \mu_Y(dy) \\
&\quad + (1 - p) \int h(x, y)\, E[h(X, Y') \mid X = x]\, \pi_{Y|X}(y|x)\, \pi_X(x)\, \mu_X(dx)\, \mu_Y(dy) \\
&= p \int \left( E[h(X, Y) \mid Y = y] \right)^2 \pi_Y(y)\, \mu_Y(dy) + (1 - p) \int \left( E[h(X, Y) \mid X = x] \right)^2 \pi_X(x)\, \mu_X(dx) \\
&= p\, E\!\left[ \left( E[h(X, Y) \mid Y] \right)^2 \right] + (1 - p)\, E\!\left[ \left( E[h(X, Y) \mid X] \right)^2 \right].
\end{aligned}$$

Now since $E_\varpi h = 0$,

$$\mathrm{Var}(E[h(X, Y) \mid Y]) = E\!\left[ \left( E[h(X, Y) \mid Y] \right)^2 \right]$$

and

$$\mathrm{Var}(E[h(X, Y) \mid X]) = E\!\left[ \left( E[h(X, Y) \mid X] \right)^2 \right].$$

Moreover,

$$1 = \mathrm{Var}_\varpi[h(X, Y)] = \mathrm{Var}(E[h(X, Y) \mid Y]) + E(\mathrm{Var}[h(X, Y) \mid Y])$$

and

$$1 = \mathrm{Var}_\varpi[h(X, Y)] = \mathrm{Var}(E[h(X, Y) \mid X]) + E(\mathrm{Var}[h(X, Y) \mid X]).$$

The result follows easily. ∎

###### Proposition 2.

Suppose there exists a sequence $\{h_i\} \subseteq L_{0,1}^2(\varpi_X)$ such that if $(X, Y) \sim \varpi$, then

$$\liminf_{i \to \infty} E[\mathrm{Var}(h_i(X) \mid Y)] = 0. \tag{8}$$

Then $\|P_X\| = \|P_{RGS}\| = 1$. Hence $\Phi_X$, $\Phi_{DGS}$ and $\Phi_{RGS}$ are subgeometrically ergodic.

###### Proof.

The claim that $\|P_X\| = 1$ follows easily from the first part of Lemma 2. Now consider $P_{RGS}$. Note that if $f(x, y) = h_i(x)$, then $f \in L_{0,1}^2(\varpi)$. From the second part of Lemma 2 we have

$$\|P_{RGS}\| = 1 - \inf_{f \in L_{0,1}^2(\varpi)} \left\{ p\, E(\mathrm{Var}[f(X, Y) \mid Y]) + (1 - p)\, E(\mathrm{Var}[f(X, Y) \mid X]) \right\}.$$

The claim now follows easily since if $f(x, y) = h_i(x)$, then

$$E(\mathrm{Var}[f(X, Y) \mid X]) = E(\mathrm{Var}[h_i(X) \mid X]) = 0$$

and

$$E(\mathrm{Var}[f(X, Y) \mid Y]) = E(\mathrm{Var}[h_i(X) \mid Y]).$$

Thus we conclude that $\Phi_X$ and $\Phi_{RGS}$ are subgeometrically ergodic. Since $\Phi_X$ and $\Phi_{DGS}$ are either both geometrically ergodic or both subgeometric, it follows that $\Phi_{DGS}$ also converges subgeometrically. ∎

## 5 A Discrete Example

We introduce a family of simple discrete distributions which admit usage of the TGS algorithms. We then apply our general results, which allow us to very nearly characterize the members of the family admitting geometrically ergodic TGS Markov chains.

Let $\{a_i\}_{i=1}^\infty$ and $\{b_i\}_{i=1}^\infty$ be strictly positive sequences satisfying

$$\sum_{i=1}^\infty a_i + \sum_{i=1}^\infty b_i = 1.$$

Also, let $b_0 = 0$. Let the family consist of the discrete bivariate distributions having density with respect to counting measure on $\{1, 2, 3, \ldots\} \times \{1, 2, 3, \ldots\}$ given by

$$\pi(x, y) = \begin{cases} a_x & x = y,\ y = 1, 2, 3, \ldots; \\ b_y & x = y + 1,\ y = 1, 2, 3, \ldots; \\ 0 & \text{otherwise.} \end{cases}$$
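For a concrete member of the family (our own assumed choice, not from the paper), take $a_i = \tfrac{2}{3} 2^{-i}$ and $b_i = \tfrac{1}{3} 2^{-i}$, so that $\sum_i a_i + \sum_i b_i = \tfrac{2}{3} + \tfrac{1}{3} = 1$. The density can then be checked numerically with a short Python sketch:

```python
def pi(x, y, a=lambda i: (2 / 3) * 2.0 ** -i, b=lambda i: (1 / 3) * 2.0 ** -i):
    """Density of the discrete family: a_x on the diagonal x = y,
    b_y on the subdiagonal x = y + 1, and zero elsewhere."""
    if x == y:
        return a(x)
    if x == y + 1:
        return b(y)
    return 0.0

# Truncated total mass; the geometric tails make the truncation error tiny
total = sum(pi(x, y) for x in range(1, 60) for y in range(1, 60))
```

The truncated sum agrees with 1 up to the (geometrically small) tail mass beyond the grid.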

Hence the marginals are given by

$$\pi_X(x) = \sum_{y=1}^\infty \pi(x, y) = \sum_{y=1}^\infty \left[ a_x I(x = y) + b_y I(y = x - 1) \right] = a_x + b_{x-1}$$

and

$$\pi_Y(y) = \sum_{x=1}^\infty \pi(x, y) = \sum_{x=1}^\infty \left[ a_x I(x = y) + b_y I(x = y + 1) \right] = a_y + b_y.$$

The full conditionals are easily seen to be

$$\pi_{X|Y}(x|y) = \frac{a_y}{a_y + b_y} I(x = y) + \frac{b_y}{a_y + b_y} I(x = y + 1), \qquad y = 1, 2, 3, \ldots,$$

and

$$\pi_{Y|X}(y|x) = \frac{a_x}{a_x + b_{x-1}} I(y = x) + \frac{b_{x-1}}{a_x + b_{x-1}} I(y = x - 1), \qquad x = 1, 2, 3, \ldots.$$
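Sampling these two-point conditionals is straightforward. The following Python sketch is our own illustration (names are ours; `a` and `b` are callables returning $a_i$ and $b_i$, with the convention $b_0 = 0$):

```python
import random

def make_conditionals(a, b):
    """Samplers for the two-point conditionals of the discrete family.
    a(i) and b(i) return the sequence values for i = 1, 2, ...; the
    convention b_0 = 0 forces Y = 1 whenever X = 1."""
    def draw_x_given_y(y, unif=random.random):
        # X | Y=y is y with prob. a_y / (a_y + b_y), else y + 1
        return y if unif() < a(y) / (a(y) + b(y)) else y + 1

    def draw_y_given_x(x, unif=random.random):
        b_prev = b(x - 1) if x >= 2 else 0.0  # b_0 = 0
        # Y | X=x is x with prob. a_x / (a_x + b_{x-1}), else x - 1
        return x if unif() < a(x) / (a(x) + b_prev) else x - 1

    return draw_x_given_y, draw_y_given_x

# Example sequences (only ratios matter for the conditional probabilities)
dx, dy = make_conditionals(lambda i: 0.25 ** i, lambda i: 0.5 * 0.25 ** i)
```

Since the conditional probabilities involve only ratios, the samplers work even if $\{a_i\}$ and $\{b_i\}$ are left unnormalized.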

Define

$$p_x = \frac{a_x b_x}{(a_x + b_{x-1})(a_x + b_x)} \quad \text{and} \quad q_x = \frac{a_{x-1} b_{x-1}}{(a_x + b_{x-1})(a_{x-1} + b_{x-1})}.$$

Then for the DGS,

$$k_{DGS}(x', y' \mid x, y) = \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y'|x'),$$

and hence for the marginal chain $\Phi_X$,

$$k_X(x'|x) = \sum_{y=1}^\infty \pi_{X|Y}(x'|y)\, \pi_{Y|X}(y|x) = \begin{cases} 1 - p_1 & x' = x = 1; \\ p_x & x' = x + 1,\ x \ge 1; \\ q_x & x' = x - 1,\ x \ge 2; \\ 1 - p_x - q_x & x' = x,\ x \ge 2; \\ 0 & \text{otherwise.} \end{cases}$$
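The birth and death structure of $k_X$ can be checked numerically. The helper below is our own Python sketch of the transition probabilities built from $p_x$ and $q_x$; each row should sum to one over $\{x - 1, x, x + 1\}$, and since only ratios enter, unnormalized sequences suffice:

```python
def k_x(a, b, x, xp):
    """Transition probability k_X(x'|x) of the X-marginal chain, a birth
    and death chain that moves at most one step per iteration."""
    b_ = lambda i: b(i) if i >= 1 else 0.0  # convention b_0 = 0
    p_x = a(x) * b(x) / ((a(x) + b_(x - 1)) * (a(x) + b(x)))
    q_x = (a(x - 1) * b_(x - 1)
           / ((a(x) + b_(x - 1)) * (a(x - 1) + b_(x - 1)))) if x >= 2 else 0.0
    if xp == x + 1:
        return p_x
    if xp == x - 1 and x >= 2:
        return q_x
    if xp == x:
        return 1.0 - p_x - q_x
    return 0.0

# Row of the kernel at x = 3 for example sequences a_i = 4^{-i}, b_i = 4^{-i}/2
a_seq = lambda i: 0.25 ** i
b_seq = lambda i: 0.5 * 0.25 ** i
row_3 = [k_x(a_seq, b_seq, 3, xp) for xp in (2, 3, 4)]
```

At $x = 1$ the chain cannot move down, so the row there is supported on $\{1, 2\}$ only.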

It is easy to see that the kernel $P_X$ is Feller. If $p \in (0, 1)$, then for the RGS we have

$$k_{RGS}(x', y' \mid x, y) = p\, \pi_{X|Y}(x'|y)\, \delta(y' - y) + (1 - p)\, \pi_{Y|X}(y'|x)\, \delta(x' - x).$$

Since for any open set $O$,

$$P_{RGS}((x, y), O) = p \sum_{x'=1}^\infty \pi_{X|Y}(x'|y)\, I((x', y) \in O) + (1 - p) \sum_{y'=1}^\infty \pi_{Y|X}(y'|x)\, I((x, y') \in O),$$

it is easy to see that $P_{RGS}(\cdot, O)$ is lower semicontinuous and hence $P_{RGS}$ is Feller.

We are now in a position to establish sufficient conditions for the geometric ergodicity of $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$.

###### Lemma 3.

If

$$\limsup_{x \to \infty} \frac{p_x}{q_x} < 1 \quad \text{and} \quad \liminf_{x \to \infty} q_x > 0, \tag{9}$$

then $\Phi_{DGS}$, $\Phi_{RGS}$ and $\Phi_X$ are geometrically ergodic.

###### Proof.

We need only verify the conditions of Proposition 1, and we have already seen that both $P_X$ and $P_{RGS}$ are Feller. Set $V(x) = z^x$ for some $z > 1$ which will be determined later, and note that

$$G(y) = \sum_{x=1}^\infty V(x)\, \pi_{X|Y}(x|y) = \left( \frac{a_y + z b_y}{a_y + b_y} \right) z^y.$$

For any $d > 0$, the sublevel set $\{x : V(x) \le d\}$ is bounded, and since $V$ is a continuous function the set is also closed, hence compact. Therefore $V$ is unbounded off compact sets on $X$. Similarly, for any $d > 0$ the sublevel set $\{y : G(y) \le d\}$ is bounded. Then for any $c > 0$, $W = V + cG$ is unbounded off compact sets on $X \times Y$ because for any $d > 0$, $\{(x, y) : W(x, y) \le d\}$ is bounded and closed, hence compact. Now, all that remains is to construct a drift condition for $P_X$. Note that for $x \ge 2$,

$$P_X V(x) = \sum_{x'=1}^\infty z^{x'} k_X(x'|x) = p_x z^{x+1} + q_x z$$