1 Introduction
###### Abstract

This paper considers a new approach to using Markov chain Monte Carlo (MCMC) in contexts where one may adopt multilevel (ML) Monte Carlo. The underlying problem is to approximate expectations w.r.t. an underlying probability measure that is associated to a continuum problem, such as a continuous-time stochastic process. It is then assumed that the associated probability measure can only be used (e.g. sampled) under a discretized approximation. In such scenarios, it is known that to achieve a target error, the computational effort can be reduced when using MLMC relative to exact sampling from the most accurate discretized probability. The ideas rely upon introducing hierarchies of the discretizations where less accurate approximations cost less to compute, and using an appropriate collapsing sum expression for the target expectation. If a suitable coupling of the probability measures in the hierarchy is achieved, then a reduction in cost is possible. This article focused on the case where exact sampling from such coupling is not possible. We show that one can construct suitably coupled MCMC kernels when given only access to MCMC kernels which are invariant with respect to each discretized probability measure. We prove, under assumptions, that this coupled MCMC approach in a ML context can reduce the cost to achieve a given error, relative to exact sampling. Our approach is illustrated on a numerical example.
Key words: Multilevel Monte Carlo, Markov chain Monte Carlo, Bayesian Inverse Problems

Markov chain Simulation for Multilevel Monte Carlo

BY AJAY JASRA, KODY LAW, & YAXIAN XU

Department of Statistics & Applied Probability, National University of Singapore, Singapore, 117546, SG. E-Mail: staja@nus.edu.sg, a0078115@u.nus.edu

School of Mathematics, University of Manchester, Manchester, M13 9PL, UK. E-Mail: kodylaw@gmail.com

## 1 Introduction

Consider a probability measure on a measurable space . For a collection of integrable and measurable functions , we are interested in computing expectations:

 π(φ):=∫Xφ(x)π(dx).

It is assumed that the exact value of is not available analytically and must be approximated numerically: one such approach is the Monte Carlo method, which is focused upon in this article.

There is an additional complication in the context of this paper; we assume that the probability measure is not available, even for simulation, and must be approximated. More precisely, we assume that is associated to a continuum problem, such as Bayesian inverse problems (e.g. [18, 29]), where one cannot evaluate the target exactly. As an example, may be associated to the solution of a partial differential equation (PDE) which needs to be approximated numerically. Such contexts arise in a wide variety of real applications; see [25, 29, 27] and the references therein. Thus, we assume that there exists a discretized approximation of say on , where is a potentially vector-valued parameter which controls the quality of the approximation and also associates to the cost of computations w.r.t. . In other words, suppose and that for integrable

 limL→∞|πL(φ)−π(φ)|=0

where and the cost associated to computing also grows (without bound) with . Examples of practical models with such properties can be found in [4, 19].

In the context outlined above, if exact sampling from is possible, then one solution to approximating is the use of the Monte Carlo method, by sampling i.i.d. from and using the approximation , where . It is well known that such a method can be improved upon using multilevel [12, 13, 16] or multi-index [15] Monte Carlo (MIMC) methods; we focus upon the former in this introduction, but note that our subsequent remarks broadly apply to the latter. The basic notion of the MLMC method is to introduce a collapsing sum representation

 πL(φ)=π0(φ)+L∑l=1[πl−πl−1](φ)

where is a hierarchy of probability measures on , which are approximations of , starting with the very inaccurate and computationally cheap and up to the most precise and most computationally expensive and is assumed to be integrable w.r.t. each of the probability measures. The idea is then, if one can sample exactly and a sequence of (dependent) couplings of exactly, then it is possible to reduce the cost, relative to i.i.d. sampling from to achieve a prespecified mean square error (MSE). That is, writing as an expectation w.r.t. the algorithm which approximates , with an estimate , the MSE is . The key to the cost reduction is sampling from a coupling of which is ‘good enough’; see [12, 13] for details.

In this article, we focus on the scenario where one can only hope to sample using some Markov chain method such as MCMC. In such scenarios it can be non-trivial to sample from good couplings of . There has been substantial work on this, including MCMC [11, 19, 23, 24] and sequential Monte Carlo (SMC) [4, 5, 9, 22]; see [21] for a review of these ideas. In the context of interest, most of the methods used [4, 5, 9, 19, 23, 24] rely upon replacing, to some extent, coupling with importance sampling (the only exception to our knowledge is [11]) and then performing the relevant sampling from some appropriate sequence of change of measures using MCMC or SMC. These procedures often require some ‘good’ change of measures, just as good couplings are required in MLMC. In the MIMCMC case, no exact proof of the improvement brought about by using MIMCMC is given ([24] only provide a proof for a simplified identity).

The main motivation of the methodology we introduce is to provide an approach which only requires a simple (and reasonable) MCMC algorithm to sample . We focus on approximating independently for each . Our approach does not seem to be used in the ML or MI literature: representing the Markov chain transitions invariant w.r.t.  and as an iterated map, one can simply couple the simulation of simple random variables that are used commonly in the sampling of the iterated map for and respectively (this is defined explicitly in Section 2). It is straightforward to establish that such an approach can provide consistent estimates of . We also remark that many well known Markov transitions such as Metropolis-Hastings or deterministic scan Gibbs samplers can be represented as an iterated map. The main issue is to establish that such a coupled approximation can be useful, in the sense that there is a reduction in cost, relative to MCMC from , to achieve a prespecified MSE. We show that under appropriate assumptions, that this can indeed be the case. The approach discussed in this paper is generally best used in the case where the MCMC kernel is rejection free, such as a Gibbs sampler or some non-reversible MCMC algorithms [7]. The basic idea outlined here can also be easily extended to the context where MIMC can be beneficial, but both the implementation and mathematical analysis are left to future work.

This article is structured as follows. In Section 2 we describe our method. In Section 3 we give our mathematical results. In Section 4 we provide a numerical examples illustrating our method. A summary is provided in Section 5. Mathematical results are given in appendices A and B.

## 2 Methodology

### 2.1 Notations

Let be a measurable space. For we write and as the collection of bounded measurable and Lipschitz functions respectively. For , we write the supremum norm . For , . For , we write the Lipschitz constant , where is used to denote the norm. (resp. ) denotes the collection of probability measures (resp. finite measures) on . For a measure on and a , the notation is used. Let be a Markov kernel (e.g. [26] for a definition) and and , then we use the notations and for , For a Markov kernel we write the iterates

 Kn(x0,dxn)=∫XKn−1(x0,dxn−1)K(xn−1,dxn)

with . For , the total variation distance is written . For the indicator is written . For a sequence and for , we use the compact notation , with the convention that if the resulting vector of objects is null.

### 2.2 Set-Up

We begin by concentrating upon a sequence , with , where can potentially increase, but is taken as fixed. It is assumed that there is a limiting , in the sense that for any

 liml→+∞|πl(φ)−π(φ)|=0.

As discussed in the introduction, our objective is to approximate the ML identity:

 πL(φ)=π0(φ)+L∑l=1[πl−πl−1](φ). (1)

We assume that each is associated to a scalar parameter , with , which represents the quality of the approximation w.r.t.  and the associated cost. Approximation of (1) can be peformed via Monte Carlo intergration, by sampling from dependent couplings of the pairs , independently for , and i.i.d. sampling from . We are focussed upon the scenario where exact sampling even from a given is not currently possible, but can be achieved by sampling a invariant Markov kernel, as in MCMC.

### 2.3 Approach

#### 2.3.1 Multilevel Approach

Let be given. Let be a finite-dimensional measuable space, a random variable on with probability and let , such that the map is jointly measurable on the product algebra . Consider the discrete-time Markov chain, for given, :

 Xn=ξl(Xn−1,Un)

where is a sequence of i.i.d. random variables with probability . Denote the associated Markov kernel . We explicitly assume that is constructed such that admits as an invariant measure; examples are given in Section 2.3.2. Under appropriate assumptions on (e.g. [10, 20]) or (e.g. [26]), one can show that will converge to zero as grows. In addition, under appropriate assumptions, for ,

 1NN∑n=1φ(Xn)

will converge almost surely to .

As noted, we are interested in the approximation of , as the approximation of is possible using the above discussion. The approach that is proposed in this article is as follows. Let be given and set . Generate the Markov chain for

 ¯¯¯¯¯Xn(l)=ξl(¯¯¯¯¯Xn−1(l),Un(l))andX––n(l)=ξl−1(X––n−1(l),Un(l))

with the notation . Note that the sequence of random numbers, , are the same for the recursion of and . The Markov kernel associated to the Markov chain is denoted . Then if this scheme is repeated independently for each and one samples the Markov chain using the Markov kernel we have the following estimate of (1)

 1N0N0∑n=1φ(Xn(0))+L∑l=11NlNl∑n=1[φ(¯¯¯¯¯Xn(l))−φ(X––n(l))].

Under appropriate assumptions on or , one can easily prove that the above estimate is consistent. The question is whether this can improve upon sampling from using (in the sense discussed in the introduction), in some scenarios of practical interest. This point is considered in Section 3.

A strategy for coupling MCMC is considered in [1], except when trying to perform unbiased estimation using the approach in [28]. That idea is different to the one presented in this article, and it is not analyzed in the context of variances, as is the case in this paper.

#### 2.3.2 Examples of ξl

There are numerous examples of mappings which fall into the framework of this article. Let us suppose that for any we have

 πl(dx)=πl(x)ν(dx)

with and a non-negative function that is known up-to a constant.

A wide class of Metropolis-Hastings kernels are one such example, which is taken from the description in [8]. Let be the current state of the Markov chain. A proposal is generated. This may be for instance a (one-dimensional) Gaussian random walk, where , , and is the Gaussian distribution of mean zero and variance . Assuming that the proposal can be written

 Ql(x,dy)=ql(x,y)ν(dy)

for some non-negative function that is known, then we have the well-defined acceptance probability

 al(x,y)=⎧⎨⎩1∧πl(y)ql(y,x)πl(x)ql(x,y)if\leavevmode\nobreak (x,y)∈Rl0otherwise

where . Then, if where is the uniform distribution on , (so here )

 ξl(xn,u1:2)={ψl(xn,u1)%if\leavevmode\nobreak u2

[8] also demonstrate that the deterministic scan Gibbs sampler falls into our framework (in fact one can establish that the random scan can also do so).

###### Remark 2.1.

The approach detailed here assumes that the . However, this need not be the case. For instance it is straightforward to extend to the case where and . Here one would simply use the same random numbers in the mappings , on the common parts of the space. For example, if , then in the random walk example above, one simply uses the same Gaussian perturbation in the proposal on the space and one additional Gaussian must be sampled for the level . The same uniform is used in the accept/reject step. We note that the theory in Section 3 could be extended to the scenario we mentioned, with similar, but more complicated, arguments.

## 3 Theoretical Results

### 3.1 Assumptions

Throughout is compact. Recall that .

• There exist and such that for any

 supl≥0supx∈X∥Knl(x,˙)−πl(⋅)∥tv≤Cρn.
• There exist a such that for any , ,

 |Kl(φ)(x)−Kl(φ)(y)|≤C(∥φ∥∨∥φ∥Lip)|x−y|.
• There exist a and such that for any ,

• There exist a such that for any ,

 ∫U|ξl(x,u)−ξl(y,u)|2μ(du)≤τ|x−y|2.
• There exist a and such that for any ,

 ∫U|ξl(x,u)−ξl−1(x,u)|2μ(du)≤Chβl.

These assumptions are quite strong, but may be verified in practice. As is compact (A3.1) can be verified for a wide class of Markov kernels, see for instance [26]. (A3.1-3.1) relate to the continuity properties of the kernel and the underlying ML problem. In particular (A3.1) 2. states that moves under pairs of kernels at consecutive levels, stay close on average, given that they are initialised at the same point. (A3.1) is an assumption that has been considered by [20] in a slightly more general context and relates to contractive properties of the iterated map. (A3.1) is a map analogue of (A3.1) 2.. One expects that such properties are needed, in order for the coupling approach here to work well. The compactness and uniform ergodicity assumptions could be weakened to say geometric or polynomial type ergodicity and non-compact spaces with longer, but similar, arguments (see e.g. [3]).

In general for rejection type Markov chains, such as Metropolis-Hastings kernels, (A3.1), (A3.1) 2. and (A3.1) may be in conflict. (A3.1) 2. and (A3.1) suggest that the moves of the kernels at consecutive levels as well as the coupling are sufficiently close and good respectively. However, (A3.1) demands a reasonably fast convergence rate that is independent of the level. This latter assumption can be made dependent, but the key point is that the mixing rate should not go to zero with . Conversely, for (A3.1) 2. and (A3.1), one needs that if one level accepts and the other rejects, that the resulting positions of the chains are sufficiently close, as a function of . For instance for (A3.1) 2. with Gaussian random walk proposals with different scalings at consecutive levels, one will typically need a scale of , which will likely mean a reduction in the rate of convergence and possibly negate the advantage of a multilevel method. This is rather unsurprising, in that one wishes to keep consecutive levels close via coupling.

The main message is then: for rejection free algorithms such as the Gibbs sampler or some non-reversible MCMC kernels one should simply check if the couplings are ‘good enough’ in the sense of (A3.1) (one suspects that (A3.1) could be significantly weakened). Then one expects ML improvements. For MCMC methods with rejection, the same must be done, but, it will not be clear (without mathematical calculations or numerical trials) that this method of implementing the coupling of MCMC will yield reductions in computational effort for a given level of MSE.

### 3.2 Main Result

The main idea is to consider

 E[(1N0N0∑n=1[φ(Xn(0))−π0(φ)]+L∑l=11NlNl∑n=1{[φ(¯¯¯¯¯Xn(l))−φ(X––n(l))]−[πl−πl−1](φ)}−
 [πL−π](φ))2].

This error is clearly equal to

 E[(1N0N0∑n=1[φ(Xn(0))−π0(φ)])2]+L∑l=1E[(1NlNl∑n=1{[φ(¯¯¯¯¯Xn(l))−φ(X––n(l))]−[πl−πl−1](φ)})2]+
 L∑l≠q=1E[flfq]+[πL−π](φ)2,

where

 f0:=1N0Nl∑n=1[φ(Xn(0))−π0(φ)], (2)

and for ,

 fj:=1NjNj∑n=1{[φ(¯¯¯¯¯Xn(j))−φ(X––n(j))]−[πj−πj−1](φ)}.

The terms (2) and

 E[(1N0N0∑n=1[φ(Xn(0))−π0(φ)])2]

can be treated by standard Markov chain theory for in an appropriate class and under our assumptions. That is, under (A3.1), one can easily prove that there exist a such that for any ,

 ∣∣E[(1N0Nl∑n=1[φ(Xn(0))−π0(φ)])]∣∣≤C(∥φ∥∨∥φ∥{Lip})N0

and

 E[(1N0N0∑n=1[φ(Xn(0))−π0(φ)])2]≤C(∥φ∥∨∥φ∥{Lip})2N0

For the remaining terms, we have the following results, whose proofs can be found in Appendix A.

###### Proposition 3.1.

Assume (A3.1,3.1). Then there exist a such that for any , , :

 ∣∣E[(1NlNl∑n=1{[φ(¯¯¯¯¯Xn(l))−φ(X––n(l))]−[πl−πl−1](φ)})]∣∣≤C(∥φ∥∨∥φ∥%\emphLip)hβlNl.
###### Theorem 3.1.

Assume (A3.1-3.1). Then there exist a such that for any , , :

###### Remark 3.1.

The case where depends upon , i.e.  could also be treated using the approach in Appendix A. This would require some additional calculations, but we believe a similar result to Theorem 3.1 would hold on some minor additional assumptions on .

### 3.3 Verifying the Assumptions

We consider a deterministic scan Gibbs sampler. We set , and remark that need not be a one-dimensional space (but must be compact). We set for

 πl(dx1:k)=πl(x1:k)ν(dx1:k)

where in an abuse of notation, we use to denote the density and measure, and , is the dominating measure, typically Lebesgue or counting. Then for we set

 Kl(x1:k,dx′1:k)=(k∏i=1πl(x′i|x′1:i−1,xi+1:k))ν(dx′1:k) (3)

where for

 πl(x′i|x′1:i−1,xi+1:k)=πl(x′1:i,xi+1:k)∫Xiπl(x′1:i,xi+1:k)νi(dx′i).

We further suppose that can be sampled by , () and using the transformation

 Tl,i([x′1:i−1,xi+1:k],Ui) (4)

for each .

1. There exists such that for any ,

 C––≤πl(x1:k)≤¯¯¯¯C.
2. There exists such that for any ,

 |πl(x1:k)−πl(y1:k)|≤C|x1:k−y1:k|.
3. There exists such that for any ,

 |πl(x1:k)−πl−1(x1:k)|≤Chβl.
4. There exists such that for any , , ,

 |Tl,i([x1:k],ui)−Tl,i([y1:k],ui)|2≤τ|x1:k−y1:k|2.
5. There exists such that for any , ,

 |Tl,i([x1:k],ui)−Tl−1,i([x1:k],ui)|2≤Chβl.
###### Proposition 3.2.

Assume (B3.3). Then the target and deterministic Gibbs sampler corresponding to (3) which is sampled via (4), satisfies (A3.1-3.1).

The proof is given in Appendix B.

## 4 Numerical Experiment

### 4.1 Model and MCMC

We consider a slightly modified model in [2]. Data , are such that for ,

 Yi|ui\lx@stackrelind∼N(ui,λ−1).

is assumed to be known. The prior model on the are associated to an unknown hyper-parameter and we assume a joint prior for any

 p(u1:k,δ)=exp{−δ2k∑j=1j−3u2j}δα0−1exp{−δκ0},

which is improper if . It is easily checked that, for any , the posterior on is proper provided .

Set , where ; for some , we consider a sequence of posteriors for on (the data are simulated from the true model). One can construct a Gibbs sampler as in [2] which has full conditional densities:

 u1:Kl|y1:Kl,δ ∼ NKl(ml(δ),Cl(δ)) δ|y1:Kl,u1:Kl ∼ G(α0,κl)

where is a dimensional Gaussian distribution of mean vector and covariance matrix , , , is a Gamma distribution of mean () and .

The Gibbs sampler is easily coupled when considering levels and , . Denote by the lower triangular Cholesky factor of . When considering the update at level , , one samples a ( is the identity matrix) and sets

 ¯¯¯¯U(l)1:Kl=ml(δ)+Ll(δ)V1:Kl

and

 U––(l)1:Kl−1=ml−1(δ)+Ll−1(δ)V1:Kl−1.

Note that the value of will typically be different between levels and . For the update on , one can sample and set and .

###### Remark 4.1.

The paper [2] focuses on improving the algorithm described here by a reformulation of the problem which requires an accept/reject step. We do not consider the improved sampling algorithm, as the objective here is to illustrate that the method developed in this paper works for rejection-free MCMC.

### 4.2 Simulation Results

In our experiments , , and . We consider the posterior expectation of the function . This expectation appears to converge to a limit as grows. We first run an experiment to determine the (as in Theorem 3.1) and order of bias of the approximation (expected to be for some ). The true value is computed by using the algorithm at one plus the most precise level (i.e. ). The results are presented in Figure 1, which suggest that and . The cost of the algorithm at level is . Using the standard ML theory (e.g. [12]), for given we set and . In Figure 2 we can see the cost against MSE for the MLMCMC procedure, versus the MCMC at level with samples. The improvement is clear.

## 5 Summary

In this article we have considered a new Markov chain simulation approach to implementing MLMC. The main utility to using this idea, is that one needs only a standard MCMC procedure for sampling . Then the additional implementation is straightforward and can sometimes drastically reduce the cost to achieve a given level of MSE. As we have remarked, one does require some properties of the Markov kernel simulated, in order for the strategy that we have suggested to work well in practice. In general, we believe it is of most use if the Markov chain method is rejection free. In contexts where this cannot be achieved, we believe a more intricate coupling is required (see e.g. [6, 17]).

#### Acknowledgements

AJ was supported by an AcRF tier 2 grant: R-155-000-161-112. AJ is affiliated with the Risk Management Institute, the Center for Quantitative Finance and the OR & Analytics cluster at NUS. AJ was supported by a KAUST CRG4 grant ref: 2584. KJHL was supported by the University of Manchester School of Mathematics.

## Appendix A Variance Results

In order to prove Proposition 3.1 and Theorem 3.1 we will make use of the (additive) Poisson equation (e.g. [14]). That is, let then given our Markov kernels we say that solves the Poisson equation if

 φ(x)−πl(φ)=ˆφl(x)−Kl(ˆφl)(x).

If (A3.1) holds, then a solution is

 ˆφl(x)=∑n≥0[Knl(φ)(x)−πl(φ)]

which is the one that is considered here.

Our proof consists of several lemmata, followed by the proofs of Proposition 3.1 and Theorem 3.1. Throughout is a finite constant which may change from line-to-line, but will not depend on any important parameters (e.g. , ) unless stated.

###### Lemma A.1.

Assume (A3.1-3.1). Then there exists a , such that for any , we have:

 |ˆφl(x)−ˆφl(y)|≤C(∥φ∥∨∥φ∥\emph{Lip})|x−y|.
###### Proof.

We have

 ˆφl(x)−ˆφl(y)=φ(x)−φ(y)+Kl(φ)(x)−Kl(φ)(y).

The proof follows then by the triangular inequality and (A3.1). ∎

###### Lemma A.2.

Assume (A3.1,3.1). Then there exists a , such that for any , we have:

 supx∈X|ˆφl(x)−ˆφl−1(x)|≤C∥φ∥hβl.
###### Proof.

By [3, Proposition C.2] we have

 ˆφl(x)−ˆφl−1(x)=
 ∑n≥1[n−1∑i=0[Kil−πl]{[Kl−Kl−1]([Kn−i−1l−1−πl−1](φ))}(x)−[πl−πl−1]{[Knl−1−πl−1](φ)}].

We consider the two terms in the difference of the summand individually.

Term: . Now

 [Kl−Kl−1]([Kn−i−1l−1−πl−1](φ))(x)=
 Osc(φ)∥Kn−i−1l−1−πl−1∥tv[[Kl−Kl−1]([Kn−i−1l−1−πl−1](φOsc(φ))∥Kn−i−1l−1−πl−1∥tv)(x)].

By (A3.1) and

 [Kn−i−1l−1−πl−1](φOsc(φ))∥Kn−i−1l−1−πl−1∥tv

is bounded, so we have by (A3.1)

 ∥[Kl−Kl−1]([Kn−i−1l−1−πl−1](φ))∥≤C∥φ∥ρn−i−1hβl (5)

for some that does not depend upon . Now

 [Kil−πl]{[Kl−Kl−1]([Kn−i−1l−1−πl−1](φ))}(x)=
 Osc([Kl−Kl−1]([Kn−i−1l−1−πl−1](φ)))[Kil−πl]{[Kl−Kl−1]([Kn−i−1l−1−πl−1](φ))Osc([Kl−Kl−1]([Kn−i−1l−1−π