Markov Interacting Importance Samplers

Markov Interacting Importance Samplers

Eduardo F. Mendes    Marcel Scharth    Robert Kohn
June 30, 2019
Abstract

We introduce a new Markov chain Monte Carlo (MCMC) sampler called the Markov Interacting Importance Sampler (MIIS). The MIIS sampler uses conditional importance sampling (IS) approximations to jointly sample the current state of the Markov Chain and estimate conditional expectations, possibly by incorporating a full range of variance reduction techniques. We compute Rao-Blackwellized estimates based on the conditional expectations to construct control variates for estimating expectations under the target distribution. The control variates are particularly efficient when there are substantial correlations between the variables in the target distribution, a challenging setting for MCMC. An important motivating application of MIIS occurs when the exact Gibbs sampler is not available because it is infeasible to directly simulate from the conditional distributions. In this case the MIIS method can be more efficient than a Metropolis-within-Gibbs approach. We also introduce the MIIS random walk algorithm, designed to accelerate convergence and improve upon the computational efficiency of standard random walk samplers. Simulated and empirical illustrations for Bayesian analysis show that the method significantly reduces the variance of Monte Carlo estimates compared to standard MCMC approaches, at equivalent implementation and computational effort.

Keywords: Bayesian inference; Control variate; Mixed Logit; PMCMC; Markov Modulated Poisson Process; Rao-Blackwellization; Variance reduction.

1 Introduction

This paper introduces Markov interacting importance samplers (MIIS), a general Markov Chain Monte Carlo (MCMC) algorithm that iterates by sampling the current state from a conditional importance sampling approximation to a target distribution. An importance sampling (IS) approximation consists of a set of weighted samples from a proposal distribution that approximates the target. Markov interacting importance samplers are conditional in the sense that the importance distribution may depend on the previous state of the Markov chain. The marginal distribution of the states converges to the target distribution for any number of importance samples at each iteration of the Markov chain; the algorithm does not induce an approximation error.

We adopt importance sampling as a basic tool from the perspective that it can be more efficient than a Metropolis-Hastings sampler based on an identical proposal. Importance sampling naturally incorporates the information from all generated samples, while standard Metropolis-Hastings estimates lose information from rejected draws. In addition, importance sampling estimates are based on independent samples and as a consequence the method is immediately amenable to a range of variance reduction techniques (such as antithetic sampling and stratified mixture sampling), as well as convenient to implement and parallelize. It is not standard practice in applied work to incorporate these features into Metropolis-Hastings approaches as they are more challenging to design and use efficiently in an MCMC framework. See for example Craiu and Lemieux (2007), Hammer and Tjelmeland (2008), Jacob et al. (2011), and Dellaportas and Kontoyiannis (2012).

Importance sampling can be efficient when we are able to construct numerically accurate and computationally fast approximations to a full target distribution. Richard and Zhang (2007), Hoogerheide et al. (2012) and Li et al. (2013) are recent contributions in this area that have led to the application of IS to challenging problems: see for example Liesenfeld et al. (2013) and Tran et al. (2014). We motivate MIIS by observing that even if the joint target density is intractable by global approximation, we can frequently obtain efficient importance samplers for the conditional distributions. MCMC methods provide a natural way of handling large dimensional problems by sampling from conditional distributions (Gibbs sampling) or by generating samples from complex target densities through local exploration. The MIIS algorithm leverages the advantages of importance sampling in this setting.

As a leading application, we consider the case in which it is not possible to implement an exact Gibbs sampler due to infeasibility of direct simulation from the conditional distributions. The MIIS method relies on IS approximations of the conditional distributions to sample the current state of the Markov Chain. The advantage of importance sampling is that we can additionally use the approximation (that is, all the generated samples) to estimate conditional expectations, possibly by incorporating the full range of variance reduction methods available for standard importance sampling. We compute Rao-Blackwellized estimates based on the conditional expectations to construct control variates for estimating expectations under the target distribution. The control variates are particularly effective when there are substantial correlations between the variables in the target distribution. This is a challenging setting for standard MCMC approaches because the conditioning scheme may imply strong serial correlation in the Markov chain.

We introduce the general MIIS algorithm and present four examples that demonstrate its flexibility. The first two examples present the implementation of MIIS based on simple importance sampling targeting the full and conditional distributions. We derive conditions for the ergodicity and uniform ergodicity of the sampler. The third example introduces antithetic variables and is also uniformly ergodic under general conditions. The final example introduces the MIIS random walk algorithm, designed to accelerate convergence and improve upon the computational efficiency of standard random walk samplers. The random walk sampler is uniformly ergodic assuming that the importance weights are bounded. Ergodicity holds under milder constraints.

Our method relates to the Particle Gibbs (PG) algorithm developed for Bayesian inference in general state space models by Andrieu et al. (2010). The PG algorithm iteratively draws the latent state trajectories from its high-dimensional smoothing distribution using a particle filter approximation, and the parameters of the model from their conditionals given the state trajectories. Lindsten and Schön (2012), Lindsten et al. (2014b), Mendes et al. (2014) and Carter et al. (2014) present extensions, while Chopin and Singh (2013), Andrieu et al. (2013) and Lindsten et al. (2014a) study the theoretical aspects of the algorithm. We can show that the particle Gibbs algorithm is a particular type of MIIS. Compared to PG, the MIIS algorithm addresses a wider class of sampling problems and the use of variance reduction methods.

We illustrate Markov interacting importance samplers in a range of examples. We consider the estimation of the posterior mean for a Bayesian Mixed Logit model using the health dataset studied by Fiebig et al. (2010). The presence of unobserved heterogeneous preferences in this discrete choice model motivates the use of MCMC methods that iteratively sample the model parameters and the latent choice attribute weights conditional on each other. The results show that the MIIS algorithm with control variates increases efficiency in mean squared error by a factor of four to twenty compared to the Metropolis-within-Gibbs algorithm, which is a standard tool for problems that are not amenable to exact Gibbs sampling. We also implement the MIIS random walk importance sampler for carrying out posterior inference for Markov modulated Poisson processes, a problem considered for example by Fearnhead and Sherlock (2006). Our analysis reveals four to hundredfold gains in efficiency over the standard random walk Metropolis algorithm and the multiple-try Metropolis algorithm of Liu et al. (2000). In this context, the improvements are mainly due to parallelization and better convergence of the Markov chain.

2 Markov Interacting Importance Samplers

To focus on the main ideas, we use densities in our mathematical discussion up to Section 6. We assume that the densities are defined with respect to measures that we leave unspecified for now. We provide a more precise treatment in Section 7 and the appendix.

2.1 Notation and basic definitions

This subsection presents some of the notation used in the article. We define the basic random variables on a set that is a subset of Euclidean space. Suppose that is a real function with . We take any density on to be with respect to some measure on , which we denote as . We define the expected value of with respect to the density as

 Eν(f) :=∫f(x)ν(x)dx (1)

provided the integral exists.

In our article, is the target density. We often can evaluate only up to a constant of proportionality , with , where is the normalizing constant. Suppose that . Then, for , we define , and .

2.2 Conditional Importance Sampler

This section introduces the conditional importance sampler (CIS) which is the basic building block of the MCMC algorithms in this article. The CIS is motivated by the question: “how to implement an importance sampler approximation to that provides unbiased samples?”The CIS is our solution to this problem. We go beyond simple importance sampler and construct a general framework that not only covers the simple importance sampling approximation with variance reduction techniques, but also extends the basic importance sampling paradigm, allowing local exploration of the target inside an MCMC setting, for instance, by using a random-walk approach.

At each iterate of an MCMC algorithm, the CIS constructs an empirical approximation to the target density . It generates an auxiliary variable and particles conditional on the previous iterate , in such a way that one particle is generated through a Markov transition kernel and the other particles are generated conditional on .

We now present a more precise description of the CIS. Let be the conditional density of the auxiliary variable , with , and take so that . Let be the density of a Markov transition kernel from to , conditional on , that is reversible with respect to ; i.e., , or equivalently,

 π(y)η(ξ|y)T(y,x;ξ) =π(x)η(ξ|x)T(x,y;ξ). (2)

Given , let be a joint importance distribution with marginals (). For any , define the conditional density

 q∖k(x∖k|xk,ξ):=q(x1:N|ξ)qk(xk|ξ). (3)
Definition 1 (Conditional Importance Sampler).

For any given and , the Conditional Importance Sampler generates from the probability distribution

 ΓN(x1:N,ξ|y,k):=η(ξ|y)T(y,xk;ξ)q∖k(x∖k|xk,ξ). (4)

The auxiliary variable introduces dependence in the importance sampling approximation. Moreover, we can often choose the auxiliary density so that is bounded. For instance, the random-walk importance sampling algorithm chooses . The weights are , which are bounded if is bounded. The dependence on can be easily dropped if one takes and each . The Markov transition kernel can be taken as the identity kernel, i.e., , which is our choice in Sections 3 and 6. A Metropolis-Hastings kernel targeting is also a valid choice.

The CIS generates using the following algorithm.

Algorithm 1 (Conditional Importance Sampler).

Given ,

1. sample ;

2. sample ; i.e., generate the particle using the Markov kernel.

3. sample ; i.e., generate all the remaining particles conditional on and the propagated particle .

From the output of the Conditional Importance Sampler we define the weights for

 Wi(x1:N;ξ):=wi(xi;ξ)∑Nj=1wj(xj;ξ)wherewi(x;ξ):=m(x)qi(x|ξ)η(ξ|x) (5)

and let be the empirical approximation to . The weights depend on the marginals () of , the auxiliary distribution and the target distribution . Based on , we define the estimator of as

 ˆENCIS(f) :=N∑i=1Wi(x1:N,ξ)f(xi)=EˆπNCIS(f). (6)

Define the joint density

 ˜πN(k,y,x1:N,ξ):=N−1π(y)ΓN(x1:N,ξ|y,k). (7)

Lemma 1 gives some fundamental properties of and shows that the expectation of is if the marginal distribution . We use , additively, within an MCMC scheme to construct unbiased estimators of . The unbiasedness property is critical for the variance reduction techniques in Section 5.

Theorem 1.

Suppose that is finite, is a sample from , and that is generated from . Then,

1. .

2.  ˜πN(k,y|x1:N,ξ) =N∑i=1Wi(x1:N,ξ)I(k=i)T(xi,y;ξ), (8) or equivalently, ˜πN(K=i|x1:N,ξ) =Wi(x1:N,ξ)and˜πN(y|x1:Nξ,k)=T(xk,y;ξ). (9)
3. .

Remark 1.

We now compare importance sampling to conditional importance sampling. In importance sampling, we draw particles from an importance or proposal density with marginal densities and calculate their importance weights

 Wi(x1:N):=wi(xi)∑Nj=1wj(xj),% wherewi(xi):=m(xi)qi(xi),

to obtain the approximation to . The IS sampling estimate of is

 ˆENIS(f):=N∑i=1Wi(x1:N)f(xi)=EˆπNIS(f) (10)

In the simplest case, the particles are sampled independently from the same proposal distribution , i.e., and . Despite similarities, there fundamental differences between using and .

1. The marginal distribution of a sample from is not , while the distribution of from is . Similarly,

 Eq(ˆEISπ(f))≠Eπ(f), (11)

whereas .

2. The weights in the CIS may depend on an auxiliary variable , with density , that incorporates past information in the proposal opening the possibility for using local proposals. Moreover, it can be used as a mechanism to bound the weights and provide more robust estimators.

2.3 Markov Interacting Importance Sampling Algorithm

The MIIS algorithm simulates from the target distribution on . It iterates by first constructing a discrete approximation to using the CIS, conditional on the previous state of the Markov Chain, and then samples from the approximation. It requires specifying a joint proposal distribution , an auxiliary distribution , and a Markov transition kernel .

Algorithm 2 (Markov Interacting Importance Sampler).

Given and , at step

1. Generate .

2. Generate .

3. Generate

 X(t)∖k(t−1)∣∣(x(t−1)k(t−1),k(t−1),ξ(t))∼q∖k(t−1)(x(t)∖k(t−1)∣∣x(t)k(t−1),k(t−1),ξ(t)).
4. For , calculate

Draw with probability .

5. Generate .

We divide the algorithm into two blocks. The first block consists of steps 1 to 3 and uses the CIS to draw an approximation to . It corresponds to Algorithm 1 in Section 2.2. The second block consists of steps 4 and 5 and draws an element from this approximation. It corresponds to part (ii) of Theorem 1.

The MIIS algorithm is a Gibbs sampler on an augmented space that contains all variables sampled in the CIS step, i.e., it is a Gibbs sampler targeting (7). It also follows that if , the marginal distribution of is the original target ; the MIIS algorithm generates samples from without the approximation error induced by the CIS step.

Theorem 2 (Target Distribution).

The Markov Interacting Importance Sampler is a Gibbs sampler targeting the augmented density (7) that has as a marginal density.

3 Examples

This section illustrates the MIIS methodology in three useful examples. For simplicity, the Markov transition density is set to the identity density, i.e., , which denotes a density in that integrates to 1 and which is zero exact at ; we will sometimes write it as . We do not use the auxiliary variable in the first two examples, which is equivalent to assuming that and . Section 7.2 gives formal convergence results for all three examples.

3.1 Simple Importance Sampling

This specification corresponds to the iterated Sampling Importance Resampling algorithm (i-SIR) in Andrieu et al. (2013). In importance sampling algorithms we generate particles independently from importance distributions (), i.e., . Hence and

 q∖k(x∖k|xk,k,ξ)=N∏i≠kq(xi).

The CIS in this case is

 ΓN(x1:Nξ|y,k)=η(ξ)δ(y−xk)N∏i≠kq(xi).

Algorithm 3 follows from Algorithm 2.

Algorithm 3 (MIIS with Simple Importance Sampling).

Given and ,

1. Generate , for , and set .

2. Draw with probability proportional to .

3. Set .

3.2 Importance Sampling with Antithetic Variables

In the importance sampling literature, the method of antithetic variables consists of drawing perfectly negatively correlated particles to reduce the variance of the Monte Carlo estimate. We can use this method within the MIIS framework. The importance sampler with antithetic variables draws the particles in pairs from a proposal distribution. Suppose that is even. For , let be the density of with corresponding cumulative distribution function and let , where is the inverse of . We write the joint density of as

 qk,N/2+k(xk,xN/2+k)=qk(xk)δQ−1k(1−Qk(xk))(xN/2+k).

The marginals are and the conditional density of given is . For notational simplicity assume . We sample the particle system given from

 q∖k(x∖k|xk,ξ,k) = δQ−1k(1−Qk(xk))(xN/2+k)N/2∏i≠kqi,N/2+i(xi,xN/2+i) = ∏N/2i=1qi,N/2+i(xi,xN/2+i)qk(xk) = q(x1:N)qk(xk),

and the CIS is

 ΓN(x1:Nξ|y,k)=η(ξ)δy(xk)∏N/2i=1qi,N/2+i(xi,xN/2+i)qk(xk).
Algorithm 4 (MIIS with Antithetic Variables).

Given and ,

1. Generate , for .

2. If , set , and . If , set , and .

3. Draw with probability proportional to .

4. Set .

3.3 Random Walk Importance Sampler

The random walk importance sampler draws particles from a symmetric proposal dependent on its past. The advantage is that the method bounds the weights by construction. The random walk proposal performs local exploration around the auxiliary variable , which we sample conditionally on the previous state.

Let denote the proposal functions for and . Then

 q∖k(x∖k|xk,k,ξ)=N∏i≠kϕ(xi−ξ)

The CIS is

 ΓN(x1:N,ξ|y,k)=δy(xk)ϕ(ξ−xk)N∏i≠kϕ(xi−ξ).

The random walk importance sampler bounds the weights if is bounded. The sampling algorithm follows from Algorithm 2

Algorithm 5 (MIIS with Random Walk proposal).

Given and ,

1. Generate

2. Generate , for , and set .

3. Draw with probability proportional to .

4. Set .

4 MIIS Targeting Conditional Distributions

This section shows how to use use the MIIS algorithm within a Gibbs sampling framework. We use the following notation. Suppose we partition as . Then, for , , etc. We define and . For a density , , we define the conditional density and the conditional expectation

 Eνs(⋅|x(∖s))(f):=∫Asf(x)νs(x(s)|x(∖s))dx(s). (12)

4.1 Conditional Importance Sampler for conditional distributions

The CIS for conditional distributions is similar to the CIS in Section 2.2, but now targets , . Given , and , let be the density of the auxiliary variable , conditional on . Let be a the density of a Markov transition kernel, conditional on , that is reversible with respect to .

Given and , let be a joint importance density with marginals (), and

 qs,∖ks(x∖ks(s)|xks,ξ(s),y(∖s)):=qs(x1:N(s)|ξ(s),y(∖s))qs,ks(xks(s)|ξ(s),y(∖s)). (13)
Definition 2 (Conditional Importance Sampler for conditional distributions:).

For , , and , the Conditional Importance Sampler for conditional distributions generates from the probability distribution

 ΓNs(x1:N(s),ξ(s)|y(s),ks,y(∖s))=ηs(ξ(s)|y(s),y(∖s))Ts(y(s),xks(s);ξ(s),y(∖s))×qs,∖ks(x∖ks(s)|xks(s),ks,ξ(s),y(∖s)). (14)

In the CIS for conditional densities, we first generate , then we generate conditional on , and finally the remaining particles conditional and

Suppose we express the target , where we can evaluate . From the output of the CIS for conditional distributions, we define the weights

 Ws,i(x1:N(s);ξ(s)|y(∖s))=ws,i(xi(s);ξ(s)|y(∖s))∑Nj=1ws,j(xj(s);ξ(s)|y(∖s)), (15)

where

 ws,i(xi(s);ξ(s)|y(∖s))=ms(xi(s)|y(∖s))qs,i(xi(s)|ξ(s),y(∖s))ηs(ξ(s)|xi(s),y(∖s)) (16)

and consider as an empirical approximation of . Based on , we define the estimator of as

 ˆENs,CIS(f|y(∖s)) :=N∑i=1Ws,i(xi(s);ξ(s),y(∖s))f(xi(s),y(∖s))=EˆπNs,CIS(⋅|y(∖s))(f). (17)

Analogously to the CIS, define the joint density of conditional on as

 ˜πNs(ks,y(s),x1:N(s),ξ(s)|y(∖s)):=πs(y(s)|y(∖s))NΓNs(x1:N(s),ξ(s)|y(s),ks,y(∖s)). (18)

Lemma 3 gives some properties of the density (18) and shows that if is generated from then the expectation of is .

Theorem 3.

Suppose be a sample from , and a sample from . Then, conditional on ,

• .

• The conditional density of given is

 ˜πNs(ks,y(s)|x1:N(s),ξ(s)) =Ws,ksT(xsk(s),y(s);ξ(s)) or equivalently ˜πNs(ks|x1:N(s),ξ(s)) =Ws,ks and ˜πNs(y(s)|x1:N(s),ξ(s),ks) =Ts(xks(s),y(s);ξ(s)).

4.2 The Markov Interacting Importance Sampler within Gibbs

The algorithm extends the MIIS sampler targeting the full density. It simulates sequentially from the conditional distributions , using the CIS approximation to the conditionals. The method is an alternative to the Metropolis-within-Gibbs algorithm that is is suitable for the application of the variance reduction techniques in Section 5. The MIIS within Gibbs sampler requires the specification of joint proposal distributions , auxiliary distributions , and Markov transition kernels , for each . The general form of the MIIS Gibbs sampler is given by Algorithm 6

Algorithm 6 (The Markov Interacting Importance Sampler within Gibbbs).

Given and , , the algorithm at step , is described as follows, with all terms conditional on and .

1. For ,

1. Generate .

2. Generate

 X(t)k(t−1)s(s)∼T(y(t−1)(s),x(t)k(t−1)s(s);ξ(t)(s))).
3. Generate

 X(t)∖k(t−1)s(s)∼qs,∖k(t−1)ss(x(t)∖k(t−1)s(s)∣∣x(t)k(t−1)s(s),k(t−1)s,ξ(t)(s)),

conditional on .

4. Draw with probability proportional to

 ws,k(x(t)k(s);ξ(t)(s))=ms(x(t)k(s))ηs(ξ(t)(s)|x(t)k(s))qs,k(x(t)k(s)|ξ(t)(s)).
5. Generate

 Y(t)(s)∼T(x(t)k(t)(s),y(t)(s);ξ(t)(s)).
2. Set .

For each partition , the algorithm iterates as in the MIIS algorithm. Steps 1.1 – 1.3 construct an approximation to . Steps 1.4 and 1.5 then draw an element from this approximation. As before, the MIIS for conditional distributions is a Gibbs sampler on an augmented space that contains all variables sampled in the CIS step. It also follows that the marginal distribution of is the original target . Theorem 4 shows the augmented target distribution and that it generates samples from .

Theorem 4 (Target Distribution).

The Markov Interacting Importance Sampler is a Gibbs sampler targeting the augmented distribution given by

 ˜πN(y,ξ,x1:N(1),…,x1:N(d),k1:d)=π(y)Ndd∏s=1ΓNs(x1:N(s),ξ(s)|y(s),ks,y(∖s)), (19)

and has as a marginal distribution of .

4.3 Example: MIIS within Gibbs with Simple Importance Sampling

The MIIS sampler takes the conditional distributions in the Gibbs sampler as the target distributions for the conditional importance samplers. Suppose that we use a simple importance sampling algorithm to construct the CIS approximation. Then, for each ,

 ΓNs(x