MCMC importance samplers

## Abstract

Markov chain Monte Carlo (MCMC) is an approach to parameter inference in Bayesian models that is based on computing ergodic averages formed from a Markov chain targeting the Bayesian posterior probability. We consider the efficient use of an approximation within the Markov chain, with subsequent importance sampling (IS) correction of the inexact Markov chain output, leading to asymptotically exact inference. We detail convergence and central limit theorems for the resulting MCMC-IS estimators. We also consider the case where the approximate Markov chain is pseudo-marginal, requiring unbiased estimators for its approximate marginal target. Convergence results with asymptotic variance formulae are shown for this case, and for the case where the IS weights based on unbiased estimators are only calculated for distinct output samples of the so-called ‘jump’ chain, which, with a suitable reweighting, allows for improved efficiency. As the IS type weights may assume negative values, extended classes of unbiased estimators may be used for the IS type correction, such as those obtained from randomised multilevel Monte Carlo. Using Euler approximations and coupling of particle filters, we apply the resulting estimator using randomised weights to the problem of parameter inference for partially observed Itô diffusions. Convergence of the estimator is verified to hold under regularity assumptions which do not require that the diffusion can be simulated exactly. In the context of approximate Bayesian computation (ABC), we suggest an adaptive MCMC approach to deal with the selection of a suitably large tolerance, with IS correction possible to a finer tolerance, and with approximate confidence intervals provided. A prominent question is the efficiency of MCMC-IS compared to standard direct MCMC, such as pseudo-marginal, delayed acceptance, and ABC-MCMC. We provide a comparison criterion which generalises the covariance ordering to the IS setting.
We give an asymptotic variance bound relating MCMC-IS with the latter chains, as long as the ratio of the true likelihood to the approximate likelihood is bounded. We also perform various experiments in the state space model and ABC context, which confirm the validity and competitiveness of the suggested MCMC-IS estimators in practice.

## Foreword

The modern reliance on probability theory to model the universe and various aspects of life reveals on the one hand our tremendous lack of knowledge and ability to understand and hence predict the workings of the universe with Newtonian precision. On the other hand, the success of probability theory reveals the hidden order of the universe, as well as the significant deductive reasoning capacities of humankind, where from the disorder of incomplete knowledge arises the order of probabilistic laws. Statistics allows us to ascertain to what extent our deductive reasoning is justified by real observation. Statistics acts as the intermediary, allowing dialogue to proceed between our perceived knowledge of the (mechanistic and probabilistic) laws of the universe and of the universe as she presents herself to us in actual fact.

Of utmost importance for the development of statistics has been the increasing computational ability in the computer age [cf. 34]. As the speed of computers increases, so does the potential complexity of the problems which statistical methods can handle with precision. Considerable interest therefore lies in the development of computational methods which are efficient and able to perform the demanding computational tasks of modern statistics. It is the scientific and humanistic hope for this thesis that the work will serve the advancement of human knowledge, and that it will be useful solely to the commendable pursuits of humankind.

As the fields of probability and statistics are intellectually challenging, any small progress in this field is dependent upon a stable, friendly, and stimulating working and living environment. First of all, I would like to thank Dr. Matti Vihola, for being a wonderful adviser, scientist, and person. In the beginning, I knew very little about computational statistics and Monte Carlo methods, but due greatly to Matti’s tremendous help and patience, my knowledge and skills in mathematics and statistics have grown considerably. This thesis would not have been possible without his help. The enclosed introductory text has also benefited greatly from his insightful remarks. As his first sole doctoral student, I have one of the early claims to be able to call him my mathematico-statistical father.

As for a stable, friendly, and stimulating working and living environment, I would like to thank the University of Jyväskylä and its employees, for being able to pursue my doctoral studies here. The last three years have been enjoyable as a place to work, study, and live. Financial support is gratefully acknowledged from the Academy of Finland (‘Exact approximate Monte Carlo methods for complex Bayesian inference,’ grants 284513 and 312605, PI: Dr. Matti Vihola).

Sincere thanks to the reviewers, Prof. Marko Laine (Finnish Meteorological Institute) and Prof. Krzysztof Łatuszyński (University of Warwick).

There are many other individual persons whom I should thank for being a help these last few years. As I drew up an account of all the people whom I would like to thank, it became ever-expanding, touching every aspect and time of my life. I simply could not do proper justice to those who have helped me, and I would run the danger of leaving somebody unintentionally out. I therefore would simply like to thank the many precious people who have had a positive impact on me, without going into all the details here. They and their deeds are simply too many to be entrusted to these few pages.

I think the saying is true, and hope it is true: when someone has stayed somewhere long enough, the place becomes forever a part of the person. I wish to thank the many people in Jyväskylä whom I have enjoyed getting to know during these last few years. Language has not always been an insurmountable barrier. I will miss you, and I will miss Jyväskylä—the snow, the sauna, the summer, the lakes, the festivals, the food, the people—all of which make Finland a special place to live.

I extend my regards to the researchers everywhere whom I have had the privilege to meet. Statistics, like other scientific disciplines such as pure mathematics, involves many devotees interested in a common subject with undesirable distractions kept to a minimum. When immersed in a scientific subject, where validity is judged by logic and observation rather than might or necessity, when one is able to escape the day-to-day absorption of the human condition, then one is able to view the world from a new perspective. One sees like the astronaut, for whom, after seeing the Earth as a single terrestrial mass, life will never be the same.

Jordan Franks

Jyväskylä, Finland

April 4, 2019

## List of included articles

The thesis consists of an introductory text (Sections A-G) and the following articles listed in chronological order of preprint appearance.


• [A] M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
• [B] J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v4, 2017.
• [C] J. Franks, A. Jasra, K. J. H. Law, and M. Vihola. Unbiased inference for discretely observed hidden Markov model diffusions. Preprint arXiv:1807.10259v4, 2018.
• [D] M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint arXiv:1902.00412, 2019.

In the introduction, the above articles are referred to by letters [A], [B], [C], and [D], whereas other article references cited in the introduction shall be referred to by numbers [1], [2], [3], etc.

The author of this dissertation has actively contributed to the research of the joint articles [A], [B], [C] and [D], and has served as the corresponding author for articles [B] and [C]. In particular, the author contributed some theoretical results for augmented Markov chains for [A], and assisted during the research and preparation of the article. Article [B] is mostly the author’s own work, but benefited greatly from the second author’s input, for example, regarding the topic of the paper and some technical details. The author was responsible for the efficiency considerations of multilevel Monte Carlo in [C], and for final preparation of the article. The author was responsible for the analysis of the adaptive Markov chain Monte Carlo algorithm in [D].

## Appendix A Introduction

Bayesian inference often requires the use of computational simulation methods known as Monte Carlo methods. Monte Carlo methods use probabilistic sampling mechanisms and ensemble averages to estimate expectations, such as those taken with respect to the posterior probability of a Bayesian model. Because many samples may be needed for accurate estimates, Monte Carlo methods can be computationally intensive in practice.

A further inferential challenge arises when the likelihood function of the Bayesian model is intractable. In some important settings, it is possible to obtain an unbiased estimator for the likelihood. One such setting is the state space model, where sequential Monte Carlo supplies the unbiased estimator. In settings where unbiased estimators are not possible, approximate Bayesian computation (ABC) may be used, assuming forward generation of the model is possible and a choice of tolerance size has been made. Though only an approximation to the original Bayesian model, the ABC model comes equipped with a straightforward unbiased estimator for its ABC likelihood.
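The ‘straightforward unbiased estimator’ of the ABC likelihood is simply the acceptance frequency of simulated data falling within the tolerance. The following is a minimal sketch; the Gaussian toy model, identity summary statistic, and tolerance value are assumptions made purely for illustration, not taken from the thesis.

```python
import math
import random

def abc_likelihood_estimate(theta, y_obs, eps, n_sim, rng):
    """Unbiased estimate of the ABC likelihood L_eps(theta) = P(|Y - y_obs| <= eps),
    where Y is a forward simulation from the model given theta.
    The average of acceptance indicators is unbiased for this probability."""
    hits = 0
    for _ in range(n_sim):
        y_sim = rng.gauss(theta, 1.0)      # forward generation from the (toy) model
        if abs(y_sim - y_obs) <= eps:
            hits += 1
    return hits / n_sim

rng = random.Random(1)
est = abc_likelihood_estimate(0.0, 0.0, 1.0, 20000, rng)
# For this toy model, L_eps(0) = P(|Z| <= 1) for Z ~ N(0,1), about 0.683
exact = math.erf(1.0 / math.sqrt(2.0))
```

Averaging more simulations reduces the variance of the estimate but leaves it unbiased, which is what the pseudo-marginal machinery below requires.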

In these two example settings, a Markov chain can be run, allowing for Markov dependence in the samples, as well as the use of the unbiased estimators for the (ABC) likelihood as part of a pseudo-marginal algorithm. As a result, the samples of the Markov chain are drawn asymptotically from the (ABC) posterior, and inference is based on averaging the samples obtained. This computational approach to Bayesian inference is known as Markov chain Monte Carlo (MCMC).

This thesis is concerned with a slightly different approach, namely, where the Markov chain targets an approximate marginal of the (ABC) posterior. The subsequent importance sampling (IS) type correction is performed by a reweighting of the inexact sample output, using the unbiased estimators, which yields asymptotically exact inference for the (ABC) posterior. The use of an approximation for the Markov chain target can be computationally efficient, as can be the parallel calculation of the IS weights on a parallel computer. Some of the resulting MCMC-IS estimators are well-known, but in practice have been used only rarely, in comparison to direct MCMC. In addition, the MCMC-IS approach is shown to offer additional flexibility compared to direct MCMC.

The rest of this Section A is laid out as follows. We present some important notions, such as that of a statistical model, likelihood function, and Bayesian model. We briefly describe the general goal of (likelihood-based) parameter inference in statistics, as well as some of the challenges of computation which the thesis seeks to address, specifically inference aided by use of an approximation. Section A.5 concludes with an outline of the remainder of the text.

### a.1. Likelihoods

A statistical model is composed of an observational space $\mathsf{Y}$, together with its $\sigma$-algebra of subsets $\mathcal{B}(\mathsf{Y})$, and a set $\mathcal{P}$ of probability distributions on $(\mathsf{Y}, \mathcal{B}(\mathsf{Y}))$ [cf. 38]. We assume the family $\mathcal{P}$ is parametrised by a vector of model parameters $\theta \in T$, with $T \subset \mathbb{R}^d$ for some $d \geq 1$. That is,

$$\mathcal{P} = \{p^{(\theta)}\}_{\theta \in T},$$

where $p^{(\theta)}$ is a probability on $(\mathsf{Y}, \mathcal{B}(\mathsf{Y}))$, sometimes called the data distribution. The probability $p^{(\theta)}$ corresponds to a modeling of the dependency relationship of the observation $y \in \mathsf{Y}$, considered as a random variable, on the model parameter $\theta$.

We assume for simplicity in this introduction that $p^{(\theta)}$ has a density, also denoted $p^{(\theta)}$, with respect to a $\sigma$-finite reference measure on $\mathsf{Y}$. Fixing the observation $y$, we define the function

$$L(\theta) := p^{(\theta)}(y),$$

which is known as the likelihood. One type of likelihood-based inference for $\theta$ is to answer which values of $\theta$ maximise $L(\theta)$. This method of inference is known as maximum likelihood estimation (MLE) in a statistical model with observation $y$ [cf. 18]. In other words, MLE seeks to answer which probability distribution in $\mathcal{P}$ would most readily give rise to the observation.
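As a toy illustration of MLE (a sketch with assumed Bernoulli data, not an example from the text), the likelihood can be maximised numerically and checked against the closed-form maximiser, the sample mean.

```python
def bernoulli_likelihood(theta, y):
    """L(theta) = prod_i theta^{y_i} (1 - theta)^{1 - y_i} for binary data y."""
    L = 1.0
    for yi in y:
        L *= theta ** yi * (1.0 - theta) ** (1 - yi)
    return L

y = [1, 0, 1, 1, 0, 1, 1, 1]                    # assumed observed coin flips
grid = [i / 1000.0 for i in range(1, 1000)]     # candidate theta values in (0, 1)
theta_mle = max(grid, key=lambda t: bernoulli_likelihood(t, y))
# the grid maximiser agrees with the closed form: the sample mean 6/8 = 0.75
```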

### a.2. Bayesian inference

In practice, MLE is highly dependent on the candidate set $\mathcal{P}$ of probabilities to consider. The set $\mathcal{P}$ could be parametrised by arbitrarily high dimensions of parameters, and is the result of the statistician’s modeling of the dependence of the observation $y$ on the model parameter $\theta$. Going further, in light of this arbitrary construction of the set $\mathcal{P}$, the statistician is arguably (the frequentist approach differs from the Bayesian approach considered here [cf. 34]) not out of bounds to specify which values $\theta$ are to be considered more probable and with more weight, based on prior knowledge or hypothesis.

This specification, for a statistical model with known observation, leads to the Bayesian model [cf. 37]. The Bayesian model consists of an assignment of a prior probability to the model parameters, with density also denoted $\mathrm{pr}$. Inference for the Bayesian model then consists of quantification of the posterior probability

$$\pi(\theta) := p(\theta \mid y) = \frac{L(\theta)\,\mathrm{pr}(\theta)}{p(y)}, \tag{1}$$

where the last equality, giving the posterior in terms of the likelihood and prior, is the practically useful formula of Bayes, and $p(y)$ is the model evidence, defined by

$$p(y) := \int_T L(\theta)\,\mathrm{pr}(\theta)\,d\theta.$$
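For a model with tractable likelihood, Bayes’ formula (1) and the evidence integral can be illustrated by simple quadrature on a grid. This is a minimal sketch; the conjugate Gaussian model, step size, and grid bounds are assumptions of the illustration.

```python
import math

def prior(theta):
    """pr(theta): standard normal prior density."""
    return math.exp(-theta ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

def likelihood(theta, y):
    """L(theta) = N(y; theta, 1) density for a single observation y."""
    return math.exp(-(y - theta) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

y = 1.0
h = 0.01
grid = [-8.0 + h * i for i in range(1601)]              # grid covering the support
unnorm = [likelihood(t, y) * prior(t) for t in grid]    # L(theta) * pr(theta)
evidence = sum(u * h for u in unnorm)                   # p(y) by quadrature
posterior = [u / evidence for u in unnorm]              # pi(theta) on the grid
post_mean = sum(t * p * h for t, p in zip(grid, posterior))
# conjugacy gives the exact posterior N(1/2, 1/2), so post_mean is close to 0.5
```

The grid approach illustrates why normalisation is the bottleneck: in higher dimensions the evidence integral becomes computationally prohibitive, motivating the Monte Carlo methods of this thesis.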

### a.3. Challenges for inference

In statistical models of practical interest, the likelihood $L$ is often intractable, meaning that it can not be evaluated pointwise. However, in many settings which we consider, we will see that $L(\theta)$ can be estimated unbiasedly, meaning it is possible to generate a random variable $\hat L(\theta)$ such that $E[\hat L(\theta)] = L(\theta)$. Even so, a reliable unbiased estimator may be neither directly available nor efficient to construct.

The posterior of the Bayesian model is in general intractable, and can not even be estimated unbiasedly. This is often the case even if the likelihood is tractable, because of the normalisation by the model evidence in (1), which is usually computationally intensive to calculate. In case of intractable likelihood in the Bayesian model, posterior inference is even more of a challenge, and one must usually rely on ergodic averages from Markov chain Monte Carlo (MCMC). Such averages are generally asymptotically exact (i.e. consistent) as the number of iterations of the MCMC algorithm increases, in the sense of a law of large numbers. However, MCMC can be computationally expensive to run. It can take hours, days, weeks, or longer, in order to ensure a ‘reliable’ MCMC estimate, where the level of reliability can be theoretically difficult to justify.
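A minimal random-walk Metropolis sketch illustrates such ergodic averaging (the toy posterior is an assumption of the illustration, not one of the algorithms of the thesis). Note that the method needs the posterior only up to the intractable normalising constant $p(y)$.

```python
import math
import random

def log_post(theta):
    """Unnormalised log-posterior: N(0,1) prior and one observation y = 1
    with N(theta, 1) likelihood; the evidence p(y) is never needed."""
    return -theta ** 2 / 2.0 - (1.0 - theta) ** 2 / 2.0

rng = random.Random(0)
theta, chain = 0.0, []
for _ in range(50000):
    prop = theta + rng.gauss(0.0, 1.0)                  # random-walk proposal
    if math.log(rng.random()) < log_post(prop) - log_post(theta):
        theta = prop                                    # accept
    chain.append(theta)                                 # on rejection, repeat state
est = sum(chain[5000:]) / len(chain[5000:])             # ergodic average after burn-in
# the posterior here is N(1/2, 1/2), so est should be near 0.5
```

Even in this one-dimensional toy problem, tens of thousands of dependent samples are used for a two-digit answer, which gives a feel for why reliability of MCMC estimates can be costly to ensure.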

### a.4. Approximate families

We will see that the use of approximations can help facilitate tractable, efficient and user-friendly inference. Let $\mathcal{P}_\infty$ denote a set of (ideal) model probability distributions on $(\mathsf{Y}, \mathcal{B}(\mathsf{Y}))$. In many cases in practice, it may be desirable to work with a surrogate family of probability distributions. Going further, it may be desirable to work with a family

$$\mathcal{P}_\ell = \{p^{(\theta,\ell)}\}_{\theta \in T},$$

of data (probability) distributions, with $\ell$ used to indicate families of increasingly ‘better’ approximations. For example, inference using $\mathcal{P}_\infty$ may be too difficult to achieve or too costly, in which case using an approximate family $\mathcal{P}_\ell$ may be possible instead.

It is conceivably possible to incorporate $\mathcal{P}_\ell$ in a Bayesian inference method, which may lessen the computational cost of the algorithm, while in the end performing inference for $\mathcal{P}_\infty$. The aim of this thesis is to show general strategies in different settings where this is possible.

### a.5. Overview

We now outline the remainder of this text. (As for the intended audience, in order to keep the text of moderate size we must suppose some notions from analysis [cf. 84], probability [cf. 38], simulation [cf. 65], and statistics [cf. 37]. We try to strike a balance, to make the text of interest both to those knowledgeable and less knowledgeable in the subject matter of the thesis.) The text seeks to serve as an introduction and summary for the thesis papers listed in the List of included articles. Methodological aspects are stressed for this introduction to the articles, as are some of the supporting theoretical results. Most of the details are left to the articles. For this introductory text, we do not give algorithms and results in full generality and for all cases. Rather we focus on a few important cases. For example, we consider only a few specific Markov chains, rather than general Harris ergodic chains for the IS correction, and we focus on the use of unbiased estimators from particle filters (also known as sequential Monte Carlo) in state space models, rather than from general importance sampling distributions in latent variable models. Some more generality is provided in the original articles listed in the List of included articles.

We begin in Section B with a specific problem of intractable likelihoods for statistical models, namely, that of the state space model, and review how interacting particle systems known as particle filters [48, 98] can lead to unbiased estimators of the $\infty$-likelihood (the likelihood corresponding to the family $\mathcal{P}_\infty$), as long as the dynamics of the state space model can be simulated. We also detail an MCMC known as the particle marginal Metropolis-Hastings (PMMH) [5] (see also [9, 13, 63, 73]), which allows for $\infty$-inference for the corresponding Bayesian model posterior, when unbiased $\infty$-likelihood estimators are available.

In Section C, as in [A] we consider two different MCMCs, which are intended for acceleration of PMMH, and which are based on use of an approximate family $\mathcal{P}_0$ and unbiased estimators for the $0$-likelihood. These are the delayed acceptance (DA) MCMC [cf. 63, 12, 20, 21, 45, 65] and the MCMC-IS [cf. 28, 41, 42, 52, 77, A], both of which allow for unbiased estimators of the $0$-likelihood and $\infty$-likelihood, which can be useful when deterministic approximations are not available. (The references [45] for DA and [A] for MCMC-IS are most relevant in the unbiased estimator context for intractable likelihoods.) Based on an extension of the covariance ordering [71] to the IS setting, with differing reversible stationary probabilities and with unbiased estimators, we seek to compare the algorithms in terms of statistical efficiency, as in [B].

Section D is concerned with a discretely and partially observed Itô diffusion, where unbiased $\infty$-likelihood estimates can not be directly obtained by the particle filter, because the dynamics of the diffusion can not be simulated. Instead, approximate families based on Euler approximation [cf. 60] are used, along with multilevel [53, 40], randomisation [68, 81] and particle filter coupling [57] techniques, leading to an unbiased estimator for the $\infty$-likelihood and to the Bayesian $\infty$-posterior by using an MCMC-IS with randomised weights, as in [C].

Section E is concerned with Bayesian models with intractable likelihoods, where an unbiased estimator of the likelihood is not readily available, but where it is at least possible to generate artificial observations from $p^{(\theta)}$. The approach of approximate Bayesian computation [cf. 96] is to use families of approximations to $\mathcal{P}$, indexed by the ‘tolerance,’ which is difficult to choose. We detail an approach based on an adaptive MCMC, as well as MCMC-IS [102], with approximate confidence intervals for post-correction to finer tolerance, as in [D].

We close with a brief discussion of ideas for future work in Section F and provide expanded individual summaries for the thesis papers in Section G.

## Appendix B Bayesian inference for state space models on path space

We introduce a well-known class of models based on latent variables on a state space $\mathsf{X}$ and conditionally independent observations on $\mathsf{Y}$, which is sufficiently general and motivates a main application area of Articles [A], [B] and [C] based on unbiased estimators and approximate families $\mathcal{P}_\ell$.

### b.1. Discretely-observed continuous-time path-dependent process

To motivate this general class of models, we consider an example of a continuous-time latent process. Suppose there is a process $(X_t)_{t \geq 0}$ of latent or hidden states, where $X_t$ depends on the past $(X_s)_{s < t}$ and the model parameter $\theta$. Also, suppose $(Y_t)_{t \geq 0}$ is another process (of observations), where $Y_t$ depends on $(X_s)_{s \leq t}$ and $\theta$. We make the realistic assumption (since the continuum can not easily be recorded by electronic or other physical means) that only finitely many observations are gathered, at observation times $t_0 < t_1 < \cdots < t_n$.

Let us write $X_p := X_{t_p}$ and $Y_p := Y_{t_p}$, define $X_{0:p} := (X_0, \ldots, X_p)$, and for a fixed parameter value $\theta \in T$, consider the following dependency structure involving conditionally independent observations:

$$\begin{array}{ccccccc}
\cdots \longrightarrow & X_{0:p-1} & \longrightarrow & X_{0:p} & \longrightarrow & X_{0:p+1} & \longrightarrow \cdots \\
 & \downarrow & & \downarrow & & \downarrow & \\
 & Y_{p-1} & & Y_p & & Y_{p+1} &
\end{array}$$

Here, the arrows denote a dependency relationship, described in the following, where the initial state $X_0$ is drawn from an initial distribution $\eta_0^{(\theta)}$. The dynamics between the states (on path space) $X_{0:p-1}$ and $X_{0:p}$ is defined by a Markov probability kernel (on path space), where

$$\bar M_p^{(\theta,\infty)}(x_{0:p-1}, dx'_{0:p}) := \mathbf{1}\{x'_{0:p-1} = x_{0:p-1}\}\, M_p^{(\theta,\infty)}(x_{0:p-1}, dx'_p),$$

where $M_p^{(\theta,\infty)}$ is a Markov probability kernel induced by the dynamics of the path-dependent continuous-time process. The observations are obtained via $Y_p \sim g_p^{(\theta)}(\,\cdot \mid X_{0:p})$, where $g_p^{(\theta)}$ is the observational density.

Let us also set, as shorthand for the potential,

$$G_p^{(\theta)}(x_{0:p}) := g_p^{(\theta)}(y_p \mid x_{0:p}).$$

The model described above in terms of the pair $(M^{(\theta,\infty)}, G^{(\theta)})$ is known as a path-dependent state space model (SSM) (an SSM is also known as a hidden Markov model [cf. 18], especially in the engineering disciplines), or, more succinctly, as a Feynman-Kac model [cf. 22].

Simulation methods for the $\infty$-Feynman-Kac model are impossible if the dynamics $M^{(\theta,\infty)}$ can not be simulated exactly. Besides some (important) exceptions, this is in general the case for continuous-time latent processes [cf. 23, Sect. 1.3]. However, we will see when we consider Itô diffusions in Section D that one can often obtain Euler type approximations of the original process, with precision denoted ‘$\ell$’, leading to approximate dynamics $M^{(\theta,\ell)}$ between observation times [cf. 23, 60]. Using the same observational densities as for the exact model, we obtain a Feynman-Kac model $(M^{(\theta,\ell)}, G^{(\theta)})$ derived from the Euler type approximation of the dynamics.

### b.2. Model probabilities

We now describe some of the probabilities associated to a Feynman-Kac model $(M^{(\theta,\ell)}, G^{(\theta)})$.

First, we define a bit of standard notation from analysis. If $\mu$ is a probability measure and $s \geq 1$, we denote by $L^s(\mu)$ the Banach space of real-valued functions $\phi$, modulo equivalence in norm, with finite norm

$$\mu(|\phi|^s)^{1/s} < \infty, \qquad \text{where} \quad \mu(\phi) := \int \phi(x)\,\mu(dx). \tag{2}$$

Consider now the conditional $\ell$-model probability on the latents, or $\ell$-smoother,

$$p^{(\theta,\ell)}(dx_{0:n}) = \frac{p_u^{(\theta,\ell)}(dx_{0:n})}{p_u^{(\theta,\ell)}(1)}, \tag{3}$$

where

$$p_u^{(\theta,\ell)}(dx_{0:n}) = \Big(\prod_{p=0}^{n} G_p^{(\theta)}(x_{0:p})\Big)\, \eta_0^{(\theta)}(dx_0) \prod_{p=1}^{n} M_p^{(\theta,\ell)}(x_{0:p-1}, dx_p). \tag{4}$$

(In the notation $p_u^{(\theta,\ell)}(1)$, we view $1$ as the constant function $x \mapsto 1$, and the integral as in (2).) Then $p^{(\theta,\ell)}$ represents the probability to observe the latent states given the observations according to the Feynman-Kac model $(M^{(\theta,\ell)}, G^{(\theta)})$. In terms of a statistical model (really on the latent space, but we view the observation $y$ as fixed and therefore disregard it in the notation), we have $\mathcal{P}_\ell = \{p^{(\theta,\ell)}\}_{\theta \in T}$ with $p^{(\theta,\ell)}$ defined in (3), for the Feynman-Kac model.

The joint $\ell$-posterior probability for the Bayesian model over model parameters and latent states is then

$$\pi^{(\ell)}(d\theta, dx_{0:n}) \propto \mathrm{pr}(d\theta)\, p_u^{(\theta,\ell)}(dx_{0:n}). \tag{5}$$

Writing the marginal $\ell$-likelihood as $L^{(\ell)}(\theta) := p_u^{(\theta,\ell)}(1)$ and considering the marginal $\ell$-posterior $\pi^{(\ell)}(\theta)$ on $T$, we obtain a formula more familiar from (1) given in the beginning of Section A.2, namely,

$$\pi^{(\ell)}(\theta) \propto \mathrm{pr}(\theta)\, L^{(\ell)}(\theta).$$

The main topic of this thesis is the incorporation of an $\ell$-approximation within an $\infty$-inference method, to obtain efficient and user-friendly inference with respect to $\pi^{(\infty)}$ and $p^{(\theta,\infty)}$.

### b.3. Particle filter

Ignoring the $\theta$ and $\ell$ labels, we have seen that a Feynman-Kac model (with time horizon $n$) is defined through a pair $(M, G)$, where,

1. $M_p$ is a Markov ‘transition’ kernel for $1 \leq p \leq n$, and $\eta_0$ is a probability measure, and

2. $G_p$ is a nonnegative ‘potential’ function for $0 \leq p \leq n$.

The particle filter (PF) (Algorithm 1) was popularised in [e.g. 98, 48], and allows for unbiased estimation [cf. 22, 32] of

$$p_u(dx_{0:n}) = \Big(\prod_{p=0}^{n} G_p(x_{0:p})\Big)\, \eta_0(dx_0) \prod_{p=1}^{n} M_p(x_{0:p-1}, dx_p), \tag{6}$$

for (traditional) SSMs that are not path-dependent, that is,

$$M_p(X_{0:p-1}, dx'_p) = M_p(X_{p-1}, dx'_p), \tag{7}$$
$$G_p(X_{0:p}) = G_p(X_p). \tag{8}$$

However, straightforward generalisation also allows for unbiased estimators in the path-dependent setting of Feynman-Kac models, at least when the dynamics can be simulated [cf. 22]. In addition, as is well-known, general resampling schemes in PF (Algorithm 1) for the ancestor random variables $A_p^{(k)}$ do lead to unbiased estimators, since the equality

$$E\Big[\sum_{k=1}^{N} \mathbf{1}\{A_p^{(k)} = i\}\Big] = \frac{N V_p^{(i)}}{V_p^{*}},$$

is assumed satisfied for all $i$ in PF (Algorithm 1) [cf. A, Prop. ]. Such resampling schemes include the popular multinomial, stratified, residual, and systematic resampling [cf. 18, 29, 32].

The unbiased estimator, from the output of PF (Algorithm 1) run for a Feynman-Kac model $(M, G)$, is obtained by setting

$$\hat p_u(\phi) := \sum_{i=1}^{N} V_n^{(i)}\, \phi(X^{(i)}), \tag{9}$$

for suitable test functions $\phi$, which satisfies

$$E[\hat p_u(\phi)] = p_u(\phi). \tag{10}$$
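To make the estimator concrete, the following is a minimal bootstrap particle filter sketch for the traditional (non-path-dependent) case (7)–(8), where the unbiased estimate $\hat p_u(1)$ is the running product of average potentials. The Gaussian toy model, multinomial resampling, and all function names are assumptions of this sketch, not taken from Algorithm 1.

```python
import math
import random

def particle_filter(eta0, M, G, n, N, rng):
    """Bootstrap PF for a non-path-dependent Feynman-Kac model:
    eta0(rng) samples the initial distribution, M(p, x, rng) samples the
    transition M_p, and G(p, x) evaluates the potential G_p.
    Returns the estimate of p_u(1): the product over time of average
    potentials, which multinomial resampling keeps unbiased."""
    xs = [eta0(rng) for _ in range(N)]
    like = 1.0
    for p in range(n + 1):
        ws = [G(p, x) for x in xs]
        like *= sum(ws) / N                          # unbiased likelihood factor
        if p < n:
            xs = rng.choices(xs, weights=ws, k=N)    # multinomial resampling
            xs = [M(p + 1, x, rng) for x in xs]      # propagate particles
    return like

rng = random.Random(0)
# toy Gaussian SSM: random-walk latents with bounded Gaussian-shaped potentials
like = particle_filter(
    lambda r: r.gauss(0.0, 1.0),
    lambda p, x, r: r.gauss(x, 1.0),
    lambda p, x: math.exp(-x * x / 2.0),
    n=3, N=200, rng=rng,
)
# sanity check: a constant potential G = 0.5 gives exactly 0.5^(n+1),
# independently of the resampling randomness
const = particle_filter(
    lambda r: r.gauss(0.0, 1.0),
    lambda p, x, r: r.gauss(x, 1.0),
    lambda p, x: 0.5,
    n=3, N=50, rng=rng,
)
```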

An important point is that particle approximations (9) of $p_u$, through the PF for the model $(M, G)$, are not unique [cf. 22, Sect. 2.4.2]. One standard way to obtain a different particle approximation is merely changing the Feynman-Kac model $(M, G)$ to $(\tilde M, \tilde G)$, but in such a way that

$$\Big(\prod_{p=0}^{n} \tilde G_p(x_{0:p})\Big)\, \tilde\eta_0(dx_0) \prod_{p=1}^{n} \tilde M_p(x_{0:p-1}, dx_p) = \Big(\prod_{p=0}^{n} G_p(x_{0:p})\Big)\, \eta_0(dx_0) \prod_{p=1}^{n} M_p(x_{0:p-1}, dx_p) \tag{11}$$

holds, and running the PF (Algorithm 1) for the new Feynman-Kac model. From (6) and (10), it follows that the unbiased estimator from the PF run for $(\tilde M, \tilde G)$ delivers the same unbiased estimation for $p_u$ corresponding to the model $(M, G)$. As an example for $\tilde G_p$, consider

$$\tilde G_p(x_{0:p}) := G_p(x_{0:p})\, \frac{M_p(x_{0:p-1}, dx_p)}{\tilde M_p(x_{0:p-1}, dx_p)}(x_{0:p})$$

in the sense of a Radon-Nikodým derivative [cf. 94], which always exists if $M_p$ and $\tilde M_p$ admit densities and a support condition holds.

This non-uniqueness opens the door to consider more efficient PF implementations for a particular model and filtering/smoothing problem [cf. 22, 32, 49, 79]. The question of the optimal choice of $(\tilde M, \tilde G)$ for the smoothing problem (i.e. unbiased estimation of $p_u$) has been considered in [49]. As the optimal choice is usually not implementable, [49] suggest an adaptive iterative algorithm, based on approximating families of mixtures of normals, in order to approximately find the optimal choice (see also e.g. [54] for a related method). Deterministic approximations, such as Laplace approximations [cf. 87], can also be used as a substitute for the optimal transition [A] (see also [64]). We emphasise that all the above mentioned approaches to the optimal choice problem achieve unbiased estimation of $p_u$, as they use appropriately weighted potentials so that (11) holds.

Latent inference with respect to $p^{(\theta)}$ is possible through the PF, at least when the dynamics can be simulated, by using a ratio estimator targeting $p^{(\theta)}(\phi)$. That is, if $\hat p_{u,k}(\phi)$ with $k = 1, \ldots, m$ are formed as in (9) from $m$ independent runs of PF (Algorithm 1) for $(M, G)$ (traditionally in particle filtering [cf. 32], latent inference (12) is done with $m = 1$, possibly with a final resampling to form uniformly weighted particles, but final resampling leads to higher variance of the resulting estimator and is unnecessary here), then

$$\frac{\sum_{k=1}^{m} \hat p_{u,k}(\phi)}{\sum_{k=1}^{m} \hat p_{u,k}(1)} \xrightarrow{\;m \to \infty\;} p^{(\theta)}(\phi), \qquad \text{almost surely.} \tag{12}$$

We remark that the above estimator (12), as mentioned for example in [19, Eq. 1], is an IS analogue of the ‘particle independent Metropolis-Hastings’ (PIMH) [5] chain for latent smoothing. The algorithm based on (12) is completely parallelisable and does not depend on the mixing of a chain, and is therefore relatively resilient with respect to the number of particles $N$. Straightforward consistent estimators to construct confidence intervals are also available [cf. A, Prop. 23].
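The ratio estimator (12) needs only unbiased pairs $(\hat p_{u,k}(\phi), \hat p_{u,k}(1))$. As a self-contained sketch, plain importance sampling stands in for the independent PF runs (an assumption of this illustration), targeting the unnormalised density $p_u(x) = e^{-x^2/2}$:

```python
import math
import random

def phat_pair(phi, n_particles, rng):
    """One unbiased pair (phat_u(phi), phat_u(1)) for the unnormalised
    target p_u(x) = exp(-x^2 / 2), via importance sampling from N(0, 2^2)
    (standing in here for one independent particle filter run)."""
    num = den = 0.0
    for _ in range(n_particles):
        x = rng.gauss(0.0, 2.0)
        q = math.exp(-x ** 2 / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))  # proposal density
        w = math.exp(-x ** 2 / 2.0) / q                                 # p_u density / proposal
        num += w * phi(x)
        den += w
    return num / n_particles, den / n_particles

rng = random.Random(2)
pairs = [phat_pair(lambda x: x * x, 100, rng) for _ in range(2000)]
ratio = sum(a for a, _ in pairs) / sum(b for _, b in pairs)   # estimator (12)
# p_u normalises to N(0, 1), so the ratio estimates E[X^2] = 1
```

Because each pair is computed independently, the $m$ runs can be distributed over parallel workers with no coordination beyond the final two sums.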

### b.4. Particle marginal Metropolis-Hastings

The main task in which we are interested is joint $\infty$-inference with respect to $\pi^{(\infty)}$. So far, we have only shown how to perform inference for $p^{(\theta)}$ and $L^{(\infty)}(\theta)$, with $\theta$ fixed. Surprisingly [cf. 9, 13, 63, 73], joint inference is possible, using an MCMC known as the particle marginal Metropolis-Hastings (PMMH) [5].

Assuming a suitable positivity condition for all iterations of the PMMH chain (Algorithm 2), or a similarly mild condition ensuring Harris ergodicity of the chain [cf. 70], the estimator formed from PMMH is strongly consistent: for $f \in L^1(\pi^{(\infty)})$,

$$E_m^{\mathrm{PM}}(f) := \frac{1}{m} \sum_{k=1}^{m} \frac{\sum_{i=1}^{N} V_k^{(i)} f(\Theta_k, X_k^{(i)})}{\sum_{j=1}^{N} V_k^{(j)}} \xrightarrow{\;m \to \infty\;} \pi^{(\infty)}(f), \qquad \text{almost surely.} \tag{14}$$

We remark about ‘Metropolis-Hastings type’ MCMC. The PMMH [5] is used in state space models using PFs, pseudo-marginal MCMC [9, 13, 63, 73] is the general term for the chain used in latent variable models with unbiased estimators, and Metropolis-Hastings MCMC [69, 52] is used in Bayesian models with tractable likelihoods. In fact, it is possible to view each of these ‘Metropolis-Hastings type’ MCMCs as an instantiation of the others: one direction follows by viewing the pseudo-marginal MCMC and PMMH as full-dimensional Metropolis-Hastings kernels on an extended state space, while the other direction follows by trivialisation [cf. 9].
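The pseudo-marginal mechanism underlying PMMH can be sketched on a two-point parameter space: a non-negative unbiased likelihood estimate replaces the true likelihood, and the *retained* estimate is reused in the acceptance ratio, which leaves the exact posterior invariant. The toy likelihood and noise distribution are assumptions of this sketch, not the full Algorithm 2.

```python
import random

def lik(theta):
    """'Intractable' likelihood on the two-point space {0, 1} (toy values)."""
    return 1.0 if theta == 0 else 3.0

def lik_hat(theta, rng):
    """Non-negative unbiased estimator: E[lik_hat(theta)] = lik(theta)."""
    return lik(theta) * (0.5 + rng.random())

rng = random.Random(3)
theta, L_hat = 0, lik_hat(0, rng)
count1, iters = 0, 200000
for _ in range(iters):
    prop = 1 - theta                           # symmetric flip proposal, uniform prior
    L_prop = lik_hat(prop, rng)
    # pseudo-marginal rule: the retained estimate L_hat enters the ratio
    if rng.random() < min(1.0, L_prop / L_hat):
        theta, L_hat = prop, L_prop
    count1 += theta
freq = count1 / iters
# the chain is exact despite the noise: freq estimates P(theta = 1) = 3/4
```

Replacing the retained estimate by a fresh draw at every iteration would break exactness; retaining it is precisely what makes the extended chain target a distribution with the correct marginal.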

## Appendix C Accelerations based on an approximation

The Metropolis-Hastings MCMC has served as the backbone of the MCMC revolution for half of the last century [27], while pseudo-marginal MCMC and the PMMH have been quite popular and extensively used in the current century (see [B, Sect. ] for a review). Because of the importance of these MCMCs, there has been considerable interest in their possible acceleration. We focus on acceleration of the PMMH in the following.

Usually by far the most computationally intensive part of the PMMH is running the PF (Algorithm 1) for the $\infty$-model, with output $(V^{(i)}, X^{(i)})_{i=1}^{N}$, to obtain the unbiased estimator

$$\hat L^{(\infty)}(\theta) := \hat p_u^{(\theta,\infty)}(1) = \sum_{i=1}^{N} V^{(i)}$$

of the likelihood $L^{(\infty)}(\theta)$. The idea of acceleration based on approximation is to substitute a computationally cheaper (non-negative unbiased estimator of an) approximation for the $\infty$-likelihood, instead of using $\hat L^{(\infty)}(\theta)$. One would also like to maintain (strong) consistency of the resulting estimator for the $\infty$-posterior.

### c.1. Delayed acceptance and importance sampling

One such popular acceleration algorithm is delayed acceptance (DA) (Algorithm 3) [cf. 63, 12, 20, 21, 45, 65].

We require that almost surely the support condition

$$\hat L^{(0)}(\theta) > 0 \quad \text{whenever} \quad \mathrm{pr}(\theta)\, \hat L^{(\infty)}(\theta) > 0 \tag{17}$$

holds, so that the resulting weight in Algorithm 3(ii) is guaranteed to be well-defined. This can always be achieved simply by choosing a positive regularisation constant (this will be done in Algorithm 6 given later, and is linked to ‘defensive importance sampling’ [55]), leading to asymptotically exact $\infty$-inference. We note that step (i) in DA (Algorithm 3) effectively acts as a screening stage: only ‘good’ proposals proceed to step (ii), where the expensive $\infty$-model PF must be run. The resulting DA estimator for the $\infty$-posterior is the same as that of PMMH, given in (14).
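The two-stage DA rule can be sketched on a two-point parameter space with deterministic likelihoods (an illustration of the screening idea only; Algorithm 3 itself uses PF-based estimators): stage (i) screens with the cheap approximation, stage (ii) corrects with the ratio of exact to approximate likelihoods, and the chain still targets the exact posterior.

```python
import random

def L_exact(theta):
    """'Expensive' likelihood, evaluated only for proposals passing stage (i)."""
    return 1.0 if theta == 0 else 3.0

def L_approx(theta):
    """Cheap approximate likelihood used for screening (toy values)."""
    return 1.0 if theta == 0 else 2.0

rng = random.Random(4)
theta, count1, iters = 0, 0, 200000
for _ in range(iters):
    prop = 1 - theta                                              # symmetric flip, uniform prior
    if rng.random() < min(1.0, L_approx(prop) / L_approx(theta)):  # stage (i): screen
        r = (L_exact(prop) / L_approx(prop)) / (L_exact(theta) / L_approx(theta))
        if rng.random() < min(1.0, r):                             # stage (ii): correct
            theta = prop
    count1 += theta
freq = count1 / iters
# DA leaves the exact posterior invariant: freq estimates P(theta = 1) = 3/4
```

A quick detailed-balance check for this toy chain: the move 0 to 1 is accepted with probability 1, and 1 to 0 with probability (1/2)(2/3) = 1/3, balancing the stationary probabilities 1/4 and 3/4.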

As an alternative to PMMH/DA, we consider MCMC-IS (Algorithm 4) [cf. 28, 41, 42, 52, 77, A].

Assuming the Phase 1 chain is Harris ergodic and the support condition (17) holds, like the PMMH/DA estimator, the MCMC-IS estimator is strongly consistent [A, Thm. 3]: for $f \in L^1(\pi^{(\infty)})$,

$$E_m^{\mathrm{IS}}(f) \xrightarrow{\;m \to \infty\;} \pi^{(\infty)}(f), \qquad \text{almost surely.}$$

Phase 1 of MCMC-IS (Algorithm 4) implements a PMMH (Algorithm 2) targeting marginally

$$\pi_m^{(0)}(\theta) \propto \mathrm{pr}(\theta)\, L^{(0)}(\theta).$$

Phase 2 consists of independent calls of PF (Algorithm 1), and is therefore completely parallelisable, unlike DA (Algorithm 3). This allows for the possibility of substantial additional speedup on a parallel computer [cf. 62].
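The two phases can be sketched end-to-end on a two-point parameter space (a toy illustration; Algorithm 4 uses a PMMH chain and PF-based weights): Phase 1 runs a chain on the approximate posterior, and Phase 2 reweights its output with the ratio of exact to approximate likelihoods.

```python
import random

def L_exact(theta):
    """'Expensive' likelihood, used only in Phase 2 (parallelisable)."""
    return 1.0 if theta == 0 else 3.0

def L_approx(theta):
    """Cheap approximate likelihood targeted by the Phase 1 chain."""
    return 1.0 if theta == 0 else 2.0

rng = random.Random(5)

# Phase 1: Metropolis-Hastings chain targeting pi^(0) proportional to L_approx
theta, chain = 0, []
for _ in range(100000):
    prop = 1 - theta                                   # symmetric flip, uniform prior
    if rng.random() < min(1.0, L_approx(prop) / L_approx(theta)):
        theta = prop
    chain.append(theta)

# Phase 2: IS weights w = L_exact / L_approx correct to the exact posterior;
# each weight depends on one state only, so this loop is embarrassingly parallel
w = [L_exact(t) / L_approx(t) for t in chain]
est = sum(wi * t for wi, t in zip(w, chain)) / sum(w)  # self-normalised IS estimate
# est estimates the exact posterior mean E[theta] = P(theta = 1) = 3/4
```

In this toy case the approximate posterior puts mass 2/3 on $\theta = 1$, and the weights $w(0) = 1$, $w(1) = 3/2$ tilt the average back to the exact value 3/4.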

We remark on an acceleration technique known as ‘early rejection’ [97] for Metropolis–Hastings, which can sometimes be employed if the likelihood takes a special form, described below.121212A similar idea of early cancellation as ‘early rejection’ has been used previously in the exact simulation literature, under the name of ‘retrospective simulation’ [14]. The acceleration technique also applies to DA step (i) and MCMC-IS Phase 1. The form required in [97] is that the likelihood can be written, for example, as

\[
L(\theta) = \prod_{k=1}^{n} \ell_k(\theta),
\]

with $\ell_k(\theta) \le 1$. In this case, because the likelihood only gets smaller as more components of the product are computed, the calculation of the components can be stopped and the proposal rejected early in acceptance probability (15) for DA and MCMC-IS, as soon as the partially computed acceptance probability in (15) becomes smaller than the uniformly generated random variable [cf. 97, Sect. 4]. The ‘early rejection’ trick requires a special form for the likelihood, however, and is therefore not always applicable.
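A minimal sketch of the early-rejection test, assuming the product form above with factors $\ell_k(\theta) \le 1$; the interface (a list of log-factors for the proposal, a log prior ratio, and the current log-likelihood) is illustrative only.

```python
import numpy as np

def early_reject_accept(log_factors_prop, log_prior_ratio, log_like_current, rng):
    """Early-rejection accept/reject decision for a product-form likelihood
    with factors l_k <= 1 (so log factors are <= 0 and partial sums decrease).

    Returns (accepted, number of factors actually evaluated): as soon as the
    partial log acceptance ratio drops below the pre-drawn log-uniform, the
    remaining factors need never be computed."""
    log_u = np.log(rng.uniform())
    acc = log_prior_ratio - log_like_current   # upper bound on the log ratio
    for k, lf in enumerate(log_factors_prop):  # lf = log l_k(theta'), lazily used
        acc += lf
        if acc < log_u:
            return False, k + 1                # rejected early
    return True, len(log_factors_prop)
```

The decision coincides with the full Metropolis–Hastings test using the same uniform draw, only the amount of likelihood computation differs.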

### c.2. The question of relative efficiency

The delayed acceptance and importance sampling corrections are two acceleration alternatives to standard PMMH, both of which use the same approximation and algorithmic ingredients. The question of choosing between such alternatives has been discussed before [21] in the simpler setting of Metropolis–Hastings, without unbiased estimators. Article [A] introduces the IS correction in the general case of unbiased estimators in both Phase 1 and Phase 2, and seeks to compare MCMC-IS with DA in this general setting.

A numerical comparison of the methods is carried out in [A], where the MCMC-IS approach was found to work slightly better than DA in experiments on SSMs, even without parallelisation. As an example of a computationally intensive experiment, a stochastic volatility model was considered, with observations consisting of real data from daily financial index returns spanning two decades. Laplace approximations were used to approximate the likelihood, and were used as well in the IS correction, namely, as the approximation to the optimal choice131313as discussed in Section B.3 of Feynman–Kac model for the smoothing problem. With all methods making intelligent use of the Laplace approximations, MCMC-IS performed significantly better than PMMH or DA in the experiment.

In addition to the experiments, several further enhancements were suggested in [A] which would improve the computational efficiency of MCMC-IS in practice, relative to DA acceleration of PMMH, even further. For example, the Phase 2 IS weights need not be calculated during the burn-in phase141414Additionally, debiasing tricks [cf. 43] may be effectively and efficiently used. or for thinned-out samples of the chain151515Thinning [cf. 76] denotes the procedure in which only every $k$th sample of the Markov chain is kept, in order to decrease the autocorrelation of the stored samples., nor for repeated samples of the chain if the jump chain161616the chain formed from the original chain, consisting only of the accepted states of the original chain [cf. 31, A] is used. Moreover, as previously mentioned, Phase 2 admits a straightforward parallelisation of the calculation of the more expensive IS weights, which significantly increases the scalability and efficiency of MCMC-IS.
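The jump-chain bookkeeping is simple enough to sketch: collapse the trajectory to its distinct consecutive states with holding times, compute one (expensive) weight per distinct state, and reweight by the holding times. The helper names below are illustrative, not from [A].

```python
import numpy as np

def jump_chain(chain):
    """Collapse an MCMC trajectory into its 'jump' representation:
    distinct consecutive states and their holding times, so expensive
    IS weights are computed once per accepted state only."""
    states, holds = [chain[0]], [1]
    for x in chain[1:]:
        if x == states[-1]:
            holds[-1] += 1
        else:
            states.append(x)
            holds.append(1)
    return np.array(states), np.array(holds)

def is_estimate_jump(states, holds, weight_fn, f):
    """Self-normalised IS estimate from the jump chain: the holding
    times reweight each distinct state back to its time share."""
    w = holds * np.array([weight_fn(s) for s in states])
    return np.sum(w * f(states)) / np.sum(w)
```

With constant weights the estimator reduces to the ordinary ergodic average over the original trajectory, as it should.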

### c.3. Peskun and covariance orderings of asymptotic variances

An estimator $E_m(f)$ is said to satisfy a central limit theorem (CLT) if

\[
\sqrt{m}\,\bigl[E_m(f) - \pi^{(\infty)}(f)\bigr] \xrightarrow{\,m\to\infty\,} N\bigl(0, \sigma^2(f)\bigr), \quad \text{in distribution}.
\]

In this case, $\sigma^2(f)$ is called the asymptotic variance of the estimator.
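In practice the asymptotic variance is estimated from MCMC output; a standard output-analysis sketch is the batch-means estimator, shown below (not a method from the articles under discussion, just a common baseline).

```python
import numpy as np

def batch_means_avar(fx, n_batches=64):
    """Batch-means estimate of the CLT asymptotic variance sigma^2(f).

    Split f(X_1), ..., f(X_m) into consecutive batches of length b;
    then b times the sample variance of the batch means estimates
    sigma^2(f), since each batch mean has variance ~ sigma^2(f) / b."""
    fx = np.asarray(fx)
    m = len(fx) - len(fx) % n_batches       # drop the ragged tail
    batches = fx[:m].reshape(n_batches, -1).mean(axis=1)
    b = m // n_batches
    return b * np.var(batches, ddof=1)
```

For i.i.d. draws the estimator simply recovers the marginal variance; for a correlated chain it also picks up the autocovariance contribution.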

Without taking into account the computational factors previously mentioned (which generally support the use of MCMC-IS; see also Section D.6), and considering just the statistical efficiency of the estimators in terms of the asymptotic variance, it was found in [B], through artificially constructed toy examples, that either MCMC-IS or PMMH/DA may do arbitrarily better than the other. Moreover, the examples seemed to indicate that MCMC-IS might do better in cases of practical interest, with multi-modal targets, a phenomenon remarked upon previously for MCMC-IS and Metropolis–Hastings [e.g. 41]. Proving that the IS acceleration is usually ‘better’ than DA is of course a separate matter, which cannot be done based on experiments or examples alone.

We first introduce some notation and terminology. A Markov kernel $K$ on $(\mathsf{X}, \mathcal{X})$ is said to be reversible with respect to a probability measure $\mu$ if, for all $A, B \in \mathcal{X}$,

\[
\int \mu(\mathrm{d}x)\, K(x, \mathrm{d}y)\, \mathbf{1}\{x \in A,\, y \in B\} = \int \mu(\mathrm{d}y)\, K(y, \mathrm{d}x)\, \mathbf{1}\{x \in A,\, y \in B\}.
\]

We also define the Dirichlet form

\[
\mathcal{E}_K(g) := \langle g, (I - K)g \rangle_\mu
\]

for $g \in L^2(\mu)$, where $(Kg)(x) := \int K(x, \mathrm{d}y)\, g(y)$ and $\langle g, h \rangle_\mu := \int g(x)\, h(x)\, \mu(\mathrm{d}x)$.

The famous Peskun ordering [78, 99] says that if

\[
K(x, A \setminus \{x\}) \ge L(x, A \setminus \{x\}) \quad \text{for } \mu\text{-almost every } x \in \mathsf{X},\ \forall A \in \mathcal{X}, \tag{19}
\]

where $K$ and $L$ are two Markov kernels, both reversible with respect to a probability measure $\mu$, then

\[
\sigma^2_K(f) \le \sigma^2_L(f) \quad \forall f \in L^2(\mu), \tag{20}
\]

where $\sigma^2_K(f)$ and $\sigma^2_L(f)$ denote the asymptotic variances of the $K$ and $L$ chains, respectively.

Consider next a popular Peskun-type comparison result for asymptotic variances of reversible chains, known as the covariance ordering171717Though not mentioned by name, it was shown already in [99, Proof of Lem. 3] that the Peskun ordering is equivalent to the ‘covariance’ ordering. [71]: if $K$ and $L$ are two Markov kernels, both reversible with respect to a probability measure $\mu$, and if

\[
\mathcal{E}_K(g) \ge \mathcal{E}_L(g) \quad \forall g \in L^2(\mu), \tag{21}
\]

then

\[
\sigma^2_K(f) \le \sigma^2_L(f) \quad \forall f \in L^2(\mu). \tag{22}
\]

Compared to the Peskun ordering, the covariance ordering can be more useful in practice, as the criterion can better distinguish between chains on general state spaces. For example, some kernels place no mass on the diagonal, in which case (19) may be uninformative, but (21) may still distinguish between such chains [cf. 71, 72].
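On a finite state space the implication (21) $\Rightarrow$ (22) can be checked numerically; below is a small sketch on a three-state toy chain, where `stationary`, `dirichlet_form` and `asymp_var` are illustrative helpers and the asymptotic variance is computed exactly via the fundamental matrix $Z = (I - P + \mathbf{1}\mu^\top)^{-1}$.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of a finite-state kernel (left eigenvector)."""
    evals, evecs = np.linalg.eig(P.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return mu / mu.sum()

def dirichlet_form(P, mu, g):
    """E_P(g) = (1/2) sum_{x,y} mu(x) P(x,y) (g(x) - g(y))^2."""
    return 0.5 * np.sum(mu[:, None] * P * (g[:, None] - g[None, :]) ** 2)

def asymp_var(P, mu, f):
    """CLT asymptotic variance <fbar, (2Z - I) fbar>_mu of a reversible
    finite chain, with Z the fundamental matrix."""
    n = len(mu)
    fbar = f - mu @ f
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), mu))
    return fbar @ np.diag(mu) @ (2.0 * Z - np.eye(n)) @ fbar
```

A lazy version $L = \tfrac12 I + \tfrac12 K$ of a kernel $K$ halves every Dirichlet form, so (21) holds with the roles as written and the asymptotic variance of $K$ should come out no larger than that of $L$.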

As a simple application of the covariance ordering, let us consider the case of PMMH and DA, which are both reversible with respect to the same invariant measure (see [12] or Section C.5). Using the identity

\[
\mathcal{E}_L(g) = \frac{1}{2} \int \mu(\mathrm{d}x)\, L(x, \mathrm{d}y)\, \bigl(g(x) - g(y)\bigr)^2,
\]

which holds for any $\mu$-reversible kernel $L$, and the fact that the product of the acceptance probabilities (15) and (16) in DA (Algorithm 3) is less than or equal to the acceptance probability (13) in PMMH (Algorithm 2), it can be shown [cf. 12] that the covariance ordering implies

\[
\sigma^2_{\mathrm{PM}}(f) \le \sigma^2_{\mathrm{DA}}(f).
\]
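The key acceptance-probability inequality is elementary and can be checked numerically: for any positive ratios $r$ (exact) and $r_0$ (approximate, used in stage (i)), the two-stage product $\min(1, r_0)\min(1, r/r_0)$ never exceeds the single-stage $\min(1, r)$. The variables below are generic stand-ins, not tied to a particular model.

```python
import numpy as np

# Sample arbitrary positive acceptance ratios and compare the two-stage
# DA acceptance probability with the single-stage Metropolis-Hastings one.
rng = np.random.default_rng(0)
r = rng.lognormal(sigma=2.0, size=10_000)    # exact posterior ratio
r0 = rng.lognormal(sigma=2.0, size=10_000)   # approximate stage-(i) ratio
alpha_mh = np.minimum(1.0, r)
alpha_da = np.minimum(1.0, r0) * np.minimum(1.0, r / r0)
assert np.all(alpha_da <= alpha_mh + 1e-12)
```

A case check confirms the inequality exactly: when both stage ratios are below one the product equals $r = \min(1, r)$, and in every other case the product is bounded by one of the factors, each of which is at most $\min(1, r)$ times a quantity at most one.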

### c.4. Peskun type ordering for importance sampling correction

Article [B] is concerned with extending the covariance ordering to chains $K$ and $L$ reversible with respect to probability measures $\Pi^{(0)}$ and $\Pi^{(\infty)}$, where $\Pi^{(0)}$ and $\Pi^{(\infty)}$ may be different.

Suppose then that $K$ and $L$ are Harris ergodic chains on a space $\mathsf{X}$, where $K$ is $\Pi^{(0)}$-reversible and $L$ is $\Pi^{(\infty)}$-reversible. Suppose further that the Radon–Nikodým derivative181818This is the function $w$ satisfying $\Pi^{(\infty)}(A) = \int_A w(x)\, \Pi^{(0)}(\mathrm{d}x)$ for all measurable $A$.

\[
w(x) := \frac{\mathrm{d}\Pi^{(\infty)}}{\mathrm{d}\Pi^{(0)}}(x)
\]

exists. Let $\underline{c}, \bar{c} > 0$ be constants such that

\[
\underline{c}\, \mathcal{E}_K(g) \le \mathcal{E}_L(g) \le \bar{c}\, \mathcal{E}_K(g), \qquad \underline{c} \le w(x) \le \bar{c},
\]

for all $g \in L^2$ and $x \in \mathsf{X}$. Then [B, Thm. 2], for all $f$ with