The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data

The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data

\fnmsMurray \snmPollock\thanksrefe1    label=e1 [    mark]m.pollock@warwick.ac.uk    \fnmsPaul \snmFearnhead\thanksrefe2    label=e2 [    mark]p.fearnhead@lancs.ac.uk    \fnmsAdam M. \snmJohansen\thanksrefe3 label=e3 [    mark]a.m.johansen@warwick.ac.uk    \fnmsGareth O. \snmRoberts\thanksrefe4 label=e4 [    mark]gareth.o.roberts@warwick.ac.uk University of Warwick
Abstract

This paper introduces a class of Monte Carlo algorithms which are based upon simulating a Markov process whose quasi-stationary distribution coincides with a distribution of interest. This differs fundamentally from, say, current Markov chain Monte Carlo methods in which we simulate a Markov chain whose stationary distribution is the target. We show how to approximate distributions of interest by carefully combining sequential Monte Carlo methods with methodology for the exact simulation of diffusions. Our methodology is particularly promising in that it is applicable to the same class of problems as gradient based Markov chain Monte Carlo algorithms but entirely circumvents the need to conduct Metropolis-Hastings type accept/reject steps whilst retaining exactness: we have theoretical guarantees that we recover the correct limiting target distribution. Furthermore, this methodology is highly amenable to big data problems. By employing a modification to existing naïve sub-sampling and control variate techniques we can obtain an algorithm which is still exact but has sub-linear iterative cost as a function of data size.

\kwd
\startlocaldefs\endlocaldefs\runtitle

Scalable Langevin Exact Algorithm (ScaLE)

{aug}

and

\printead

e1,e2
\printead*e3,e4

Control variates \kwdImportance sampling \kwdKilled Brownian motion\kwdLangevin diffusion \kwdMarkov chain Monte Carlo \kwdQuasi-stationarity \kwdSequential Monte Carlo

\TMResetAll

1 Introduction

Advances in methodology for the collection and storage of data have led to scientific challenges and opportunities in a wide array of disciplines. This is particularly the case in Statistics as the complexity of appropriate statistical models and those we wish to fit often increases with data size. Many current state-of-the-art statistical methodologies have algorithmic cost that scales poorly with increasing volumes of data. As noted by [34], ‘many statistical procedures either have unknown runtimes or runtimes that render the procedure unusable on large-scale data’ and has resulted in a proliferation in the literature of methods ‘…which may provide no statistical guarantees and which in fact may have poor or even disastrous statistical properties’.

This is particularly keenly felt in computational and Bayesian statistics, in which the standard computational tools are Markov chain Monte Carlo (MCMC), Sequential Monte Carlo (SMC) and their many variants (see for example [51]). MCMC methods are exact in the (weak) sense that they construct Markov chains which have the correct limiting distribution. Although MCMC methodology has had considerable success in being applied to a wide variety of substantive areas, they are not well-suited to this new era of ‘big data’ as their computational cost will increase at least linearly with the number of data points:

  1. Most MCMC algorithms require at least one likelihood evaluation per iteration. For example, Metropolis-Hastings algorithms require likelihood evaluations in order to perform the accept/reject step. Given a data set of size , this will be at best an calculation. For algorithms which avoid accept/reject steps, such as Gibbs samplers, alternative calculations such as the evaluation of suitable sufficient statistics are required.

  2. A common approach in MCMC for missing data and latent variable models is data augmentation [55]. In the context of MCMC for large data sets this approach increases the dimensionality of the state space, and in some cases this increase will be . As we demonstrate in Section 5.2 we are able to circumvent this issue.

Naturally, owing to the previous successes of MCMC, there has been considerable effort to modify existing methodology in order to address computational scalability – and the motivation behind the work presented in this paper is on developing Monte Carlo methods that are still exact in the same sense as MCMC but that have a have a computational cost that is sub-linear in the number of data points.

To date the success of methods that aim to adapt MCMC so as to reduce its algorithmic cost has been mixed, and has invariably led to a compromise on exactness — such methodologies generally construct a stochastic process with limiting distribution which is (at least hopefully) close to the desired target distribution. Broadly speaking these methods can be divided into three classes of approach: ‘Divide-and-conquer’ methods; ‘Exact Sub-sampling’ methods; and, ‘Approximate Sub-sampling’ methods. Each of these approaches has its own strengths and weaknesses, and we will briefly review these in the following paragraphs.

Divide-and-conquer methods (for instance, [44, 58, 52, 42]) begin by splitting the data set into a large number of smaller data sets (which may or may not overlap). Inference is then conducted on these smaller data sets and resulting estimates are in some manner combined. A clear advantage of such an approach is that inference on each small data set can be conducted independently, and in parallel, and so if one had access to a large cluster of computing cores then the computational cost could be significantly reduced. Interaction between the cores is typically impracticable as the primary bottleneck is not computational speed, but latency. Therefore, from a parallelisation perspective this approach is perhaps the most appealing. The primary weakness of these methods is that the recombination of the separately conducted inferences is inexact. All current theory is asymptotic in the number of data points, [44, 39]. For these asymptotic regimes the posterior will tend to a Gaussian distribution [33], and it is questionable whether divide-and-conquer methods offer an advantage over simple approaches such as a Laplace approximation to the posterior [6]. Most results on convergence rates (e.g. [53]) have rates that are of the , where is the number of data-points in each sub-set. As such they are no stronger than convergence rates for analysing just a single batch. One exception is in [39], where convergence rates of are obtained, albeit under strong conditions. However, these results only relate to estimating marginal posterior distributions, rather than the full posterior.

Sub-sampling methods are designed so that each iteration requires access to only a subset of the data. Exact approaches in this vein typically require subsets of the data of random size at each iteration. One approach is to construct unbiased estimators of point-wise evaluations of the target density using subsets of the data, which could then be embedded within the pseudo-marginal MCMC framework recently developed by [2]. Unfortunately, the construction of such positive unbiased estimators is not possible in general [31] and such methods often require both bounds on, and good analytical approximations of, the likelihood [41].

More promising practical results have been obtained by approximate sub-sampling approaches. These methods use subsamples of the data to estimate quantities such as acceptance probabilities [45, 38, 5], or the gradient of the posterior, that are used within MCMC algorithms. These estimates are then used in place of the true quantities. Whilst this can lead to increases in computational efficiency, the resulting algorithms no longer target the true posterior. The most popular of these algorithms is the stochastic gradient Langevin dynamics algorithm of [59]. This approximately samples a Langevin diffusion which has the posterior as its stationary distribution. To do this requires first approximating the continuous-time diffusion by a discrete-time Markov process, and then using sub-sampling estimates of the gradient of the posterior within the dynamics of this discrete-time process. This idea has been extended to approximations of other continuous-time dynamics that target the posterior [1, 15, 40].

Within these sub-sampling methods it is possible to tune the subsample size, and sometimes the algorithm’s step-size, so as to control the level of approximation. This leads to a trade-off, whereby increasing the computational cost of the algorithm can lead to samplers that target a closer approximation to the the true posterior. There is also substantial theory quantifying the bias in, say, estimates of posterior means, that arise from these methods [56, 57, 14, 30], and how this depends on the subsample size and step-size. However, whilst they often work well in practice it can be hard to know just how accurate the results are for any given application. Furthermore, many of these algorithms still have a computational cost that increases linearly with data size [6, 43, 4].

The approach to the problem of big data we propose is a significant departure from the current literature. Rather than building our methodology upon the stationarity of appropriately constructed Markov chains, we develop a novel approach based on the quasi-limiting distribution of suitably constructed stochastically weighted diffusion processes. A quasi-stationary distribution for a Markov process with respect to a Markov stopping time is the limit of the distribution of as [18], and is completely unrelated to the popular area of Quasi-Monte Carlo. These Quasi-Stationary Monte Carlo (QSMC) methods which we develop can be used for a broad range of Bayesian problems (of a similar type to MCMC) and exhibit interesting and differing algorithmic properties. The QSMC methods we develop are exact in the same (weak) sense of MCMC, in that we have the correct (quasi-)limiting distribution. There are a number of different possible implementations of the theory which open up interesting avenues for future research, in terms of branching processes, by means of stochastic approximation methods, or (as outlined in this paper) SMC methods. One particularly interesting difference between our class of Monte Carlo algorithms and MCMC is that QSMC methods allow us to circumvent entirely the Metropolis-Hastings type accept/reject steps, while still retaining theoretical guarantees that the correct limiting target distribution is recovered. In the case of big data problems, this removes one of the fundamental bottlenecks in computation.

Our Quasi-Stationary Monte Carlo methods can be applied in big data contexts by using a novel sub-sampling approach. We call the resulting algorithm the Scalable Langevin Exact Algorithm (ScaLE). The name refers to the ‘Langevin’ diffusion which is used in the mathematical construction of the algorithm, although we should emphasise that it is not explicitly used in the algorithm itself. As we show in Section 4, our approach to sub-sampling can potentially decrease the computational complexity of each iteration of QSMC to be . Furthermore, for a rejection sampler implementation of QSMC, the use of sub-sampling introduces no additional error – as the rejection sampler will sample from the same stochastic process, a killed Brownian motion, regardless of whether we use sub-sampling or not. There can be a computational cost of using sub-sampling, as the number of iterations needed to simulate the killed Brownian motion for a given time interval will increase. However, we show that by using control variates [6] to reduce the variability of sub-sampling estimators of features of the posterior the increase in the number of iterations will be by a factor that is . Constructing the control variates involves a pre-processing step whose cost is , but after this pre-processing step the resulting cost of ScaLE per effective sample size can be . The importance of using control variates to get a computational cost that is sub-linear in is consistent with other recent work on scalable Monte Carlo methods [30, 10, 49, 27, 43, 4].

The next section presents our main result that motivates our development of quasi-stationary Monte Carlo. The following sections then develop how we can implement QSMC algorithms in practice, and how and why they are amenable to use with sub-sampling ideas. For clarity of presentation we have suppressed much of the technical and algorithmic detail, but this can be found in the appendices.

2 Quasi-stationary Monte Carlo

Given a target density on , traditional (i.e. Metropolis-Hastings type) MCMC proposes at each iteration from Markov dynamics with proposal density , ‘correcting’ its trajectory by either accepting the move with probability

(1)

or rejecting the move and remaining at state . In quasi-stationary Monte Carlo, rather than rejecting a move and staying at , the algorithm kills the trajectory entirely, according to probabilities which relate to the target density.

If we simulate such a Markov process with killing, then eventually the process will die. Thus it is natural to describe the long-term behaviour of the process through its conditional distribution given that the process is still alive. The limit of this distribution is called the quasi-stationary distribution (see, for example, [18]). The idea of quasi-stationary Monte Carlo is to construct a Markov process whose quasi-stationary distribution is the distribution, , which we wish to sample from. We will then use simulations from such a process to give us samples suitable for approximating , in a similar manner as we would do in MCMC.

Although in principle QSMC can be used with any Markov process, within this paper we shall work exclusively with killed Brownian motion as it has a number of convenient properties that we can exploit. Therefore let denote -dimensional Brownian motion initialised at . Suppose denotes a non-negative hazard rate at which the Brownian motion is killed when it is in state , and let be the killing time itself. Finally we define

(2)

the distribution of given that it has not yet been killed. The limit of this distribution as is the quasi-stationary distribution of our killed Brownian motion.

We are interested in choosing in such a way that converges to . To this end, we introduce the function

(3)

where denotes the usual Euclidean norm and the Laplacian. By further imposing the condition

Condition 1 (). 

There exists a constant such that .

then we can establish the following:

Theorem 1. 

Under the regularity conditions (22) and (23) in Appendix A, suppose that Condition 1 holds and set

(4)

then it follows that converges in and pointwise to .

Proof.

See Appendix A. ∎

Note that the regularity conditions in Appendix A are largely technical smoothness and other weak regularity conditions common in stochastic calculus. On the other hand Condition 1 is necessary for us to be able to construct quasi-stationary Monte Carlo methods. However, since non-pathological densities on are generally in fact convex in the tails, by using the second identity in (3), then Condition 1 is almost always satisfied in real examples.

Theorem 1 can be exploited for statistical purposes by noting that for a sufficiently large , we can assume for . Thus if we can simulate from for , this would give us an (approximate) sample from . This is analogous to MCMC, with being the burn-in period; the only difference is that we need to simulate from the distribution of the process conditional upon it not having died.

The next two sections describe how we can simulate from . Firstly we describe how we can simulate the killed Brownian motion process exactly in continuous-time. A naïve approach to sample from , is to simulate independent realisations of this killed Brownian motion, and use the values at time of those processes which have not yet died by time . In practice this is impracticable, as the probability of survival will, in general, decay exponentially with . To overcome this we use sequential Monte Carlo methods.

Both these two steps introduce additional challenges not present within standard MCMC. Thus a natural question is: why would we want to use quasi-stationary Monte Carlo at all? We address this in Section 4 where we show that we can simulate the killing events using just subsamples of data. We can in fact use subsamples of size 2 without introducing any approximation into the dynamics of the killed Brownian motion.

3 Implementing QSMC

3.1 Simulating Killed Brownian Motion

Theorem 1 enables us to relate a target distribution we wish to sample from to the quasi-stationary distribution of killed Brownian motion. To be able to simulate from this quasi-stationary distribution we need to be able to simulate from killed Brownian motion. Here we describe how to do this.

To help get across the main ideas we will first consider the case where the killing rate, , is bounded above by some constant, say. In this case we can use thinning (see, for example, [36]) to simulate the time at which the process will die. This involves simulating the Brownian motion independently of a Poisson process with rate . Each event of the Poisson process is a potential death event, and we can then simulate whether or not the death occurs. For an event at time the probability that death occurs depends on the state of the Brownian motion at time , and is equal to . Thus to simulate the killed Brownian motion to time we would first simulate all events in our Poisson process to time . We then consider the events in time-order, simulate the Brownian motion at the first event-time and simulate whether death occurs. If it does not, we move to the next event-time. This is repeated until either the process dies or the process has survived the last potential death event in . If the latter occurs we can then simulate the Brownian motion at time .

This can be viewed as a rejection sampler to simulate from , the distribution of the Brownian motion at time conditional on it surviving to time . Any realisation that has been killed is ‘rejected’ and a realisation that is not killed is a draw from . We can easily construct an importance sampling version of this rejection sampler. Assume there are events in our Poisson process before time , and these occur at times . We simulate the Brownian motion path at each event time and at time . The output of our importance sampler is the realisation at time , , together with an importance sampling weight that is equal to the probability of the path surviving each potential death event,

If we have a positive lower bound on the killing rate, so for all , then we can improve the computational efficiency of the rejection sampler by splitting the death process into a death process of rate and one of rate . Actual death occurs at the first event in either of these processes. The advantage we have from this construction is that the former death process is independent of the Brownian motion. Thus we can first simulate whether or not death occurs in this process. If it does not we then simulate, using thinning as above, a killed Brownian motion with rate . The latter will have a lower intensity and thus be quicker to simulate. If we use the importance sampling version, instead, we will simulate events in a Poisson process of rate , say, and assign our realisation at time a weight

This is particularly effective as the is a constant which will cancel when we normalise the importance sampling weights.

3.2 Simulating Killed Brownian Motion using Local Bounds

The approach in Section 3.1 is not applicable if we cannot upper bound the killing rate. Even in situations where we can obtain an upper bound, it may be inefficient if this bound is large. We can overcome both these issues using local bounds on the rate. For this section we will work with the specific form of the killing rate in Theorem 1, namely . The bounds we will use will be in terms of bounds on .

Given an initial value for the Brownian motion, , define a hypercube which contains . In practice we define this cube to be centred on with a user-chosen side length (which may depend on ). Denote the hypercube by , and assume that we can find an upper and lower bound, and respectively, for with . We can use the thinning idea of the previous section to simulate the killed Brownian motion whilst the process stays within . Furthermore we are able to simulate the time at which the Brownian motion first leaves and the value of the process when this happens (see Appendix C). Thus our approach is to use our local bounds on , and hence on the killing rate, to simulate the killing process while remains in . If the process leaves before we then need to define a new hypercube, say, obtain new local bounds on for and repeat simulating the killing process using these new bounds until the process either first leaves the hypercube or we reach time .

We now fill in the details of this approach, describing the importance sampling version which we use later — though a rejection sampler can be obtained using similar ideas. Our first step is to calculate the hypercube, , and the bounds , . We then simulate the time and position at which first leaves . We call this the layer information, and denote it as . The notion of a layer for diffusions was formally introduced in [48], and we refer the interested reader there for further details. Next we simulate the possible killing events on , by simulating events of a Poisson process of rate : say. We then simulate the values of our Brownian motion at these event times (the simulation of which is conditional on — see Appendix C.2 and Algorithm 5 for how this can be done). We calculate an incremental importance sampling weight for this segment of time as

(5)

If we then repeat this process with a hypercube centred on , and continue to repeat until we have simulated the process to time . This gives successive iterated weights . We output a simulated value for the Brownian motion at time , again simulated conditional on the layer information for the current segment of time, and an importance sampling weight that is the product of the incremental weights associated with each segment of time. At time , we have computed incremental weights and have cumulative weight

(6)

Full algorithmic detail of the description above are given in Algorithm 1. In practice every sample will have an importance weight that shares a common constant of in (6). As such it is omitted from Algorithm 1 and the weights are asterisked to denote this. It is straightforward to prove that this approach gives valid importance sampling weights in the following sense.

Theorem 2. 

For each we have

Proof.

First note that by direct calculation of its Doob-Meyer decomposition conditional on , is a martingale, see for example [50]. Therefore and the result follows. ∎

  1. Initialise: Input initial value , and time interval length . Set , , , .

  2. : Choose hypercube and calculate , . Simulate layer information as per Appendix C, obtaining .

  3. : Simulate .

  4. : Set and .

  5. : Set .

  6. : Simulate as per Appendix C.2 and Algorithm 5.

  7. : If then output and . Otherwise, if , set , and return to Step 2. Else set and return to Step 3.

Algorithm 1 Importance Sampling Killed Brownian Motion (IS-KBM) Algorithm

3.3 Simulating from the Quasi-stationary Distribution

In theory we can use our ability to simulate from , using either rejection sampling or importance sampling, to simulate from the quasi-stationary distribution of our killed Brownian motion. We would need to specify a ‘burn-in’ period of length say, as in MCMC, and then simulate from . If is chosen appropriately these samples would be draws from the quasi-stationary distribution. Furthermore we can propagate these samples forward in time to obtain samples from for , and again these would, marginally, be draws from the quasi-stationary distribution.

However, in practice this simple idea is unlikely to work. We can see this most clearly with the rejection sampler, as the probability of survival will decrease exponentially with — and thus the rejection probability will often be prohibitively small.

There have been a number of suggested approaches to overcome the inefficiency of this naïve approach to simulating from a quasi-stationary distribution (see for example [19, 29], and the recent rebirth methodology of [11]). Our approach is to use ideas from sequential Monte Carlo. In particular, we will discretise time into intervals of length for some chosen and . Defining for , we use our importance sampler to sample times from ; this will give us particles, that is realisations of , and their associated importance sampling weights. We normalise the importance sampling weights, and calculate the variance of these normalised weights at time . If this is sufficiently large we resample the particles, by simulating times from the empirical distribution defined by the current set of weighted particles. If we resample, we assign each of the new particles a weight .

We then take our set of weighted particles at time and propagate them to get a set of weighted particles at time . The new importance sampling weights are just the weights at time , prior to propagation, multiplied by the (incremental) importance sample weight we calculate when propagating the particle from time to . The above resampling procedure is applied, and this whole iteration is repeated until we have our weighted particles at time . This approach is presented as the Quasi-Stationary Monte Carlo (QSMC) algorithm in Algorithm 2 in which is the effective sample size of the weights [37], a standard way of monitoring the variance of the importance sampling weights within sequential Monte Carlo, and is a user chosen threshold which determines whether we resample. The algorithm outputs the weighted particles at the end of each iteration.

  1. Initialisation Step ()

    1. Input: Starting distribution, , number of particles, , and set of times .

    2. : For in to simulate and set .

  2. Iterative Update Steps ( while )

    1. : If then for in to resample , the empirical distribution defined by the current set of weighted particles, and set .

    2. For in to ,

      1. : Simulate along with un-normalised weight increment as per Algorithm 1.

      2. : Calculate un-normalised weights, .

    3. : For in to set .

    4. : Set .

Algorithm 2 Quasi-Stationary Monte Carlo Algorithm (QSMC) Algorithm.


Given the output from Algorithm 2, we can then estimate the target distribution . We first choose a burn-in time, , such that we believe we have achieved convergence to the quasi-stationary distribution of killed Brownian motion. Our approximation to the law of the killed process is then simply the weighted occupation measures of the particle trajectories in the interval . More precisely, using the output of the QSMC algorithm.

(7)

4 Sub-sampling

We now return to the problem of sampling from the posterior in a big-data setting. We will assume we can write the target posterior as

(8)

where is the prior and are likelihood terms. Note that to be consistent with our earlier notation refers to the parameters in our model. The assumption of this factorisation is quite weak and includes many classes of models exhibiting various types of conditional independence structure.

We can sample from this posterior using Algorithm 2 by choosing , and hence , which determines the death rate of the killed Brownian motion, as defined in (3) and (4) respectively. In practice this will be computationally prohibitive as at every potential death event we determine acceptance by evaluating , which involves calculating derivatives of the log-posterior, and so requires accessing the full data set of size . However, it is easy to unbiasedly estimate using sub-samples of the data as the log-posterior is a sum over the different data-points. Here we show that we can use such an unbiased estimator of whilst still simulating the underlying killed Brownian motion exactly.

4.1 Simulating Killed Brownian Motion with an Unbiased Estimate of the Killing Rate

To introduce our approach we begin by assuming we can simulate an auxiliary random variable , and (without loss of generality) construct a positive unbiased estimator, , such that

(9)

We rely on the following simple result which is stated in a general way as it is of independent interest for simulating from events of probability which are expensive to compute, but that admit a straightforward unbiased estimator. Its proof is trivial and will be omitted.

Proposition 1. 

Let , and suppose that is a random variable with and almost surely. Then if then the event has probability .

We now adapt this result to our setting, noting that the randomness obtained by direct simulation of a -coin, and that using Proposition 1, is indistinguishable.

Recall that in Section 3.1 in order to simulate a Poisson process of rate we use Poisson thinning. Simply restating our approach: we first find for our Brownian motion trajectory, constrained to the hypercube , a constant such that we have . We then simulate a dominating Poisson process of rate to obtain potential death events, and then in sequence accept or reject each potential death event. Considering a single such event, occurring at time say, then this will be accepted as a death with probability .

Note that we could equivalently simulate a Poisson process of rate using a dominating Poisson process of higher rate . This is achieved by simply substituting for in the argument above. However, the penalty for doing this is an increase in the expected computational cost by a factor of – we would expect to have a larger number of potential death events, each of which will have a smaller acceptance probability.

Now, suppose for our unbiased estimator we can find some such that we have . Noting from (9) that we have an unbiased estimator of the probability of a death event in the above argument (i.e. ), and by appealing to Proposition 1, another (entirely equivalent) formulation of the Poisson thinning argument above is to use a dominating Poisson process of rate , and determine acceptance or rejection of each potential death event by simulating and accepting with probability (instead of ).

In the remainder of this section we exploit this extended construction of Poisson thinning (using an auxiliary random variable and unbiased estimator), to develop a scalable alternative to the QSMC approach we introduced in Algorithm 2. The key idea in doing so is to find an auxiliary random variable and unbiased estimator which can be simulated and evaluated without fully accessing the data set, while ensuring the increased number of evaluations necessitated by the ratio does not grow too severely.

4.2 Constructing a scalable replacement estimator

Noting from (3) and (4) that the selection of required to sample from a posterior is determined by , in this section we focus on finding a practical construction of a scalable unbiased estimator for . Recall that,

(10)

and that as per Algorithm 2, whilst we stay within our hypercube , we require and can find constants and such that . Now, as motivated by Section 4.1, we will construct an auxiliary random variable , an unbiased estimator such that

(11)

and determine constants and such that within the same hypercube we have . Further note that to ensure the validity of our QSMC approach, as justified by Theorem 1 in Section 3.3, we need to substitute Condition 1 with the following (similarly weak) condition:

Condition 2 (). 

There exists a constant such that .

To ensure practicality and scalability we (critically) need to focus on ensuring the ratio

(12)

does not grow too severely with the size of the data set (as this determines the factor increase in the rate, and hence increased inefficiency, of the dominating Poisson process required within Algorithm 2). To do this we develop a tailored control variate, of a similar type to that which has since been successfully used within our concurrent work on MCMC (see [10]).

To implement the control variate estimator we need to first find a point close to a mode of the posterior distribution , denoted by . In fact for our scaling arguments to hold, we require to be within of the true mode, and achieving this is a less demanding task than actually locating the mode. Moreover we note that this operation is only required to be done once, and not at each iteration, and so can be done fully in parallel. In practice a good suggestion is often to use a stochastic gradient optimisation algorithm to find a value close to the posterior mode, and we recommend then starting the simulation of our killed Brownian motion from this value, or from some suitably chosen distribution centred at this value. By doing this we substantially reduce the burn-in time of our algorithm.

To address scalability for multi-modal posteriors is a more challenging problem, out-with what we can address fully in this paper, but of significant interest for future work. We do however make the following remarks: In the case where we are within of a local mode, but not of the true mode, then we are not in the usual situation of posterior contraction, and we address this issue at the end of this section. In the case of multi-modality, but with the usual posterior contraction, then our scalability will still hold, but the computational cost may have additional constant factors. One possible solution for such a situation may be to note that we are not limited to a single control variate, but may have multiple, or indeed develop methodologies which refresh control variates as a by-product.

Remember that . Letting be the law of we have

(13)

As such, we can re-express as

(14)

where is a constant. Letting now be the law of we can construct the following unbiased estimator of ,

(15)

The construction of our estimator requires evaluation of the constants and . Although both are evaluations they only have to be computed once, and furthermore as previously mentioned can be calculated entirely in parallel.

Embedding our sub-sampling estimator described above within the QSMC algorithm of Section 3.3, we arrive at Algorithm 3 which we call the Scalable Langevin Exact algorithm (ScaLE). A similar modification could be made to the rejection sampling version, R-QSMC, which we discussed in Section 3.3 and detailed in Appendix F. We term this variant Rejection Scalable Langevin Exact algorithm (R-ScaLE) and provide full algorithmic details in Appendix G.

  1. Choose and compute , .

  2. On calling Algorithm 1,

    1. Replace in Step 2 with .

    2. Replace Step 7 with: : If , set , and return to Step 2. Else simulate , with , and set (where is defined as in (15)) and return to Algorithm 1 Step 3.

Algorithm 3 The ScaLE Algorithm (as per Algorithm 2 unless stated otherwise).

4.3 Bounding the Sub-sampling Killing Rate and the Complexity of the ScaLE algorithm

To implement the ScaLE algorithm requires the calculation of upper and lower bounds on . The major computational cost of the algorithm comes through re-weighting which occurs at times of a Poisson process which has (varying) intensity (as defined in 12). This section will demonstrate that (and hence the ratio ) does in fact typically scale as , so that the iterative cost of the algorithm is itself .

A general way to do this is to use global, or local, bounds on the second-derivatives of the log-likelihood for each datum. To simplify the following exposition we will assume a global bound, so that

(16)

for some , where represents the spectral radius. For smooth densities with Gaussian and heavier tails, the Hessian of the log-likelihood is typically uniformly bounded (in both data and parameter). In such cases we would expect to have such a global bound, and in fact should be constant in .

Recalling the layer construction of Section 3.2 for a single trajectory of killed Brownian motion, we can ensure that over any finite time interval we have , some hypercube. We shall set to be the centre of this hypercube.

In this section, we shall eventually assume that the posterior contracts at a rate . The regular case will correspond to the case where although we do not need to make any explicit assumptions about normality here. Thus the practitioner has complete freedom to choose , and it makes sense to choose this so that for some and for all .

We can bound both above and below if we can bound over all possible realisations of . To bound , we begin by first considering our elementary estimator in (13). By imposing our condition in (16) we have

(17)

We can proceed to find a bound for our estimator in (15) as follows (recalling )

(18)

Note that the last term on the right hand side can never be larger than the first one. Finally we can use the fact that to bound the terms in this expression.

We now directly consider the computational efficiency of ScaLE, and how it depends on . If we have contraction of the posterior variance at a rate then it is natural to expect we only need to run the killed Brownian motion for a time of to have sufficient opportunity to mix through the support of the target. Similar arguments have been used for scaling arguments of other Monte Carlo algorithms that use similar control variates, see for instance our concurrent work [10].

So for a fixed effective sample size (ESS) we will need to simulate a number of events in ScaLE (its complexity) that will be of order . Using the bound on that we have on our hypercube centred on , we have that whilst we remain within our hypercube,

(19)

Here we have assumed that at stationarity will be a draw from the support of the posterior, so that under our assumption of posterior contraction at the regular rate, then . We summarise this discussion in the following:

Proposition 2. 

Suppose that we have posterior contraction at rate , is and that for some . Then ScaLE is where

In particular, where , we obtain

and where additionally we obtain and that the iterative complexity of ScaLE is (1).

This result also illuminates the role played by in the efficiency of the algorithm. In the following discussion we shall assume that . It is worth noting that while a completely arbitrary starting value for might make an quantity leading to an iterative complexity of the algorithm which is . To obtain we simply require that be which gives considerable leeway for any initial explorative algorithm to find a good value for .

Note that when we have bounds on the third derivatives, (19) can be improved by linearising the divergence term in (15). We exploit this idea later in a logistic regression example (see Section 5.1).

If we do not have a global bound on the second derivatives, we can replace in the above arguments by any constant that bounds the second-derivatives for all such that . In this case the most extreme that may grow is logarithmically with , for instance for light-tailed models where the data really comes from the model being used. Where the tails are mis-specified and light-tailed models are being used, the algorithmic complexity can be considerably worse. There is considerable scope for more detailed analyses of these issues in future work.

The above arguments give insight into the impact of our choice of . It affects the bound on , and hence the computational efficiency of ScaLE, through the terms . Furthermore we have that the main term in the order of is the square of this distance. If is the posterior mean, then the square of this distance will, on average, be the posterior variance. By comparison if is posterior standard deviations away from the posterior mean, then on average the square distance will be times the posterior variance (for some constant ), and the computational cost of ScaLE will be increased by a factor of roughly .

4.4 Theoretical Properties

SMC algorithms in both discrete and continuous time have been studied extensively in the literature (for related theory for approximating a fixed-point distribution see [23, 22]). In order to avoid a lengthy technical diversion, we restrict ourselves here to studying a slightly simplified version of the problem in order to obtain the simplest and most interpretable possible form of results. We defer the technical details of this construction to Appendix H and give here only a qualitative description intended to guide intuition and the key result: that the resulting estimator satisfies a Gaussian central limit theorem with the usual Monte Carlo rate.

We consider a variant of the algorithm in which (multinomial) resampling occurs at times for where is a time step resolution specified in advance and consider the behaviour of estimates obtained at these times. Extension to resampling at a random subset of these resampling times would be possible using the approach of [21], considering precisely the QSMC algorithm presented in Algorithm 2 and the ScaLE algorithm in Algorithm 3 would require additional technical work somewhat beyond the scope of this paper; we did not observe any substantial difference in behaviour.

In order to employ standard results for SMC algorithms it is convenient to consider a discrete time embedding of the algorithms described. We consider an abstract formalism in which between the specified resampling times the trajectory of the Brownian motion is sampled, together with such auxiliary random variables as are required in any particular variant of the algorithm. Provided the potential function employed to weight each particle prior to resampling has conditional expectation (given the path) proportional to the exact killing rate integrated over these discrete time intervals we recover a valid version of the ScaLE algorithm.

This discrete time formalism allows for results on more standard SMC algorithms to be applied directly to the ScaLE algorithm. We provide in the following proposition a straightforward corollary to [20, Chapter 9], which demonstrates that estimates obtained from a single algorithmic time slice of the ScaLE algorithm satisfy a central limit theorem.

Proposition 3 (Central Limit Theorem). 

In the context described, under mild regularity conditions (see references given in Appendix H):

where, , is a standard normal random variable, denotes convergence in distribution, and depends upon the precise choice of sub-sampling scheme as well as the test function of interest and is specified in Appendix H.

5 Examples

In this section we present two applications of the methodology developed in this paper. In particular, in Section 5.1 for exposition we consider the application of ScaLE to logistic regression with a data set of size . In Section 5.2 we consider a problem in which we wish do parameter inference for a more complicated mixture model, motivated by a big data application with .

5.1 Example 1: Logistic Regression

In this subsection we consider an application of ScaLE to a dimensional logistic regression model, considering data sets of up to size . Logistic regression is a model frequently employed within big data settings [52], and here we will illustrate the scalability of ScaLE for this canonical model. Stating our model more precisely, we have binary observations and fit a model of the form,

(20)

We generate a data set of size from this model by first constructing a design matrix in which the entry , where are i.i.d. truncated Normal random variables. In the big data setting it is natural to assume such control on the extreme entries of the design matrix, either through construction or physical limitation. Upon simulating the design matrix, binary observations are obtained by simulation using the parameters . It should be noted that non-standard computational considerations were required in this example due to the extreme size of the data set. For instance, note that to store a data set of size and access it in the conventional manner would require at least Gb of RAM. However, we omit such discussion as it it is not pertinent to our exposition.

We executed ScaLE on increasingly large subsets of the full data set, from through . Each execution first required construction of (our control variate), which we described in Section 4.2, at a point close to the mode of our posterior. To be representative of what may be achievable in practise, to minimise evaluation of the full data set, and to be comparable between executions of ScaLE on differing data set size, we choose the point used to simulate the data (noting that this will not be the posterior mode, but could be conceivably similar to a point found using an optimisation scheme). Note that full evaluation of the data set in the conventional manner to construct this control variate required around days of single-core computation time (however, we conducted this computation entirely in parallel on a network of cores). The particle trajectories themselves were initialised from a Normal distribution with inflated identity matrix.

For each data set size we used a particle set of size , and for each required evaluation of the sub-sampling estimator we associated with each particle a subsample of the data set of size . As is typical for SMC methods, the stability of the methods depends on the choice of being large enough for the model and dimensionality of the problem in question. A particle set of size was chosen as it (and smaller sizes) exhibited some desired stability. was chosen as the sub-sample size as it naturally balanced the computational expenditure with other computational components of ScaLE – this being predominately composed of an expensive updating (in continuous time) of each particle trajectory – while providing additional stability to the importance weights.

Illustrative simulations of ScaLE of comparable length (both in terms of execution time, and diffusion time), with the data sets of size and respectively are given in Figure 1. Simulation of the data set of size required sub-sampling evaluations per unit of diffusion time, and resampling was conducted in continuous time on average every units of diffusion time. In contrast the data set of size required sub-sampling evaluations per unit of diffusion time, and resampling was conducted in continuous time on average every units of diffusion time. This is aligned to the theory we developed in Section 4.3, further noting that in this example for a given hypercube (suitably rescaled for data size) we obtain precisely the same dominating intensity .

Figure 1: Trace trajectories of ScaLE applied to logistic regression data sets of size (upper plots) and (lower plots) respectively. Red lines on the trace plots mark the mean of the particle parameter values used to initialise the algorithm, blue lines correspond to the ergodic average (after discarding the initial of the trajectories as burn-in).


To verify the scaling of ScaLE we calculate for each data set size an average computational expenditure per Effective Sample Size (ESS). ESS is calculated using the particle trajectories from ScaLE at a single time-point as the ratio of the variance of the estimator of the parameter using these particles to the posterior variance of the parameter [13]. To calculate the overall ESS we then account for the dependence of these estimators over-time by modelling this dependence as an AR(1) process. We summarise the ESS as the mean ESS across estimates of the 5 parameters and present this in Figure 2.

Note from Figure 2, we would not advocate using the full computational machinery of ScaLE for a small data setting – the complexity of the implementation alone would outweigh the simplicity of algorithms such as MALA. However, for moderately large data sizes () upwards, then the extremely good scaling of the iterative cost of ScaLE provides substantial gains. To emphasise this point, note that for the data set size we obtained an ESS per unit of diffusion time of around , whereas with the same computational budget (including the computation of ) then or perhaps iterations of MALA could be conducted.

Figure 2: Scaling of ScaLE in contrast with MALA. Red line corresponds to MALA scaling (with dotted line projected scaling). Black line corresponds to ScaLE.

5.2 Example 2: Contaminated Mixture

In this subsection we consider an application of ScaLE to conduct parameter inference for a contaminated mixture model. This is motivated by big data sets obtained from internet applications, in which the large data sets are readily available, but the data is of low quality and corrupted with noisy observations. In particular, in our example each datum comprises two features and we consider fitting a model in which the likelihood of an individual observation () is,

(21)

In this model represents the level of corruption and the variance of the corruption. A common approach if we wished to conduct parameter inference for this model using MCMC is data augmentation [55]. However, for large data sets this is not feasible as the dimensionality of the auxiliary variable vector will be . For convenience a transformation of the likelihood was made so that each transformed parameter is on . The details are omitted, and the results presented are given under the original parameterisation.

We generate a data set of size from the model with parameters . This choice of data size was made as it was the largest size that could fit in memory of the computing architecture used, avoiding additional computational complications arising not relevant to our algorithm. Note however that this is not a fundamental bottleneck should our methodology be implemented in a manner biddable to an appropriate architecture.

We choose a point to initialise the algorithm (which could conceivably be) obtained using an optimisation scheme. The point chosen to initialise the algorithm was , and this was also used as the point to compute our control variate (described in Section 4.2), which took approximately hours of computational time to evaluate (and is broadly indicative of the length of time a single iteration of an alternative MCMC scheme such as MALA would require). Note that as discussed in Section 4.3, this ‘mis-initialisation’ impacts the efficiency of the algorithm by a constant factor, but is however representative of what one in practice may conceivably be able to do (i.e. find by means of an optimisation scheme a point within the support of the target posterior close to some mode, and conduct a single calculation).

Applying ScaLE for this application we used a particle set of size , and run the algorithm for diffusion time of , with observations of each trajectory at a resolution of . Again, the choice of was made as in Section 5.1 as it provided the required stability. The choice of was made as it corresponded approximately to a computational budget of one week (again recalling that in this time if we realistically obtain only around - iterations of a single trajectory of MALA).

Each particle trajectory at each time was associated with a sub-sample of the full data set of size . As in Section 5.1, was chosen as it provided balance with other components of the algorithm, but allowed stabilisation of the importance weights. In total the entire run required accessing million individual data points, which corresponds to approximately full evaluations of the data set. Note that this corresponds to around datum accesses per particle trajectory per unit of diffusion time, or alternatively stated, around evaluations of the unbiased estimator. Resampling events took place on average every unit of diffusion time, during which each particle trajectory required around evaluations of the unbiased estimator. This suggests a natural implementation (as the required intensity is so large) in which within the localisation construction of ScaLE the user-specified hypercube side lengths are with arbitrarily small probability exceeded, and so one could in principle simply simulate unconstrained Brownian motion. For simplicity here this is what we present.

An example of a typical run can be found in Figure 3. We choose a burn-in period of , and alongside the trace plots in Figure 3 we provide an estimate of the marginal density of the parameters using the occupation measure of the trajectories in the interval . Selection of the burn-in period was conducted by simple examination of trace plots and stability of ergodic averages, as one may in practice do for MCMC. Selecting appropriate burn-in periods, and constructing other empirical diagnoses for QSMC and ScaLE, lies out-with the scope of this paper, but is an interesting avenue for future research.

To assess the quality of the simulation we employ the same batch mean method to estimate the marginal ESS for the run post burn-in as we detailed in Section 5.1. Mean ESS per dimension for this run was around . An analysis of MALA (for a necessarily much smaller run), indicates it is possible to achieve an ESS of around , where corresponds to the run length subsequent to burn-in. As indicated above, and neglecting burn-in, this would mean an achievable ESS for a comparable computational budget would be around -. Accounting for burn-in then to achieve the same ESS as ScaLE could require over years of computation.

Figure 3: Upper graphs are trace trajectories of ScaLE applied to the contaminated mixture data set. Red lines on the trace plots mark the parameter values used to initialise the algorithm, blue lines mark the mode of the marginal densities. Lower plots are marginal densities obtained by ScaLE for the contaminated mixture data set. Red lines mark the parameter values for which the data set was generated.

6 Conclusions

In this paper we have introduced a new class of Quasi-Stationary Monte Carlo (QSMC) methods which are genuinely continuous-time algorithms for simulating from complex target distributions. We have emphasised its particular effectiveness in the context of big data by developing novel sub-sampling approaches and the Scalable Langevin Exact (ScaLE) algorithm. Unlike its immediate competitors, our sub-sampling approach within ScaLE is essentially computationally free and does not result in any approximation to the target distribution. Our methodology is embedded within an SMC framework, which we use to provide theoretical results. In addition we give examples to which we apply ScaLE, demonstrating its robust scaling properties for large data sets.

We see this paper as a first step in what we hope will be a fruitful new direction for Computational Statistics. Many ideas for variations and extensions to our implementation exist and will stimulate further investigation.

Firstly, the need to simulate a quasi-stationary distribution creates particular challenges. Although quasi-stationarity is underpinned by an elegant mathematical theory, the development of numerical methods for quasi-stationarity is understudied. We have presented an SMC methodology for this problem, but alternatives exist. For instance, [11] suggest alternative approaches.

Even within an SMC framework for extracting the quasi-stationary distribution, there are interesting alternatives we have not explored. For example, by a modification of our re-weighting mechanism we can relate the target distribution of interest to the limiting smoothing distribution of the process, as opposed to the filtering distribution as we do here. Within the quasi-stationary literature this is often termed the type II quasi-stationary distribution. As such, the rich SMC literature offers many other variations on the procedures we adopt here.

By using SMC we are able to benefit from the rich theory underpinning its use. However our use of quasi-stationary Monte Carlo actually demands new questions of SMC. Theorem 1 gives convergence as , while Proposition 3 gives a precise description of the limit as the number of particles increases. There are theoretical and practical questions associated with letting both and tend to together. Within our examples we have chosen ad hoc rules to assign computational effort to certain values of and . However the general question of how to choose these parameters seems completely open.

There are interesting options for parallel implementation of SMC algorithms in conjunction with ScaLE. We could for instance attempt to implement the island particle filter [24] which could have substantial effects on the efficiency of our algorithms where large numbers of particles are required. Alternatively one could attempt to embed our scheme in other divide and conquer schemes as described in the introduction.

We have also concentrated solely on killed (or re-weighted) Brownian motion in our paper. We have shown that this strategy has robust convergence properties. However, given existing methodology for the exact simulation of diffusions in [9, 7, 8, 46, 48, 47], there is scope to develop methods which use proposal measures which much better mimic the shape of the posterior distribution.

Our sub-sampling and control variate approaches offer dramatic computational savings, as we see from the examples and from results such as Proposition 2. There may be scope to extend these ideas still further: for instance, it might be possible to sub-sample dimensions, and thus reduce the per-iteration cost in the dimension of the parameter.
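To make the existing construction concrete, the sketch below conveys the flavour of the sub-sampling/control-variate idea: after a single O(n) preprocessing pass at a centring point (such as a posterior mode), the killing rate can be estimated unbiasedly from O(m) data points, with two independent sub-samples handling the squared-gradient term since the product of independent unbiased estimates is unbiased for the product of their means. The interface and names are ours, and this is not the paper's exact estimator, which is built within the exact path-space framework.

```python
import numpy as np

def make_phi_estimator(grad_j, lap_j, n, x_hat, m=32, rng=None):
    """Unbiased O(m)-cost estimator of
        phi(x) = (||grad log pi(x)||^2 + Delta log pi(x)) / 2,
    where log pi(x) = sum_j log f_j(x) and grad_j(j, x), lap_j(j, x)
    are the j-th datum's contributions to its gradient and Laplacian
    (assumed interfaces, for illustration only)."""
    rng = np.random.default_rng() if rng is None else rng
    # One-off O(n) pass at the centring point x_hat (control variates).
    grad_full = sum(grad_j(j, x_hat) for j in range(n))
    lap_full = sum(lap_j(j, x_hat) for j in range(n))

    def grad_estimate(x):
        S = rng.integers(0, n, size=m)  # subsample with replacement
        corr = sum(grad_j(j, x) - grad_j(j, x_hat) for j in S)
        return grad_full + (n / m) * corr

    def phi_estimate(x):
        # Independent copies g1, g2 make g1 . g2 unbiased for the
        # squared gradient norm; the Laplacian term needs one subsample.
        g1, g2 = grad_estimate(x), grad_estimate(x)
        S = rng.integers(0, n, size=m)
        lap = lap_full + (n / m) * sum(lap_j(j, x) - lap_j(j, x_hat)
                                       for j in S)
        return 0.5 * (np.dot(g1, g2) + lap)

    return phi_estimate
```

The control variates mean the variance of the estimator is driven by the distance between x and the centring point; when the posterior concentrates around that point at the usual root-n rate, the variance remains controlled as n grows, which is the mechanism behind the sub-linear iterative cost discussed above.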

We conclude by noting that, as a byproduct, the theory behind our methodology offers new insights into the existence of quasi-stationary distributions for diffusions killed according to a state-dependent hazard rate, complementing and extending the current state of the art [54].

Acknowledgments

MP was supported by the EPSRC [grant number EP/K014463/1]. PF was supported by the EPSRC [grant numbers EP/N031938/1, EP/K014463/1]. GOR was supported by the EPSRC [grant numbers EP/K034154/1, EP/K014463/1, EP/D002060/1]. In addition, we would like to thank the referees for a number of very helpful suggestions and questions which have improved the exposition of this paper.

References

  • [1] Ahn, S., Balan, A. K. and Welling, M. (2012). Bayesian posterior sampling via stochastic gradient Fisher scoring. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), 1771–1778.
  • [2] Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics 37, 697–725.
  • [3] Asmussen, S., Glynn, P. and Pitman, J. (1995). Discretization error in simulation of one-dimensional reflecting Brownian motion. Annals of Applied Probability 5, 875–896.
  • [4] Baker, J., Fearnhead, P., Fox, E. B. and Nemeth, C. (2017). Control variates for stochastic gradient MCMC. arXiv preprint arXiv:1706.05439.
  • [5] Bardenet, R., Doucet, A. and Holmes, C. C. (2014). Towards scaling up Markov chain Monte Carlo: an adaptive subsampling approach. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 405–413.
  • [6] Bardenet, R., Doucet, A. and Holmes, C. C. (2015). On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827.
  • [7] Beskos, A., Papaspiliopoulos, O. and Roberts, G. O. (2006). Retrospective exact simulation of diffusion sample paths with applications. Bernoulli 12, 1077–1098.
  • [8] Beskos, A., Papaspiliopoulos, O. and Roberts, G. O. (2008). A factorisation of diffusion measure and finite sample path constructions. Methodology and Computing in Applied Probability 10, 85–104.
  • [9] Beskos, A. and Roberts, G. O. (2005). An exact simulation of diffusions. Annals of Applied Probability 15, 2422–2444.
  • [10] Bierkens, J., Fearnhead, P. and Roberts, G. O. (2016). The Zig-Zag process and super-efficient sampling for Bayesian analysis of big data. arXiv preprint arXiv:1607.03188.
  • [11] Blanchet, J., Glynn, P. and Zheng, S. (2016). Analysis of a stochastic approximation algorithm for computing quasi-stationary distributions. Advances in Applied Probability 48, 792–811.
  • [12] Burq, Z. A. and Jones, O. (2008). Simulation of Brownian motion at first passage times. Mathematics and Computers in Simulation 77, 64–71.
  • [13] Carpenter, J., Clifford, P. and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proceedings – Radar, Sonar and Navigation 146, 2–7.
  • [14] Chen, C., Ding, N. and Carin, L. (2015). On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, 2278–2286.
  • [15] Chen, T., Fox, E. B. and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1683–1691.
  • [16] Chopin, N. (2004). Central limit theorem for sequential Monte Carlo methods and its applications to Bayesian inference. Annals of Statistics 32, 2385–2411.
  • [17] Ciesielski, Z. and Taylor, S. J. (1962). First passage times and sojourn times for Brownian motion in space and the exact Hausdorff measure of the sample path. Transactions of the American Mathematical Society 103, 434–450.
  • [18] Collet, P., Martínez, S. and San Martín, J. (2013). Quasi-Stationary Distributions: Markov Chains, Diffusions and Dynamical Systems. Springer, Berlin.
  • [19] de Oliveira, M. M. and Dickman, R. (2005). How to simulate the quasistationary state. Physical Review E 71, 016129.
  • [20] Del Moral, P. (2004). Feynman–Kac Formulae: Genealogical and Interacting Particle Systems with Applications, 1st ed. Springer, New York.
  • [21] Del Moral, P., Doucet, A. and Jasra, A. (2012). On adaptive resampling procedures for sequential Monte Carlo methods. Bernoulli 18, 252–278.
  • [22] Del Moral, P., Jacob, P. E., Lee, A., Murray, L. and Peters, G. W. (2013). Feynman–Kac particle integration with geometric interacting jumps. Stochastic Analysis and Applications 31, 830–871.
  • [23] Del Moral, P. and Miclo, L. (2003). Particle approximations of Lyapunov exponents connected to Schrödinger operators and Feynman–Kac semigroups. ESAIM: Probability and Statistics 7, 171–208.
  • [24] Del Moral, P., Moulines, E., Olsson, J. and Vergé, C. (2016). Convergence properties of weighted particle islands with application to the double bootstrap algorithm. Stochastic Systems 6, 367–419.
  • [25] Devroye, L. (1986). Non-Uniform Random Variate Generation, 1st ed. Springer, New York.
  • [26] Devroye, L. (2009). On exact simulation algorithms for some distributions related to Jacobi theta functions. Statistics and Probability Letters 79, 2251–2259.
  • [27] Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J. and Xing, E. P. (2016). Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, 1154–1162.
  • [28] Fort, G. and Roberts, G. O. (2005). Subgeometric ergodicity of strong Markov processes. Annals of Applied Probability 15, 1565–1589.
  • [29] Groisman, P. and Jonckheere, M. (2013). Simulation of quasi-stationary distributions on countable spaces. Markov Processes and Related Fields 19, 521–542.
  • [30] Huggins, J. H. and Zou, J. (2017). Quantifying the accuracy of approximate diffusions and Markov chains. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), 382–391.
  • [31] Jacob, P. E. and Thiéry, A. H. (2015). On non-negative unbiased estimators. Annals of Statistics 43, 769–784.
  • [32] Johansen, A. M. and Doucet, A. (2008). A note on the auxiliary particle filter. Statistics and Probability Letters 78, 1498–1504.
  • [33] Johnson, R. A. (1970). Asymptotic expansions associated with posterior distributions. Annals of Mathematical Statistics 41, 851–864.
  • [34] Jordan, M. I. (2013). On statistics, computation and scalability. Bernoulli 19, 1378–1390.
  • [35] Karatzas, I. and Shreve, S. (1991). Brownian Motion and Stochastic Calculus, 2nd ed. Springer, New York.
  • [36] Kingman, J. F. C. (1992). Poisson Processes, 1st ed. Clarendon Press, Oxford.
  • [37] Kong, A., Liu, J. S. and Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association 89, 278–288.
  • [38] Korattikara, A., Chen, Y. and Welling, M. (2014). Austerity in MCMC land: cutting the Metropolis–Hastings budget. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 181–189.
  • [39] Li, C., Srivastava, S. and Dunson, D. B. (2017). Simple, scalable and accurate posterior interval estimation. Biometrika 104, 665–680.
  • [40] Ma, Y., Chen, T. and Fox, E. (2015). A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, 2917–2925.
  • [41] Maclaurin, D. and Adams, R. P. (2015). Firefly Monte Carlo: exact MCMC with subsets of data. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI), 4289–4295.
  • [42] Minsker, S., Srivastava, S., Lin, L. and Dunson, D. B. (2014). Scalable and robust Bayesian inference via the median posterior. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 1656–1664.
  • [43] Nagapetyan, T., Duncan, A. B., Hasenclever, L., Vollmer, S. J., Szpruch, L. and Zygalakis, K. (2017). The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692.
  • [44] Neiswanger, W., Wang, C. and Xing, E. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014), 623–632.
  • [45] Nicholls, G. K., Fox, C. and Watt, A. M. (2012). Coupled MCMC with a randomized acceptance probability. arXiv preprint arXiv:1205.6857.
  • [46] Pollock, M. (2013). Some Monte Carlo Methods for Jump Diffusions. PhD thesis, Department of Statistics, University of Warwick.
  • [47] Pollock, M. (2015). On the exact simulation of (jump) diffusion bridges. In Proceedings of the 2015 Winter Simulation Conference, 348–359.
  • [48] Pollock, M., Johansen, A. M. and Roberts, G. O. (2016). On the exact and ε-strong simulation of (jump) diffusions. Bernoulli 22, 794–856.
  • [49] Quiroz, M., Tran, M.-N., Villani, M. and Kohn, R. (2016). Exact subsampling MCMC. arXiv preprint arXiv:1603.08232.
  • [50] Revuz, D. and Yor, M. (2013). Continuous Martingales and Brownian Motion, Vol. 293. Springer Science & Business Media.
  • [51] Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer.
  • [52] Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I. and McCulloch, R. E. (2016). Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management 11, 78–88.
  • [53] Srivastava, S., Cevher, V., Tran-Dinh, Q. and Dunson, D. B. (2015). WASP: scalable Bayes via barycenters of subset posteriors. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2015), 912–920.
  • [54] Steinsaltz, D. and Evans, S. (2007). Quasistationary distributions for one-dimensional diffusions with killing. Transactions of the American Mathematical Society 359, 1285–1324.
  • [55] Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82, 528–540.
  • [56] Teh, Y. W., Thiéry, A. H. and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, 193–225.
  • [57] Vollmer, S. J., Zygalakis, K. C. and Teh, Y. W. (2016). Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research 17, 1–48.
  • [58] Wang, X. and Dunson, D. B. (2013). Parallelizing MCMC via Weierstrass sampler. arXiv preprint arXiv:1312.4605.
  • [59] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 681–688.

Appendix A Proof of Theorem 1

Here we present a proof of Theorem 1. However, we first formally state the required regularity conditions. We suppose that

(22)

and that

(23)

where represents the Laplacian.

Proof (Theorem 1).

Consider the diffusion with generator given by

As is bounded, we assume without loss of generality that its upper bound is . Our proof shall proceed by checking the conditions of Corollary 6 of [28], which establishes the result. In particular, we need to check that the following are satisfied:

  1. For all $\delta > 0$, the discrete-time chain $(X_{n\delta})_{n \geq 0}$ is irreducible;

  2. All closed bounded sets are petite;

  3. We can find a drift function for some , that satisfies the condition

    (24)

    for $x$ outside some bounded set, for each with associated positive constant , and where .

The first condition holds for any regular diffusion, since the diffusion possesses positive continuous transition densities over time intervals $\delta > 0$; positivity and continuity of the density also imply the second condition. For the final condition we require that

(25)

Now by direct calculation

(26)

so that

(27)

Therefore (25) will hold whenever (23) is true since we have the constraint that and is clearly non-negative. As such the result holds as required. ∎

Note that the condition in (23) is essentially a condition on the tail of $\pi$. It will hold even for heavy-tailed distributions, and we show this is the case for a class of one-dimensional target densities in Appendix B.

Appendix B Polynomial tails

In this appendix we examine condition (23), which we use within Theorem 1. This is essentially a condition on the tail of $\pi$, and so we examine the critical case in which the tails of $\pi$ are heavy. More precisely, we demonstrate that for polynomially tailed densities in one dimension, (23) essentially amounts to requiring that $\pi$ is integrable. Recall that by construction $\pi$ will be integrable, as we have chosen .
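As a concrete illustration of the kind of calculation involved (with a tail family we assume purely for exposition; the remainder of this appendix works with its own choice), take a one-dimensional density with pure polynomial tail, $\pi(x) = \beta x^{-(1+\beta)}$ on $[1,\infty)$ for some $\beta > 0$, and write $A = \log \pi$. Then
\[
A'(x) = -\frac{1+\beta}{x}, \qquad A''(x) = \frac{1+\beta}{x^{2}},
\]
so that the rate $\tfrac{1}{2}\bigl( (A')^{2} + A'' \bigr)$ driving the killing satisfies
\[
\tfrac{1}{2}\bigl( A'(x)^{2} + A''(x) \bigr) = \frac{(1+\beta)(2+\beta)}{2x^{2}} \longrightarrow 0 \quad \text{as } x \to \infty.
\]
In particular this rate is bounded and decays in the tail for every $\beta > 0$, that is, precisely when $\pi$ is integrable.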

For simplicity, suppose that is a density on such that . In this case we can easily compute that for ,

from which we can easily compute the quantity whose limit is taken in (23) as

As such, we have that condition (23) holds if and only if

(28)

and

(29)

Now we shall demonstrate that we can find such for all . For instance, suppose that . The case can be handled by just setting