Piecewise Deterministic Markov Processes for Scalable Monte Carlo on Restricted Domains
In a Bayesian setting, Piecewise Deterministic Monte Carlo allows for subsampling of the data whilst maintaining the true posterior as the invariant distribution. While current methods are limited to unrestricted parameter spaces, we show how these algorithms can be extended to restricted domains. We also introduce an open-source software package which implements Piecewise Deterministic Monte Carlo methods for sampling from densities on general domains.
Markov chain Monte Carlo (MCMC) methods have been central to the widespread use of Bayesian methods. However, their applicability to some modern applications has been limited due to their high computational cost, particularly in big-data, high-dimensional settings. This has led to interest in new MCMC methods, particularly non-reversible methods which can mix better than standard reversible MCMC [8], and variants of MCMC that require accessing only small subsets of the data at each iteration [25].
One of the main technical challenges associated with likelihood-based inference for big data is the fact that likelihood calculation is computationally expensive (typically O(n) for data sets of size n). This is a particular problem for traditional Markov chain Monte Carlo methods, which require at least one likelihood evaluation at each of typically many thousands of iterations. MCMC methods built from piecewise deterministic Markov processes [5]; [6]; [21] offer considerable promise for reducing this burden, due to natural subsampling techniques (see for example [4]; [19]).
Such piecewise deterministic processes explore the state space by moving with a constant velocity that changes at random event times. The rate of these event times, and the change in velocity at each event, are chosen so that the resulting process has the posterior distribution as its invariant distribution. We will refer to this family of sampling methods as Piecewise Deterministic Monte Carlo (PDMC) methods.
The PDMC methods proposed to date exhibit properties that suggest they will be particularly suited to large data applications. As mentioned, it is often possible to use only a small subsample of the data at each iteration, whilst still being guaranteed to target the true posterior distribution [4]; [6]; [11]; and factor graph decompositions of the target distribution can be leveraged to perform sparse updates of the variables [6]; [17]; [21]. Furthermore, the methods involve simulating a non-reversible Markov process, and there is substantial evidence that these have better properties than reversible MCMC methods (see e.g. [8]; [23]).
Existing PDMC algorithms can only be used to sample from posteriors where the parameters can take any value in ℝ^d. In this paper we show how to extend PDMC methodology to deal with constraints on the parameters. Such models are ubiquitous in machine learning and statistics. For example, many popular models used for binary, ordinal and polychotomous response data are multivariate real-valued latent variable models where the response is given by a deterministic function of the latent variables [1]; [9]; [22]. Under the posterior distribution, the domain of the latent variables is then constrained based on the values of the responses. Additional examples arise in regression where prior knowledge restricts the signs of marginal effects of explanatory variables, such as in econometrics [12], image processing and spectral analysis [3]; [13], and non-negative matrix factorization [15]. A few methods for dealing with restricted domains are available, but these either target an approximation of the correct distribution [20] or are limited in scope [19].
2 Piecewise Deterministic Monte Carlo
Our objective is to compute expectations with respect to a probability distribution π on an open set O ⊆ ℝ^d, which is assumed to have a smooth density, also denoted π, with respect to the Lebesgue measure on ℝ^d.
We denote the state of our continuous-time Markov process at time t as Z_t = (X_t, V_t), taking values in the domain E = O × V, where O and V are subsets of ℝ^d, such that O is open. The dynamics of Z_t are easy to describe if one views X_t as position and V_t as velocity. The position process moves deterministically, with constant velocity V_t, between a discrete set of switching times which are simulated according to inhomogeneous Poisson processes with respective intensity functions λ_i(Z_t), i = 1, …, K, depending on the current state of the system. At each switching time the position stays the same, but the velocity is updated according to a specified transition kernel. More specifically, suppose the next switching event occurs from Poisson process i; then the velocity immediately after the switch is sampled randomly from the probability distribution Q_i(·; x, v), i.e. for any measurable subset A of V, the probability that the new velocity is in A, given the current position x and velocity v before the switch, is Q_i(A; x, v). The switching times are random, and designed in conjunction with the kernels so that the invariant distribution of the process coincides with the target distribution π.
The resulting stochastic process is a Piecewise Deterministic Markov Process (PDMP, [7]). For it to be useful as the basis of a Piecewise Deterministic Monte Carlo (PDMC) algorithm we need to (i) be able to easily simulate this process; and (ii) have simple recipes for choosing the intensities, λ_i, and transition kernels, Q_i, such that the resulting process has π as its marginal stationary distribution. We will tackle each of these problems in turn.
2.1 Simulation
We shall first assume that O = ℝ^d. The key challenge in simulating our PDMP is simulating the event times. The intensity of events is a function of the state of the process. But as the dynamics between event times are deterministic, we can easily represent the intensity for the next event as a deterministic function of time. Suppose first that the PDMP is driven by a single inhomogeneous Poisson process; along the trajectory started at (x, v), its intensity is the deterministic function of time λ̃(t) = λ(x + t v, v).
We can simulate the first event time directly if we have an explicit expression for the inverse function of the monotonically increasing function
Λ(t) = ∫₀ᵗ λ̃(s) ds.   (1)
In this case the time until the next event is obtained by (i) simulating a realization, u say, of an exponential random variable with rate 1; and (ii) setting the time until the next event as the value τ that solves Λ(τ) = u.
In practice, inverting Equation 1 is often not feasible. In such cases simulation can be carried out via thinning [16]. This requires finding a tractable upper bound λ̄(t) ≥ λ̃(t) for all t ≥ 0. Such an upper bound will typically take the form of a piecewise linear function or a step function. Note that the upper bound is only required to be valid along the trajectory t ↦ x + t v. Therefore the upper bound will depend on the starting point of the line segment we are currently simulating.
We then propose potential events by simulating events from an inhomogeneous Poisson process with rate λ̄(t), and accept a proposed event at time τ with probability λ̃(τ)/λ̄(τ). The time of the first accepted event will be the time until the next event in our PDMP.
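The thinning step above can be sketched in a few lines. The following Python snippet is a minimal illustration (the helper name is ours, and for simplicity the bound is a constant; the paper's package uses Julia):

```python
import random

def first_event_by_thinning(lam, lam_bar, rng):
    """First event time of an inhomogeneous Poisson process with rate lam(t),
    simulated by thinning under a constant bound lam_bar >= lam(t) for all t."""
    t = 0.0
    while True:
        t += rng.expovariate(lam_bar)        # propose from the bounding process
        if rng.random() * lam_bar < lam(t):  # accept with probability lam(t)/lam_bar
            return t

# Example: decreasing rate lam(t) = 2/(1+t), globally bounded by lam_bar = 2.
rng = random.Random(1)
tau = first_event_by_thinning(lambda t: 2.0 / (1.0 + t), 2.0, rng)
```

For this rate the survival function is exp(−Λ(t)) = (1+t)^{−2}, so the simulated event times have median √2 − 1 ≈ 0.41, which gives a simple sanity check of the implementation.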
For a PDMP driven by K inhomogeneous Poisson processes with intensities λ_1, …, λ_K, the previous steps lead to the following algorithm for simulating the next event of our PDMP. This algorithm can be iterated to simulate the PDMP for a chosen number of events or a pre-specified time interval.
Initialize: Set τ₀ to the current time and (x, v) to the current position and velocity.
Determine bound: For each i = 1, …, K, find a convenient function λ̄_i(t) satisfying λ̄_i(t) ≥ λ_i(x + t v, v) for all t ≥ 0; the bound may depend on the initial point (x, v) from which we are departing.
Propose event: For i = 1, …, K, simulate the first event time τ_i of a Poisson process with rate function λ̄_i(t).
Let i* = argmin_i τ_i and τ = τ_{i*}.
Accept/Reject event: With probability λ_{i*}(x + τ v, v) / λ̄_{i*}(τ), accept this event.
Upon acceptance: set x ← x + τ v; simulate a new velocity v′ from Q_{i*}(·; x, v) and set v ← v′.
Upon rejection: set x ← x + τ v and retain the current velocity v.
Update: Advance the clock to τ₀ + τ and record the time and state (x, v).
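The iterated algorithm can be made concrete in one dimension. The sketch below (our own illustration, not the paper's implementation) targets the standard Gaussian, U(x) = x²/2, with the canonical rate max(0, v·x), the linear bound max(0, v·x) + t along each segment, and a deterministic velocity flip at accepted events; it returns the time average of x² along the piecewise linear path, which should converge to Var(X) = 1:

```python
import math
import random

def simulate_pdmp_1d(T=20000.0, seed=0):
    """1D PDMP targeting N(0,1): switching rate max(0, v*x), velocity flip at
    events, proposals from the linear bound a + t with a = max(0, v*x)."""
    rng = random.Random(seed)
    x, v, t = 0.0, 1.0, 0.0
    m2 = 0.0                                   # accumulates the integral of x(s)^2 ds
    while t < T:
        a = max(0.0, v * x)                    # bound intercept at segment start
        u = rng.expovariate(1.0)
        tau = math.sqrt(a * a + 2.0 * u) - a   # solves a*tau + tau^2/2 = u
        tau = min(tau, T - t)
        # closed-form integral of (x + v s)^2 over the segment (|v| = 1)
        m2 += tau * x * x + v * x * tau ** 2 + tau ** 3 / 3.0
        x, t = x + v * tau, t + tau
        if t < T and rng.random() * (a + tau) < max(0.0, v * x):
            v = -v                             # accepted event: flip the velocity
    return m2 / T

var_estimate = simulate_pdmp_1d()
```

The bound is valid because along a segment the true rate is max(0, v·x + t) ≤ max(0, v·x) + t, and the linear bound is inverted in closed form as in step (ii) above.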
2.2 Output of PDMC Algorithms
The output of these algorithms will be a sequence of event times τ^(1), τ^(2), … and associated states (x^(1), v^(1)), (x^(2), v^(2)), …. To obtain the value of the process at a time t ∈ (τ^(j), τ^(j+1)), we can linearly interpolate the continuous path of the process between event times, i.e. X_t = x^(j) + (t − τ^(j)) v^(j). Time integrals of a function of the process can be approximated by numerically integrating the resulting one-dimensional integral along the piecewise linear trajectory of the PDMP. However, in many instances such integrals can be computed analytically from the output of the above algorithm. Alternatively, we can sample the PDMP at a set of evenly spaced time points along the trajectory and use this collection as an approximate sample from our target distribution. Note that using the values of the process only at event times will not produce samples drawn from the stationary distribution of the PDMP [7].
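The even-spacing step above amounts to walking the piecewise linear path. A minimal sketch (our own helper, shown for a one-dimensional position for simplicity):

```python
def discretize(times, xs, vs, dt):
    """Sample the piecewise-linear PDMP path at evenly spaced times.
    times: increasing event times; xs, vs: positions/velocities at those times."""
    out, t, j = [], times[0], 0
    while t <= times[-1]:
        # advance to the segment containing time t
        while j + 1 < len(times) and times[j + 1] <= t:
            j += 1
        out.append(xs[j] + (t - times[j]) * vs[j])   # linear interpolation
        t += dt
    return out

# Path: start at 0 moving right, turn at t=1, turn again at t=3.
samples = discretize([0.0, 1.0, 3.0], [0.0, 1.0, -1.0], [1.0, -1.0, 1.0], 0.5)
# samples -> [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0]
```

The same interpolation gives exact time averages of polynomials of the state, since each segment contributes a closed-form integral.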
2.3 Choosing the Intensity and Transition Kernel
Most existing PDMC methods [4]; [6]; [21] assume that the target density, π, is defined on ℝ^d. They further require that π be differentiable. Assuming these two conditions hold, we can provide criteria which must hold for a given probability distribution to be a stationary distribution of Z_t. We shall consider stationary distributions for which the position and velocity are independent, i.e. distributions of the form μ(dx, dv) = π(dx) ρ(dv) on E. Furthermore we assume that
π(x) ∝ exp(−U(x)),
where U : ℝ^d → ℝ is continuously differentiable.
We impose the condition that ρ is invariant under each Q_i, i.e.
∫_V Q_i(A; x, v) ρ(dv) = ρ(A)   (3)
for all x ∈ O and measurable A ⊆ V. This is implied by the easier but stronger condition that each Q_i is reversible with respect to ρ, i.e. for all x ∈ O,
Q_i(dv′; x, v) ρ(dv) = Q_i(dv; x, v′) ρ(dv′).   (4)
Moreover, we shall require the following condition, which relates the probability flow with the switching intensities λ_i:
∑_{i=1}^K ( ∫_V λ_i(x, u) Q_i(dv; x, u) ρ(du) − λ_i(x, v) ρ(dv) ) = −⟨v, ∇U(x)⟩ ρ(dv)   (5)
for all (x, v) ∈ E. The following result is proved in the supplementary material, A.1.
Proposition 1. Suppose conditions (3) and (5) hold. Then μ(dx, dv) = π(dx) ρ(dv) is an invariant distribution for the PDMP Z_t.
Under the assumption that the resulting PDMP is ergodic (for sufficient conditions see e.g. [4]; [6]), we have the following version of the law of large numbers [14] for the PDMP Z_t: for all square-integrable functions f (i.e. f ∈ L²(μ)), we have that, with probability one,
lim_{T→∞} (1/T) ∫₀ᵀ f(Z_s) ds = ∫_E f dμ.
It is this formula which allows us to use PDMPs for Monte Carlo purposes.
2.4 Example: The Bouncy Particle Sampler
Current PDMC algorithms differ in terms of how the λ_i and Q_i are chosen such that the above conditions hold for some simple distribution ρ for the velocity. Here we discuss how the Bouncy Particle Sampler (BPS), introduced in [21] and explored in [6], is an example of the framework introduced here. In the supplementary material, A.1.1, the Zig-Zag sampler is described as a second example. In the following, δ_a denotes the Dirac measure centered at a.
The Bouncy Particle Sampler is obtained by setting K = 1 and taking ρ to be either the standard Gaussian distribution on ℝ^d or the uniform distribution on the unit sphere. The single switching rate is chosen to be λ(x, v) = max(0, ⟨v, ∇U(x)⟩), with corresponding switching kernel which reflects v with respect to the orthogonal complement of ∇U(x) with probability 1:
Q(dv′; x, v) = δ_{R(x) v}(dv′),   R(x) = I − 2 P_{∇U(x)},
where P_{∇U(x)} denotes orthogonal projection onto the one-dimensional subspace spanned by ∇U(x).
As noted in [6], this algorithm suffers from reducibility issues. These can be overcome by refreshing the velocity, drawing a new velocity independently from ρ. In the simplest case the refreshment times come from an independent Poisson process with constant rate λ_ref. This also fits in the framework above by choosing λ₂(x, v) = λ_ref and Q₂(dv′; x, v) = ρ(dv′).
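The two BPS velocity updates are simple to write down explicitly. The following sketch (our own helper names; the paper's package is in Julia) implements the reflection v′ = v − 2 (⟨v, g⟩ / ⟨g, g⟩) g, with g = ∇U(x), and Gaussian refreshment:

```python
import random

def bounce(v, g):
    """Reflect velocity v in the hyperplane orthogonal to the gradient g:
    v' = v - 2 (<v,g>/<g,g>) g.  Preserves |v| and flips the sign of <v,g>."""
    c = 2.0 * sum(vi * gi for vi, gi in zip(v, g)) / sum(gi * gi for gi in g)
    return [vi - c * gi for vi, gi in zip(v, g)]

def refresh(d, rng):
    """Draw a fresh velocity from the standard Gaussian on R^d."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

v, g = [1.0, 2.0], [3.0, 0.0]
w = bounce(v, g)   # reflects the first component: approximately [-1.0, 2.0]
```

Because the reflection is an involution that preserves the norm, it is reversible with respect to both the uniform distribution on the sphere and the standard Gaussian, as required by condition (4).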
As a generalization of the BPS, one can consider a preconditioned version, which is obtained by introducing a constant positive definite symmetric matrix M to rescale the velocity process, allowing certain directions to be preferred. More specifically, we set ρ = N(0, M), and
Q(dv′; x, v) = δ_{R_M(x) v}(dv′),
where R_M(x) v = v − 2 (⟨v, ∇U(x)⟩ / ⟨∇U(x), M ∇U(x)⟩) M ∇U(x). It is straightforward to check that conditions (3) and (5) hold, and so μ is an invariant measure. The choice of M plays a very similar role to the mass matrix in HMC, and careful tuning can give rise to dramatic increases in performance [11]. For Gaussian targets, the natural choice of M is the covariance of the target distribution. See [18] for a related approach.
3 Sampling over Restricted Domains
Suppose now that O is an open, pathwise-connected subset of ℝ^d with Lipschitz boundary ∂O. We can adapt the PDMP scheme described previously to sample from a distribution supported on O. To ensure that X_t remains confined within the closure of O, the velocity of the process is updated whenever X_t hits ∂O, so that the process moves back into O. We shall refer to such updates as boundary reflections, even though they need not be specular reflections.
To handle boundary reflections, at every given time we keep track of the next reflection event in the absence of a switching event, i.e. from the current state (x, v) we compute
τ_bd = inf{ t > 0 : x + t v ∉ O }.
If the boundary ∂O can be represented as a finite set of hyperplanes in ℝ^d, then the cost of computing τ_bd is linear in the number of hyperplanes. When generating the switching event times and positions for Z_t, we determine whether a boundary reflection will occur before the next potential switching event. If so, then we induce a switching event at time τ_bd, at position x_bd = x + τ_bd v, and sample a new velocity from the boundary transition kernel Q_b, i.e. v ~ Q_b(·; x_bd, v).
Although theoretically we may choose a new velocity pointing outwards and have an immediate second jump, for algorithmic purposes we will assume that the probability measure Q_b(·; x, v) is concentrated on those directions v′ for which ⟨v′, n(x)⟩ < 0, where n(x) is the outward normal at x ∈ ∂O.
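For the common case of sign constraints, where O is the positive orthant and the boundary is the set of hyperplanes {x_i = 0}, the hit time and a specular boundary reflection reduce to coordinate-wise computations. A minimal sketch (our own helpers, assuming the positive orthant):

```python
def boundary_hit_time(x, v):
    """First time the ray x + t*v leaves the positive orthant {x_i > 0}:
    the smallest t with x_i + t*v_i = 0 over coordinates with v_i < 0."""
    times = [-xi / vi for xi, vi in zip(x, v) if vi < 0.0]
    return min(times) if times else float("inf")

def reflect_at_boundary(x, v):
    """Specular reflection on a face {x_i = 0}: flip the offending component."""
    i = min(range(len(x)), key=lambda k: x[k])   # coordinate that hit zero
    w = list(v)
    w[i] = -w[i]
    return w

tau = boundary_hit_time([1.0, 2.0], [-0.5, -0.25])   # first face hit at t = 2.0
```

With m constraint hyperplanes the hit-time computation is a loop of m ray-plane intersections, matching the linear cost noted above.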
Based on this, we thus modify steps (2)–(4) of the PDMP scheme as follows.
Propose event: For i = 1, …, K, simulate the first event time τ_i of a Poisson process with rate function λ̄_i(t). Compute the next boundary reflection time τ_bd.
Let i* = argmin_i τ_i and τ = τ_{i*}.
If τ_bd < τ, then set τ ← τ_bd; set x ← x + τ v; sample a new velocity v ~ Q_b(·; x, v).
Otherwise, with probability λ_{i*}(x + τ v, v) / λ̄_{i*}(τ), accept the event at time τ.
Upon acceptance: set x ← x + τ v; sample a new velocity v ~ Q_{i*}(·; x, v).
Upon rejection: set x ← x + τ v and retain the current velocity v.
To ensure that μ is an invariant distribution of the process Z_t, we must impose conditions on the boundary transition kernel Q_b. The following result provides one such sufficient condition; the proof is deferred to the supplementary material, A.2.
Proposition 2. Suppose that, in addition to the conditions of Proposition 1, for all x ∈ ∂O and all measurable A ⊆ V,
∫_{⟨v, n(x)⟩ > 0} ⟨v, n(x)⟩ Q_b(A; x, v) ρ(dv) = ∫_{A ∩ {⟨v, n(x)⟩ < 0}} |⟨v, n(x)⟩| ρ(dv).   (7)
Then μ(dx, dv) = π(dx) ρ(dv) is an invariant distribution for the PDMP Z_t.
In practice, we only have to consider the exit region Γ⁺ = {(x, v) ∈ ∂O × V : ⟨v, n(x)⟩ > 0}. For example, if O is the positive orthant and x lies in the interior of the face {x_i = 0}, then n(x) = −e_i and (x, v) ∈ Γ⁺ precisely when v_i < 0. The specification of Q_b off Γ⁺ is irrelevant, as these points are never reached by Z_t. On this irrelevant set, we may choose Q_b as desired to satisfy (7). In practice it therefore suffices to check (4) and (7) on Γ⁺.
Another possible behavior at the boundary is to generate the new reflected direction independently of the angle of incidence. This will also preserve the invariant distribution, provided that ρ is isotropic.
3.1 Example: Bouncy Particle Sampler
A natural choice of boundary kernel for the BPS is
Q_b(dv′; x, v) = δ_{v − 2⟨v, n(x)⟩ n(x)}(dv′)
for (x, v) ∈ Γ⁺, so that the process reflects specularly at the boundary (i.e. the angle of incidence equals the angle of reflection with respect to the boundary normal). It is straightforward to check that condition (3) holds at the boundary and that (7) is satisfied. Similarly, for the preconditioned BPS with mass matrix M, the natural choice of reflection at the boundary is given by
v ↦ v − 2 (⟨v, n(x)⟩ / ⟨n(x), M n(x)⟩) M n(x),
for which it is straightforward to confirm that the conditions of Proposition 2 are satisfied.
4 Software and Numerical Experiments
An open-source Julia package, PDMP.jl, has been developed to provide efficient implementations of various recently developed piecewise deterministic Monte Carlo methods for sampling in (possibly restricted) continuous spaces. A variety of algorithms are implemented, including the Zig-Zag sampler and the Bouncy Particle Sampler with full and local refreshment, along with control-variate-based subsampling for these methods. The package has been specifically designed with extensibility in mind, permitting rapid implementation of new PDMP-based methods. The library, along with code and documentation, is available at github.com/alan-turing-institute/PDMP.jl.
We use Bayesian binary logistic regression as a testbed for our newly proposed methodology and perform a simulation study. The data are modelled by
P(y_i = 1 | x_i, β) = 1 / (1 + exp(−⟨β, x_i⟩)),
where the x_i ∈ ℝ^d are fixed covariates and β ∈ ℝ^d is the parameter vector. We will assume that we wish to fit this model under monotonicity constraints, so that the probability of y_i = 1 is known to either increase or decrease with certain covariates. This is modelled through the constraints β_j ≥ 0 and β_j ≤ 0, respectively. An example where such restrictions occur naturally is logistic regression for questionnaires; see [24]. In the following we consider the constraint β_j ≥ 0 for all j.
For simplicity we use a flat prior over the space of parameter values consistent with our constraints. By Bayes' rule the posterior then satisfies
π(β) ∝ ∏_{i=1}^n P(y_i | x_i, β) · 1_O(β),
where O is the space of parameter values consistent with our constraints. We implement the BPS with subsampling. As explained in the introduction, subsampling is a key benefit of using piecewise deterministic sampling methods; an informal overview of this idea and its application is given in the supplementary material, A.3. We use specular reflection at the boundary, i.e. v ↦ v − 2⟨v, n(x)⟩ n(x) for x ∈ ∂O. We can bound the switching intensity by a linear function of time, even when we use the subsampling estimator for the switching rate; see the supplementary material, A.4, for details on the application of subsampling in this example. We generate artificial data based on a fixed true parameter vector and covariates whose components are realizations of i.i.d. uniformly distributed random variables. In Figure 1 we compare standard HMC with BPS in terms of effective sample size (ESS) per epoch of dataset evaluation.
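With a flat prior, the per-observation potential for this model is U_j(β) = −y_j⟨β, x_j⟩ + log(1 + exp(⟨β, x_j⟩)), whose gradient drives the switching rate. A minimal sketch of this gradient (our own helper names, not part of PDMP.jl, which is written in Julia):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_U_j(beta, x_j, y_j):
    """Gradient of the j-th negative log-likelihood term for logistic
    regression: grad U_j(beta) = (sigmoid(<beta, x_j>) - y_j) * x_j."""
    s = sigmoid(sum(b * xi for b, xi in zip(beta, x_j)))
    return [(s - y_j) * xi for xi in x_j]

g = grad_U_j([0.0, 0.0], [1.0, 2.0], 1)   # at beta = 0: (0.5 - 1) * x_j
```

Summing grad_U_j over all observations gives ∇U(β); replacing the sum by a single rescaled term gives the subsampled estimator discussed in the supplementary material.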
For HMC we use leap-frog steps. We find that we must tune HMC to have a small step size, due to proposals being rejected at the boundary. Note that there exists a version of HMC which can sample from truncated Gaussian distributions [19], but to our knowledge no efficient HMC scheme is available for general restricted domains. We note that, due to increasingly many rejections at the boundary, the effective sample size of the HMC scheme is relatively unchanged over increasing epochs. On the other hand, we observe that the BPS shows a clear advantage over the HMC scheme in this case, with the effective sample size growing commensurately with the number of epochs. The Bouncy Particle Sampler for this model was implemented using PDMP.jl, while the corresponding HMC sampler was implemented with Klara.jl. The code for this numerical experiment, along with results, is available at github.com/tlienart/ConstrainedPDMP/.
5 Discussion
This work provides a framework for describing a general class of PDMC methods which are ergodic with respect to a given target probability distribution. Open questions remain on how the choice of intensity function, velocity transition kernel, and other parameters of the system influence the overall performance of the scheme. The problem of understanding the true computational cost of such PDMC schemes is more subtle than for classical discrete-time MCMC schemes. Indeed, a PDMC scheme which maximizes the rate of convergence to equilibrium or the speed of mixing need not be an optimal choice if the average switching rate (which is proportional to the number of gradient evaluations per unit time) is very high. Further investigation is required to understand this delicate balance.
All authors thank the Alan Turing Institute and the Lloyd's Register Foundation for support. S.J.V. gratefully acknowledges funding through EPSRC EP/N000188/1. J.B., P.F. and G.R. gratefully acknowledge EPSRC ilike grant EP/K014463/1. A.B.D. acknowledges grant EP/L020564/1.
- (1) J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American statistical Association, 88(422):669–679, 1993.
- (2) R. Bardenet, A. Doucet, and C. Holmes. On Markov chain Monte Carlo methods for tall data. arXiv preprint arXiv:1505.02827, 2015.
- (3) S. Bellavia, M. Macconi, and B. Morini. An interior point Newton-like method for non-negative least-squares problems with degenerate solution. Numerical Linear Algebra with Applications, 13(10):825–846, 2006.
- (4) J. Bierkens, P. Fearnhead, and G. Roberts. The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data, 2016.
- (5) J. Bierkens and G. Roberts. A piecewise deterministic scaling limit of lifted Metropolis-Hastings in the Curie-Weiss model. arXiv preprint arXiv:1509.00302, 2015.
- (6) A. Bouchard-Côté, S. J. Vollmer, and A. Doucet. The Bouncy Particle Sampler: A Non-Reversible Rejection-Free Markov Chain Monte Carlo Method. arXiv preprint arXiv:1510.02451, 2015.
- (7) M. H. A. Davis. Piecewise-deterministic Markov processes: A general class of non-diffusion stochastic models. Journal of the Royal Statistical Society, Series B (Methodological), pages 353–388, 1984.
- (8) P. Diaconis, S. Holmes, and R. Neal. Analysis of a nonreversible Markov chain sampler. Annals of Applied Probability, 10(3):726–752, 2000.
- (9) L. Fahrmeir and G. Tutz. Multivariate statistical modelling based on generalized linear models. 2001. Springer, New York.
- (10) P. Fearnhead, J. Bierkens, M. Pollock, and G. O. Roberts. Piecewise deterministic Markov processes for continuous-time Monte Carlo. arXiv preprint arXiv:1611.07873, 2016.
- (11) N. Galbraith. On Event-Chain Monte Carlo Methods. Master’s thesis, Department of Statistics, Oxford University, 2016.
- (12) J. Geweke. Exact inference in the inequality constrained normal linear regression model. Journal of Applied Econometrics, 1(2):127–141, 1986.
- (13) Y. Guo and M. Berman. A comparison between subset selection and l1 regularisation with an application in spectroscopy. Chemometrics and Intelligent Laboratory Systems, 118:127–138, 2012.
- (14) J. Jacod and A. Shiryaev. Limit theorems for stochastic processes, volume 288. Springer Science & Business Media, 2013.
- (15) H. Kim and H. Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12):1495–1502, 2007.
- (16) P. A. Lewis and G. S. Shedler. Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 26(3):403–413, 1979.
- (17) Y. Nishikawa, M. Michel, W. Krauth, and K. Hukushima. Event-chain algorithm for the Heisenberg model. Physical Review E, 92(6):63306, 2015.
- (18) A. Pakman, D. Gilboa, D. Carlson, and L. Paninski. Stochastic bouncy particle sampler. arXiv preprint arXiv:1609.00770, 2016.
- (19) A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.
- (20) S. Patterson and Y. W. Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
- (21) E. A. J. F. Peters and G. De With. Rejection-free Monte Carlo sampling for general potentials. Physical Review E, 85(2):1–5, 2012.
- (22) K. E. Train. Discrete choice methods with simulation. Cambridge University Press, 2009.
- (23) K. S. Turitsyn, M. Chertkov, and M. Vucelja. Irreversible Monte Carlo algorithms for efficient sampling. Physica D: Nonlinear Phenomena, 240(4-5):410–414, 2011.
- (24) G. Tutz and J. Gertheiss. Rating scales as predictors—the old question of scale level and some answers. Psychometrika, 79(3):357–376, 2014.
- (25) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
Appendix A Supplementary Material
In the supplementary material we provide a theoretical background for the framework for unrestricted-domain PDMC (A.1, which includes the Zig-Zag sampler as a further example) and restricted-domain PDMC (A.2). Some background information on subsampling is provided in A.3, and the application of subsampling to the logistic regression example may be found in A.4.
A.1 Sampling from domains without restrictions
Let us first focus on the case where O = ℝ^d. From [7, Section 5], the process Z_t will have infinitesimal generator given by the closure of the operator
L f(x, v) = ⟨v, ∇_x f(x, v)⟩ + ∑_{i=1}^K λ_i(x, v) ∫_V [ f(x, v′) − f(x, v) ] Q_i(dv′; x, v),   (9)
with domain D(L), the set of functions which are continuously differentiable with respect to x and which decay sufficiently fast as |x| → ∞. Having characterized the infinitesimal generator, we can now prove Proposition 1.
Sketch Proof of Proposition 1.
Without loss of generality we take ∫ exp(−U(x)) dx = 1, i.e. the proportionality factor in π is assumed to be 1. We shall only provide a formal proof of this result, by demonstrating that
∫_E L f dμ = 0 for all f ∈ D(L),
so that μ is infinitesimally invariant. A rigorous proof would require establishing that D(L) as defined above is a core for the extended generator; this is a technical result which we defer to future work. Let f ∈ D(L). Integrating by parts in x, and using ∇π(x) = −∇U(x) π(x), we have that
∫_E ⟨v, ∇_x f(x, v)⟩ μ(dx, dv) = ∫_E f(x, v) ⟨v, ∇U(x)⟩ μ(dx, dv).
It follows, after exchanging the order of integration in the jump term and applying conditions (3) and (5), that
∫_E L f dμ = ∫_E f(x, v) ⟨v, ∇U(x)⟩ μ(dx, dv) − ∫_E f(x, v) ⟨v, ∇U(x)⟩ μ(dx, dv) = 0,
so that μ is infinitesimally invariant with respect to L. ∎
A.1.1 The Zig-Zag sampler
The Zig-Zag sampler [4] can be recovered by choosing K = d, picking as velocity space V = {−1, +1}^d equipped with the discrete uniform distribution ρ, and defining switching rates λ_i(x, v) = max(0, v_i ∂_i U(x)). The corresponding switching kernels over new directions are given by
Q_i(dv′; x, v) = δ_{F_i(v)}(dv′),
where F_i denotes the operation of flipping the i-th component, i.e. (F_i(v))_i = −v_i, and (F_i(v))_j = v_j for j ≠ i.
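The Zig-Zag rates and flip kernels are elementary to implement. A minimal sketch (our own helper names, shown for illustration only):

```python
def zigzag_rates(v, grad):
    """Zig-Zag switching rates: lambda_i(x, v) = max(0, v_i * dU/dx_i),
    given the velocity v and the gradient of U at the current position."""
    return [max(0.0, vi * gi) for vi, gi in zip(v, grad)]

def flip(v, i):
    """Kernel Q_i: deterministically flip the i-th velocity component."""
    w = list(v)
    w[i] = -w[i]
    return w

rates = zigzag_rates([1.0, -1.0], [0.5, 2.0])   # only the first rate is active
```

Summing the rates over i recovers the flow condition check: the accepted coordinate i is chosen with probability proportional to λ_i, after which F_i is applied.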
A.2 Sampling from domains with restrictions
We now consider the case where O is a pathwise-connected, open subset of ℝ^d with Lipschitz boundary ∂O. Following [7], the PDMP will have infinitesimal generator given by the closure of (9), with domain the continuously differentiable functions vanishing sufficiently fast as |x| → ∞ and satisfying the boundary condition
f(x, v) = ∫_V f(x, v′) Q_b(dv′; x, v)
for all x ∈ ∂O and v with ⟨v, n(x)⟩ > 0. Based on this identification of the infinitesimal generator, we can now provide a formal proof that the conditions of Proposition 2 are sufficient to ensure invariance of μ.
Sketch Proof of Proposition 2.
Following the argument of the proof of Proposition 1, the integration by parts over O now produces an additional boundary term,
∫_{∂O} ∫_V ⟨v, n(x)⟩ f(x, v) ρ(dv) π(x) σ(dx),
where σ denotes the surface measure on ∂O. For f in the domain of the generator, applying the boundary condition on the exit directions and then condition (7) shows that the boundary term evaluates to zero. The argument then continues as in Proposition 1, thus establishing the invariance of the target distribution. ∎
A.3 Subsampling
When using PDMC to sample from a posterior, we can use subsamples of the data at each iteration of the algorithm, as described in [4]. In the following we will assume that we can write the posterior as π(x) ∝ exp(−∑_{j=1}^n U_j(x)) for some functions U_j. For example, exp(−U_j(x)) could be the likelihood of the j-th of n IID data points times the (1/n)-th power of the prior.
We can then introduce the unbiased estimator of ∇U(x) given by
∇Û(x) = n ∇U_J(x),
where J is drawn uniformly from {1, …, n}.
The idea of using subsampling within, say, the Bouncy Particle Sampler (BPS) is that at each iteration of our PDMC algorithm we can replace ∇U by an unbiased estimator ∇Û in step (3). We need to use the same estimate both when calculating the actual event rate in the accept/reject step and, if we accept, when simulating the new velocity; for each new implementation of step (3) we use an independent estimate. The only further alteration we need to the algorithm is to choose an upper bound that holds for all realizations of ∇Û. A more comprehensive explanation of this argument can be found in [10] in the context of the Zig-Zag sampler, and in [18] for the Bouncy Particle Sampler.
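Combined with the control variate described below, the single-sample gradient estimate takes a simple form. The following sketch (our own helper names; a hypothetical `grad_j` callable returns the gradient of a single term U_j) illustrates the estimator and its unbiasedness:

```python
import random

def grad_estimate(beta, x_hat, grad_at_hat, grad_j, n, rng):
    """Unbiased single-term estimate of grad U(beta) with a control variate:
    grad U(x_hat) + n * (grad U_J(beta) - grad U_J(x_hat)), J ~ Uniform{0,...,n-1}."""
    j = rng.randrange(n)
    gb, gh = grad_j(beta, j), grad_j(x_hat, j)
    return [g0 + n * (b - h) for g0, b, h in zip(grad_at_hat, gb, gh)]

# Toy example with grad U_j(b) = c_j * b, so grad U(b) = (sum of c_j) * b = 6 b.
c = [1.0, 2.0, 3.0]
gj = lambda b, j: [c[j] * b[0]]
est = grad_estimate([1.0], [0.5], [6.0 * 0.5], gj, 3, random.Random(0))
```

Averaging over the random index J recovers ∇U(β) exactly, while the variance shrinks as β approaches the reference point x̂.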
It is straightforward to show that the resulting BPS algorithm uses an event rate of max(0, ⟨v, ∇Û(x)⟩), and that this rate and the resulting transition probability at events satisfy the conditions of Proposition 1. Hence this algorithm still targets π, but only requires access to one data point at each accept/reject decision.
Note that this gain in computational efficiency does not come for free: it follows from Jensen's inequality that the overall rate of events will be higher, which makes mixing of the PDMC process slower. It is also immediate that the bound λ̄ we will have to use will be larger. However, [4] show that if our estimator of ∇U has sufficiently small variance, then we can still gain substantially in terms of efficiency. In particular, they give an example where the CPU cost per effective sample does not grow with n; by comparison, all standard MCMC algorithms would have a cost that is at least linear in n.
To obtain such a low-variance estimator we use a control variate: for a fixed reference point x̂, we take
∇Û(x) = ∇U(x̂) + n ( ∇U_J(x) − ∇U_J(x̂) ),
which is unbiased and has small variance whenever X_t is close to x̂ with high probability. This involves a preprocessing step to find a value x̂ close to the posterior mode; a preprocessing step to calculate ∇U(x̂) is also needed.
We now illustrate how to find an upper bound on the event rate. Following [4], assume C is a uniform (in space and in j) upper bound on the largest eigenvalue of the Hessian of U_j, and that ‖v‖ = 1. Then, for the control-variate estimator above,
max(0, ⟨v, ∇Û(x + t v)⟩) ≤ max(0, a + b t), with a = ⟨v, ∇U(x̂)⟩ + n C ‖x − x̂‖ and b = n C.
Thus the upper bound on the intensity is of the form λ̄(t) = max(0, a + b t) with b > 0. In this case the first arrival time τ can be simulated by drawing u ~ Exp(1) and solving ∫₀^τ λ̄(s) ds = u, which gives
τ = (−a + √(a² + 2 b u)) / b if a ≥ 0, and τ = −a/b + √(2u/b) otherwise.
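The inversion above can be sketched directly (our own helper name, assuming b > 0):

```python
import math
import random

def first_arrival_linear(a, b, rng):
    """First event time of a Poisson process with rate max(0, a + b*t), b > 0,
    by inverting the integrated rate against u ~ Exp(1)."""
    u = rng.expovariate(1.0)
    if a >= 0.0:
        # solve a*tau + b*tau^2/2 = u
        return (math.sqrt(a * a + 2.0 * b * u) - a) / b
    # the rate is zero until t0 = -a/b, after which it grows linearly from 0
    return -a / b + math.sqrt(2.0 * u / b)

tau = first_arrival_linear(1.0, 2.0, random.Random(0))
```

With a = 0 and b = 2 the integrated rate is t², so the median arrival time is √(ln 2) ≈ 0.83; and with a < 0 no event can occur before −a/b, both of which give easy checks of the two branches.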
A.4 Derivation of the dominating intensity for the logistic regression example
A valid choice of C can be derived as follows. Notice that ∇U_j(β) = (σ(⟨β, x_j⟩) − y_j) x_j, where σ(t) = 1/(1 + e^{−t}), and that Hess U_j(β) = σ′(⟨β, x_j⟩) x_j x_jᵀ with 0 ≤ σ′ ≤ 1/4, so that we obtain
‖Hess U_j(β)‖ ≤ ‖x_j‖² / 4 for all β and all j.
Thus Equation 12 holds with
C = max_{1 ≤ j ≤ n} ‖x_j‖² / 4,
as defined above.