# Encapsulating models and approximate inference programs in probabilistic modules

###### Abstract

This paper introduces the probabilistic module interface, which allows encapsulation of complex probabilistic models with latent variables alongside custom stochastic approximate inference machinery, and provides a platform-agnostic abstraction barrier separating the model internals from the host probabilistic inference system. The interface can be seen as a stochastic generalization of a standard simulation and density interface for probabilistic primitives. We show that sound approximate inference algorithms can be constructed for networks of probabilistic modules, and we demonstrate that the interface can be implemented using learned stochastic inference networks and MCMC and SMC approximate inference programs.

Encapsulating models and approximate inference programs in probabilistic modules

Marco F. Cusumano-Towner Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology marcoct@mit.edu Vikash K. Mansinghka Department of Brain & Cognitive Sciences Massachusetts Institute of Technology vkm@mit.edu

## 1 Introduction

We present the probabilistic module interface, which allows encapsulation of complex latent variable models with custom stochastic approximate inference machinery. The modules interface can be seen as a generalization of previously proposed interfaces for “elementary” random procedures in probabilistic programming languages: it does not require the module author to specify a marginal input-output density. Instead, module authors are only obligated to (i) provide a way to stochastically “regenerate” traces of the internal latent variables, subject to constraints on the module’s output, and (ii) provide a way to calculate a weight for this regeneration. We show this is sufficient for constructing sound approximate inference algorithms over networks of modules, including a Metropolis-Hastings procedure that can be seen as the module-level analogue of the single-site Metropolis-Hastings procedures that are commonly used with “lightweight” implementations of probabilistic programming languages goodman2012church (), wingate2011lightweight ().

This paper illustrates module networks by defining the mathematical interface and providing an example application to linear regression with outliers. This application contains two modules: (i) a complex prior over a binary “model selection” variable determining the prior prevalence of outliers, using a learned bottom-up network for regeneration, and (ii) a linear regression model with binary outlier indicators, using sequential Monte Carlo for regeneration of the outlier indicators (thereby avoiding an exponential sum over all possible indicator settings).

## 2 The probabilistic module interface

In several existing probabilistic programming systems^{1}^{1}1Univariate versions of this interface are used in wood2014new () and dippl () goodman2012church (), mansinghka2014venture (), mansinghka2015bayesdb (), saad2016probabilistic (), probabilistic modeling primitives implement a simulator procedure () which samples outputs given inputs from a distribution and log-density evaluation procedure (), which evaluates . Together, these two procedures enable inference programs to run valid approximate inference algorithms such as MCMC and SMC over the composite probabilistic model. This interface is summarized in Figure 2a.

We propose a stochastic generalization of this interface, called the probabilistic modules interface, that replaces with a stochastic generalization called . Unlike the and interface, the probabilistic modules interface, summarized in Figure 2b, is able to represent probabilistic computations that involve internal ‘auxiliary’ random variables that cannot be exactly marginalized out to compute the log-density . This is made possible by implementing a sampler that samples values for the auxiliary variables given inputs and outputs from a regeneration distribution, which is denoted .

(a) | (b) |

If there are no auxiliary variables , then the procedure reduces to the deterministic procedure. In the presence of auxiliary variables , may be understood as using an unbiased single-sample importance sampling estimate of the output probability, where is the importance distribution: . Indeed, in the extreme setting in which the regeneration distribution is identical to the conditional distribution on auxiliary variables given inputs and outputs (), this estimate is deterministic and exact, and is again identical to . Finally, note that the probabilistic modules interface does not require the auxiliary variables to be stored in memory all at once. This is useful when the log-weight can be incrementally computed during sampling of from and . Such cases are discussed in Section 4.

## 3 Implementing MCMC over probabilistic module networks

When we compose probabilistic modules in a directed acyclic graph, the resulting probabilistic module network has the same declarative semantics as a Bayesian network with nodes for module outputs . Both the module network and the Bayesian network represent the joint distribution on module outputs, with any module auxiliary variables marginalized out. The existence of auxiliary variables in the modules only changes how approximate inference is performed.

Valid MCMC algorithms can easily be constructed over probabilistic module networks. In fact existing standard Metropolis-Hastings (MH) algorithms for inference in Bayesian networks need only a slight modification for use with modules (see Algorithm 1). The only change required is the storage of the current log-weight for each probabilistic module. The current log-weight for module is accessed with lookup-log-weight and updated with update-log-weight during MCMC inference. These values are initialized by running for each module whose output is not observed and for each module whose output is oberved, following a topological ordering of nodes in the network. Note that for single-site MH in a Bayesian network, the lookup-log-weight call and the call of Algorithm 1 are replaced with . A Markov chain constructed from mixtures and cycles of the update of Algorithm 1 admits the posterior as a marginal of its stationary distribution, which is defined on the space of all unobserved module outputs , and all module auxiliary variables .

## 4 Encapsulating models and inference programs in probabilistic modules

We now show how to encapsulate a probabilistic model with internal latents and outputs as a probabilistic module with the declarative semantics of the marginal distribution on outputs , as shown in Figure 1. This is useful if is high dimensional or analytically intractable, and we are unable to implement by marginalizing out exactly.

We begin by defining the module auxiliary variables as the model’s internal latents (). Then, the probabilistic module interface requires us to construct a sampler for the regeneration distribution where is an approximation to such that we can efficiently compute the log-weight . It is sometimes possible to learn a sampler (see morris2001recognition () for a pioneering example of this approach, and the ‘stochastic inverses’ of stuhlmuller2013learning ()) or to learn the model and the regeneration sampler at the same time (e.g. the generative models and recognition networks of kingma2013auto ()) such that the log-weight is tractable. We illustrate this approach for Module of Figure 2a, which uses a learned stochastic inverse network trained using samples from the prior of the model as described in stuhlmuller2013learning (). The log-weight is tractable because the learned contains no additional random variables beyond those in the model itself ().

However, if we wish to use generally applicable stochastic inference programs implementing MCMC andrieu2003introduction () and sequential Monte Carlo (SMC) del2006sequential () for the regeneration distribution , it is not possible to compute because the marginal output density of the stochastic inference program is intractable. To handle these cases, we augment the auxiliary variables of the module to include the execution history of the stochastic inference program (). We define the distribution sampled by as the joint distribution of the stochastic inference program over its execution history and output , denoted . We then extend the distribution sampled by to also sample an execution history alonside the model latents , using a ‘meta-inference’ program cusumano2016quantifying () that samples inference execution history given inference output from a distribution that approximates the conditional distribution on inference execution histories , so that .

As shown in cusumano2016quantifying () and cusumanotowner2016smc (), it is possible to construct meta-inference programs for sequential variants of MCMC using detailed balance transition kernels and for multiple-particle SMC with optional detailed balance transition kernels such that the log-weight can be efficiently computed on the fly when sampling from and . As the accuracy of improves (as happens when the number of particles in SMC increases) the log-weights sampled from the probabilistic module converge to the log density . Module of Figure 3 uses SMC for .

### Acknowledgements

This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. This research was conducted with Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

## References

- [1] Noah Goodman, Vikash Mansinghka, Daniel M Roy, Keith Bonawitz, and Joshua B Tenenbaum. Church: a language for generative models. arXiv preprint arXiv:1206.3255, 2012.
- [2] David Wingate, Andreas Stuhlmüller, and Noah D Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In AISTATS, pages 770–778, 2011.
- [3] Frank Wood, Jan-Willem van de Meent, and Vikash Mansinghka. A new approach to probabilistic programming inference. In AISTATS, pages 1024–1032, 2014.
- [4] Noah D Goodman and Andreas Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 2016-10-31.
- [5] Vikash Mansinghka, Daniel Selsam, and Yura Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
- [6] Vikash Mansinghka, Richard Tibbetts, Jay Baxter, Pat Shafto, and Baxter Eaves. Bayesdb: A probabilistic programming system for querying the probable implications of data. arXiv preprint arXiv:1512.05006, 2015.
- [7] Feras Saad and Vikash Mansinghka. Probabilistic data analysis with probabilistic programming. arXiv preprint arXiv:1608.05347, 2016.
- [8] Quaid Morris. Recognition networks for approximate inference in bn20 networks. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 370–377. Morgan Kaufmann Publishers Inc., 2001.
- [9] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in neural information processing systems, pages 3048–3056, 2013.
- [10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [11] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I Jordan. An introduction to mcmc for machine learning. Machine learning, 50(1-2):5–43, 2003.
- [12] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential monte carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
- [13] Marco F Cusumano-Towner and Vikash K Mansinghka. Quantifying the probable approximation error of probabilistic inference programs. arXiv preprint arXiv:1606.00068, 2016.
- [14] Marco F Cusumano-Towner and Vikash K Mansinghka. Measuring the non-asymptotic convergence of sequential Monte Carlo samplers using probabilistic programming. Submitted to the NIPS 2016 Workshop Advances in Approximate Bayesian Inference, 2016.

## Appendix A Appendix A: Deriving single-site Metropolis Hastings in module network

Consider a network of probabilistic modules indexed by . Let denote the internal auxiliary variables or module , let denote the output of module , and let denote the inputs to module . Each module input is a tuple of module outpus for , where is the sequence of parent module indices of module . Let denote the set of children of module : . Let denote the simulation distribution for module and let denote the regeneration distribution. Define the collection of all module auxiliary variables by and the collection of all module outputs by . Define the joint simulation distribution over the network as . The declarative semantics of the module network are derived from the marginal distribution over outputs . Suppose a subset of the modules are observed, meaning their output is constrained to a value . The target distribution of the network is then defined by . We will derive a Metropolis-Hastings algorithm for which the stationary distribution is the distribution . If the algorithm converges to this joint distribution, then the marginal over only converges to the network’s target distribution.

Consider a standard single-site Metropolis-Hastings update targeting the distribution where we propose a new value for the output of some module . Let and denote the state prior to the update. We propose a new value , and then propose . We also propose for all modules , where includes the updated value . We perform an MH accept/reject step using this proposal, and with the ‘local’ posterior as the target distribution. The acceptance ratio is:

(1) |

The log-acceptance ratio is:

(2) | ||||

(3) | ||||

(4) |

This is the acceptance ratio used in Algorithm 1. Therefore Algorithm 1 corresponds to a valid MH update that can be used to compose valid MH algorithms that converge to the posterior , such as a single-site random-scan mixture over Algorithm 1 applications for all unobserved modules .