Nesting Probabilistic Programs
Abstract
We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, but that existing systems take approaches that lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce, and prove the correctness of, a new online nested Monte Carlo estimation method that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines.
1 0pt
4pt plus 2pt minus 2pt0pt plus 2pt minus 0pt \titlespacing
1.1 0pt
4pt plus 2pt minus 2pt0pt plus 2pt minus 0pt \titlespacing
0pt
4pt plus 2pt minus 2pt0pt plus 2pt minus 0pt
2 Introduction
Probabilistic programming systems (PPSs) allow probabilistic models to be represented in the form of a generative model and statements for conditioning on data (Goodman et al., 2008; Gordon et al., 2014). Informally, one can think of the generative model as the definition of a prior, the conditioning statements as the definition of a likelihood, and the output of the program as samples from a posterior distribution. Their core philosophy is to decouple model specification and inference, the former corresponding to the user-specified program code and the latter to an inference engine capable of operating on arbitrary programs. Removing the need for users to write inference algorithms significantly reduces the burden of developing new models and makes effective statistical methods accessible to non-experts.
Some, so-called universal, systems (Goodman et al., 2008; Goodman and Stuhlmüller, 2014; Mansinghka et al., 2014; Wood et al., 2014) further allow the definition of models that would be hard, or even impossible, to convey using conventional frameworks such as graphical models. One enticing manner they do this is by allowing arbitrary nesting of models, known in the probabilistic programming literature as queries (Goodman et al., 2008), such that it is easy to define and run problems that fall outside the standard inference framework (Goodman et al., 2008; Mantadelis and Janssens, 2011; Stuhlmüller and Goodman, 2014; Le et al., 2016). This allows the definition of models that could not be encoded without nesting, such as experimental design problems (Ouyang et al., 2016) and various models for theory-of-mind (Stuhlmüller and Goodman, 2014). In particular, models that involve agents reasoning about other agents require, in general, some form of nesting. For example, one might use such nesting to model a poker player reasoning about another player as shown in Appendix A. Here the nesting is essential because each player has access to information the other player does not and must carry out their own inference to try and unearth this information. As machine learning increasingly starts to try and tackle problem domains that require interaction with humans or other external systems, such as the need for self-driving cars to account for the behavior of pedestrians, we believe that such nested problems are likely to become increasingly common and that PPSs will form a powerful tool for encoding them.
However, previous work in the probabilistic-programming literature has, in general, implicitly, and incorrectly, assumed that the convergence results from standard inference schemes carries over directly to the nested setting. In practice, performing inference for nested queries falls outside the scope of conventional convergence proofs and so additional work is required to prove the validity of PPS inference engines for nested queries. Such problems constitute special cases of nested estimation. In particular, the use of Monte Carlo (\mc) methods by most PPSs mean they form particular instances of nested Monte Carlo (NMC) estimation (Hong and Juneja, 2009). Recent work (Rainforth et al., 2016a, 2017a; Fort et al., 2017) has demonstrated the consistency and convergence rates for NMC estimation for general classes of models. These results show that NMC is consistent, but that it entails a convergence rate in the total computational cost which decreases exponentially with the depth of the nesting. Furthermore, additional assumptions are required to achieve this convergence, most noticeably that, except in a few special cases, one needs to drive not only the total number of samples used to infinity, but also the number of samples used at each layer of the estimator, a requirement generally flaunted by existing PPSs.
The aim of this work is to formalize the notion of query nesting and use these recent NMC results to investigate the statistical correctness of the resulting procedures carried out by PPS inference engines. To do this, we postulate that there are three distinct ways one might nest one query within another: sampling from the conditional distribution of another query (which we refer to as nested inference), factoring the trace probability of one query with the partition function estimate of another (which we refer to as nested observation), and using expectation estimates calculated using one query as first class variables in another. We use the aforementioned NMC results to assess the relative correctness of each of these categories of nesting. In the interest of exposition, we will mostly focus on the PPS Anglican (Tolpin et al., 2016; Wood et al., 2014) (and also occasionally Church (Goodman et al., 2008)) as a basis for our discussion, but note that our results apply more generally. For example, our nested inference case covers the problem of sampling from cut distributions in OpenBugs (Plummer, 2015).
We find that nested inference is statistically challenging and incorrectly handled by existing systems, while nested observation is statistically straightforward and done correctly. Using estimates as variables turns out to be exactly equivalent to generic NMC estimation and must thus be dealt with on a case-by-case basis. Consequently, we will focus more on nested inference than the other cases.
To assist in the development of consistent approaches, we further introduce a new online NMC scheme and show its convergence rate is only nominally worse than conventional NMC. This scheme obviates the need to revisit previous samples when refining estimates, thereby substantially simplifying the process of writing a consistent online nested inference scheme, as required by most PPSs.
3 Background and Preliminaries
3.1 Nested Monte Carlo
We start by providing a brief introduction to NMC, using similar notation to that of Rainforth et al. (2017a). Conventional \mcestimation approximates an intractable expectation of a function using
(1) |
where , resulting in a mean squared error (MSE) that decreases at a rate . For nested estimation problems, then is itself intractable, corresponding to a nonlinear mapping of a (nested) estimation. Thus in the single nesting case, giving
where each is drawn independently and is now a NMC estimate using samples.
More generally, one may have multiple layers of nesting. To notate this, we first presume some fixed integral depth (with corresponding to conventional estimation), and real-valued functions . We then recursively define ,
for . Our goal is to estimate , for which the NMC estimate is defined recursively using
for , where each is drawn independently. Note that there are multiple values of for each associated and that is still a random variable given , while the overall cost of the NMC estimator is .
As shown by Rainforth et al. (2017a), then if each is continuously differentiable and
, then the MSE converges at rate
(2) |
where and are bounds on the magnitude of the first and second derivatives of respectively, and represents asymptotically dominated terms. Theorem 2 of Rainforth et al. (2017a) further tells us that the continuously differentiability assumption to hold almost surely (rather than absolutely) for convergence more generally, sure that functions with measure-zero discontinuity points still, in general, converge.
We see that for (2) that if any of the remain fixed, there is a minimum error that can be achieved: convergence requires each . As we will later show, many of the shortfalls in dealing with nested queries by existing PPSs revolve around implicitly fixing and taking only .
For a given total sample budget , the bound is tightest when giving a convergence rate of . The intuition behind this potentially surprising optimum setting is that the variance is mostly dictated by and bias by the other . We see the unfortunate result that the convergence rate diminishes exponentially with . However, this optimal setting of the still gives a substantially faster rate than the from naïvely setting .
3.2 The Anglican PPS
Anglican is a universal probabilistic programming language integrated into Clojure (Hickey, 2008), a dialect of Lisp. There are two important ideas to understand for reading Clojure: almost everything is a function and parentheses cause evaluation. For example, is coded as \lsi(+ a b) where \clj+ is a function taking two arguments and the parentheses cause the function to evaluate.
Anglican inherits most of the syntax of Clojure, but extends it with the key special forms \sampleand \observe(Tolpin et al., 2015, 2016), between which the distribution of the query is defined. Informally, \samplespecifies terms in the prior and \observeterms in the likelihood. More precisely, \sampleis used to make random draws , where is a subset of the variables in scope at the point of sampling, and \observeis used to condition on data with defined in the same way as . The syntax of \sampleis to take a distribution object as its only input and return a sample, while \observetakes a distribution object and an observation and returns \lsinil, while changing the program trace probability in Anglican’s back-end. Anglican provides a number of elementary random procedures in the form of distribution classes for common sampling distributions, but also allows users to define their own distribution classes using the \defdistmacro. Distribution objects are generated by calling the class constructor with the required parameters, e.g. \lsi(normal 0 1).
Anglican queries are written using the macro \defquery. This allows users to define a model using a mixture of \sampleand \observestatements and deterministic code, and bind that model to a variable. As a simple example,
corresponds to a model where we are trying to infer the mean and standard deviation of a Gaussian given some data. The syntax of \defqueryis \lsi(defquery name [args] body) such that we are binding the query to variable \cljmy-query here. The query starts by sampling and , before constructing a distribution object \lsilik to use for the observations. It then maps over each datapoint and observes it under the distribution \lsilik. After the observations are made, \lsimu and \lsisigma are returned from the variable-binding \clletblock and then by proxy the query itself. Denoting the data as this particular query defines the correctly normalized joint distribution (note some queries are not normalized)
Inference on a query is performed using the macro \doquery, which produces a lazy infinite sequence of approximate samples from the conditional distribution and, for appropriate inference algorithms, an estimate of the partition function. Its calling syntax is \lsi(doquery inf-alg model inputs & options).
Key to our purposes is Anglican’s ability to nest queries within one another, as done by, for example, Le et al. (2016). Though \doqueryis not allowed directly within a \defqueryblock, one can still nest queries by using the special form \conditionalwhich takes a query and returns a distribution object constructor ostensibly corresponding to the conditional distribution (informally the posterior) defined by the query, for which the inputs to the query become the parameters. We will introduce \conditionalin more depth in the next section, where we will show that its true behavior deviates from this, thereby leading to inconsistent nested inference schemes.
4 Nested Inference
One of the clearest ways one might want to nest queries is by sampling from the conditional
distribution of one query inside another. A number of examples of this are provided
for Church
in Stuhlmüller and Goodman (2014).
Consider the following unnested model using the Anglican function declaration \defm
compared to the following nested query model
where we have replaced \defmwith a query nesting using \conditional. The unnormalized distribution on traces for the first model is straightforwardly given by
for which we can use conventional Monte Carlo inference schemes. The second model differs in that is conditionally normalized before being used in the outer query
The key point here is that the partial normalization constant depends on and so is doubly intractable because we cannot evaluate our unnormalized target distribution exactly. By normalizing within the outer query, \conditionalhas changed the distribution defined by the program.
Another interesting way of looking at this is that wrapping \cljinner in \conditionalhas “protected” from the conditioning in \cljinner (noting that ) such that the \observestatement only affects the probability of given and not the probability of . This insight shows us why the nested inference framework is equivalent to that of sampling from a cut distribution when there is only a single layer of nesting.
Such nested inference problems fall under a more general framework of inference for so-called doubly (or multiply) intractable distributions (Murray et al., 2006; Stuhlmüller and Goodman, 2014). The key feature of these problems is that they include terms with unknown, parameter dependent, normalization constants. For nested probabilistic programming queries, this is manifested in conditional normalization within a query. When there is only a single layer of nesting, these problems are equivalent to the notion of performing inference in “cut models” (Plummer, 2015), whereby the sampling of subsets of the variables is made with factors of the overall likelihood omitted.
A more realistic demonstration of the utility of defining nested inference problems is given in Appendix A where we consider the modeling of two poker players.
4.1 Formalization
To formalize the nested inference problem more generally, let and denote all the random variables of an outer query that are respectively passed and not passed to the inner query. Further, let denote all random variables generated in the inner query – for simplicity, we will assume, without loss of generality, that these are all returned to the outer query, but that some may not be used. The unnormalized density for our outer query can now be written in the form
(3) |
where is the normalized density of the outputs of the inner query and encapsulates all other terms influencing the trace probability of the outer query. Now the inner query defines an unnormalized density that can be evaluated pointwise and we have
(4) |
(5) |
where is our target distribution and we can directly evaluate the numerator, but the denominator is intractable and must be evaluated separately for each possible value of . Our previous example is now achieved by fixing and . We can further straightforwardly extend to the multiple nesting layer setting by recursively defining in the same way as , rather than presuming that it is available in closed form.
4.2 Relationship to Nested Estimation
To relate the nested inference problem back to the nested estimation formulation, we consider using a proposal to calculate the expectation of some arbitrary function under as per self-normalized importance sampling
(6) |
Both the denominator and numerator are now nested expectations where the nonlinearity comes from the fact that we are using the reciprocal of an expectation. A similar reformulation could also be applied in cases with multiple layers of nesting, i.e. where \cljinner itself makes use of another query. The result can also be directly extended to the sequential Monte Carlo (SMC) setting by invoking extended space arguments (Andrieu et al., 2010).
Typically is not known upfront and we instead return an empirical measure from the program in the form of weighted samples which can later be used to estimate an expectation. Namely, if we sample and and all return samples (such that each is duplicated times in the sample set) then our unnormalized weights are given by
(7) |
This, in turn, gives us the empirical measure
(8) |
where is a delta function centered on .
By definition, the convergence of this empirical measure to the target
requires that
expectation estimates calculated using it converge in
probability for any integrable (presuming our proposal
is valid). We thus see that
the convergence of the ratio of nested
expectations in (6) for any arbitrary , is equivalent
to the produced samples converging to the distribution defined by the program.
The NMC results then tell us this will happen in the limit
provided that is strictly positive for all possible (as otherwise
the problem becomes ill-defined).
This problem equivalence will hold whether we elect to use this importance sampling based approach or not, while we see that our finite sample estimates must be biased for non-trivial by the convexity of and Theorem 4 of Rainforth et al. (2017a). Presuming we cannot produce exact samples from the inner query or exploit one the special cases (these are respectively considered in this later in Section 4.7 and Section 4.6), we thus see that there is no “silver bullet” that can reduce the problem to a standard estimation.
For example, if we were to use an MCMC sampler for the inner query then we find that we need to start a new Markov chain for each call, because each value of defines a different local inference problem. Thus we will inevitably produce biased samples for , giving an asymptotic bias unless each call of the inner query runs the MCMC sampler for an infinitely long time; this is thus no better than using importance sampling. We note that such an approach effectively equates to what is referred to as multiple imputation by Plummer (2015).
Another possible approach would be to, instead of drawing samples to estimate (6) directly by NMC, importance sample times for each call of the inner query and then return a single sample from these, drawn in proportion to the importance weights. However, if we Rao Blackwellize (Casella and Robert, 1996) the selection of the sample (which always improves the fidelity of the estimator), we find that this recovers (8).
4.3 Shortfalls of Existing Systems
Using the empirical measure (8) provides one possible manner of producing a consistent estimate of our target by taking and so we can use this as a gold-standard reference approach (with a large value of ) to assess whether Anglican returns samples for the correct target distribution. To this end, we ran Anglican’s importance sampling inference engine on the simple model introduced earlier and compared its output to the reference approach using and . As shown in Figure 1, the samples produced by Anglican are substantially different to the reference code, demonstrating that the outputs do not match their semantically intended distribution. For reference, we also considered the distribution induced by the “unnested” model, where \conditionalis replaced by \cljdefm, and a naïve estimation scheme where a sample budget of is used for each call to \cljinner, effectively corresponding to ignoring the \observestatement by directly returning the first draw of . We see that the unnested model defines a noticeably different distribution, while the behavior of Anglican is similar, but distinct, to ignoring the \observestatement in the inner query. Further investigation shows that the default behavior of \conditionalin a query nesting context is equivalent to an un-Rao-Blackwellized form of the reference code but with , inducing a substantial bias. More generally, the Anglican source code shows that \conditionaldefines a Markov chain generated by equalizing the output of the weighted samples generated by running inference on the query. When used to nest queries, this Markov chain is only ever run for a finite length of time, specifically one accept-reject step is carried out, and so does not produce samples from the true conditional distribution.
Plummer (2015) noticed that WinBugs and OpenBugs (Spiegelhalter et al., 1996) similarly do not provide valid inference when using their cut function primitives, which effectively allow the definition of nested inference problems. However, they do not notice the equivalence to the NMC formulation and instead propose a heuristic for reducing the bias that itself has no theoretical guarantees.
4.4 Online Nested Monte Carlo
Now that we have established that it will, in general, be necessary to use a nested estimation scheme for nested inference problems, it is natural to ask: what behavior do we need for Anglican’s \conditional, or query nesting through sampling more generally, to ensure consistency?
At a high level, the NMC results show us that we need the computational budget of each call to the inner query to become arbitrarily large for convergence, such that we use an infinite number of samples at each layer of the estimator. However, this will typically be highly inconvenient to actually implement in a PPS where one usually desires to provide online estimates in the form of a lazy sequence of samples that converges to the target distribution. Imagine that we have already calculated an NMC estimate, but now desire to refine it further. In general, this will require us to increase all , meaning that we need to revisit previous samples of the outermost query and refine each of their estimates. This significantly complicates practical implementation, necessitating additional communication between queries, introducing overheads, and potentially substantially increasing the memory requirements.
To alleviate these issues, we propose to instead only increase the computational budget of new calls to the inner query, such that earlier call use less samples than later calls. This removes the need for communication between different calls and requires only the storage of the number of times the outermost query has previously been sampled. We refer to this approach as online NMC (ONMC), which, to the best of our knowledge, has not been previously considered in the literature. As we now show, it only leads to a surprisingly modest reduction in the converge rate of the resultant estimator compared to NMC.
Let be monotonically increasing functions dictating the number of samples used by ONMC at depth for the -th iteration of the outermost estimator. By contrast, the standard NMC formulation will use such samples. The ONMC estimator is now defined recursively using
such that later samples of the outermost estimator use more inner samples. We are now ready to show the consistency of the ONMC estimator.
Theorem 1.
If each for some constants then the mean squared error converges to zero as .
Proof.
Let and let be a NMC estimator that uses samples at each layer. We have
Substituting in the variance and bias for now gives
(9) |
where as before, represents asymptotically dominated terms. Here the first two terms clearly tend to zero as . For the last term, we invoke Kronecker’s lemma which tells us that if . Except for the , all terms in the sum as constants that unrelated to , such that we have for some . Thus by using our assumptions on , we have that for some . Therefore,
(10) |
because is finite (remembering ) and the condensation test tells us that is finite. We have thus shown that the last term in (9) converges to zero as , and thus that also does, giving the required result. ∎
We see that the ONMC approach converges provided that the increases at a logarithmic or faster rate. Though some sub-logarithmic schedules still convergence, not all do. For example, does not.
In the case where increases at a polynomial rate, we can further quantify the rate of convergence.
Corollary 1.
If each for some constants , then
(11) |
where is the -th generalized harmonic number of order .
Proof.
The result follows straightforwardly from combining our polynomial bound on the and (9), then substituting in for the resulting term. ∎
The function is strictly monotonically increasing for all and sublinear for all . The latter property means that our bound goes to zero, as required to not violate Theorem 1, while the former means that the convergence rate is slower than the of the NMC estimator. This slower rate is partially mitigated by the fact that the computation cost of ONMC is less than that of NMC. However, it turns out that this speed up is only a constant factor, such that the total computation cost of ONMC is if each , as per NMC. For NMC, this leads to a convergence rate of , which is minimized when , giving a rate of . For ONMC, we instead get the convergence rate
(12) |
For , converges to which is , giving an overall rate of . Now for any , the first term in (12) will ensure a worse rate than this. Similarly, for any , then the term will again give a worse rate (remembering that is monotonically increasing). We thus see that is also optimal for ONMC and that, with this setting, its convergence rate is a factor of worse than that of NMC.
To test ONMC empirically, we consider the simple analytic model given in Section 6.1 of Rainforth et al. (2017a), setting . The rationale for setting a minimum value of is that we want to minimize the burn-in effect of ONMC – earlier samples will be more biased than later samples and we can mitigate this by ensuring a minimum value for . More generally, we recommend setting (in the absence of other information) where is a minimum overall budget we expect to spend. In Figure 2, we have chosen to set deliberately low so as to emphasize the differences between NMC and ONMC. Given our value for , the ONMC approach is identical to fixing for , but unlike fixing , it continues to improve beyond this because it is not limited by asymptotic bias. Instead, we see an inflection point-like behavior around , with the rate recovering to effectively match that of the NMC estimator. We see that, for this example, the error of ONMC after using samples is around a factor of two of that of NMC, which we believe will typically be an acceptable performance drop given the substantial implementation simplifications it provides.
4.5 Using Online Nested Monte Carlo in PPSs
Using ONMC based estimation schemes to ensure consistent estimation for nested inference in PPSs is very straightforward – we can simply keep track of the number of iterations the outermost query has been run for, and use this to set the number of iterations used for the inner queries. In fact, we do not even need this minimal level of communication – can be inferred from the number of times we have previously run inference on the current query, the current depth , and .
When using properly weighted (Naesseth et al., 2015) methods such as importance sampling or SMC, inner queries can either return a full set of weighted samples each time they are called or generate this full set of weighted samples, but only return one drawn in proportion to the inner query weights . For the former case, the weights in the outer query are unchanged from (7), but is now a function of . For the latter, these weights are now where is the returned sample. This latter case is a strictly inferior estimator, but will often be necessary in practice because of the semantical complications associated with, for example, returning multiple samples from a single \sampledraw in Anglican. Returning to Figure 1, we see that using ONMC with nested importance sampling and only returning a single sample corrects the previous issues with how Anglican deals with nested inference, producing samples indistinguishable from the reference code.
Though one can use the results of Fort et al. (2017) to show the correctness of instead using an MCMC estimator for the outer query, the correctness of using MCMC methods for the inner queries is not explicitly covered by existing results. One would intuitively expect the ONMC results to carry over – as all the inner queries will run their Markov chains for an infinitely long time, thereby in principle returning exact samples – but we leave formal proof of this case to future work.
4.6 Discrete or Deterministic Input Variables
One special case where consistency can be maintained without requiring infinite computation for each nested call is when the variables passed to the inner query can only take on, say , finite possible values. Of particular note, is the case when only deterministic variables are passed to the inner query, corresponding to , which, for example, forms the theoretical basis for the “programs as proposals” approach of Cusumano-Towner and Mansinghka (2018). As per Theorem 5 of Rainforth et al. (2017a), we can rearranged such problems to separate estimators such that the standard Monte Carlo error rate can be achieved. This is perhaps easiest to see by noting that for such problems, can only on distinct values, leading to a separate, non nested, inference problem through enumeration. For repeated nesting, the rearrangement can be recursively applied until one achieves a complete set of non-nested estimators. To avoid inferior NMC convergence rates, this special case requires explicit rearrangement or a specialist consideration by the language back-end (as done by e.g. Stuhlmüller and Goodman (2012, 2014); Cornish et al. (2017)). For example, one can dynamically catch the inner query being called with the same inputs, e.g. using memoization, and then exploit the fact that all such cases target the same inference problem. Care is required in these approaches to ensure the correct combination with outer query, e.g. returning properly weighted samples and ensuring the budget of the inner queries remains fixed.
4.7 Exact Sampling
It may, in fact, be possible to provide consistent estimates for many nested query problems without requiring infinite computation for each nested call by using exact sampling methods such as rejection sampling or coupled Markov chains (Propp and Wilson, 1996). Such an approach is taken by Church (Goodman et al., 2008), wherein no sample ever returns until it passes its local acceptance criterion as a hierarchical rejection sampler. Church is able to do this because it only supports hard conditioning on events with finite probability, allowing it to take a guess-and-check process that produces an exact sample in finite time, simply sampling from the generative model until the condition is satisfied. Although the performance still clearly gets exponentially worse with nesting depth, this is a change in the constant factor of the computation, not its scaling with the number of samples taken: generating a single exact sample of the distribution has a finite expected time using rejection sampling which is thus a constant factor in the convergence rate.
Unfortunately, most problems require conditioning on measure zero events because they include continuous data – they require a soft conditioning akin to the inclusion of a likelihood term – and so cannot be tackled using Church. Constructing a practical generic exact sampler for soft conditioning in an automated way is likely to be insurmountably problematic in practice. Nonetheless, it does open up the intriguing prospect of a hypothetical system that provides a standard Monte Carlo convergence rate for nested inference. This assertion is a somewhat startling result: it suggests that Monte Carlo estimates made using nested exact sampling methods have a fundamentally different convergence rate for nested inference problems (though not nested estimation problems in general) than, say, nested self-normalized importance sampling (SNIS).
5 Nested Observation
An alternative way one might wish to nest queries is through nested observation, whereby we use the partition function estimate of one program to factor the trace probability of another. In its simplest form, the “observed” variables in the outer program are the inputs to the inner program. We can carry this out in Anglican using the following custom distribution object.
We can then “observe” another query by calling
which will generate a partition function estimate using the inner query and then use this to weight the trace probability of the outer query. This could, for example, be used to construct a PMMH sampler Andrieu et al. (2010).
Unlike the nested inference case, this approach turns out to be valid even if our budget is held fix, provided that the partition function estimate is unbiased, as is satisfied by, for example, importance sampling and SMC. In fact, it is important to hold the budget fixed to achieve a Monte Carlo convergence rate. In this nested observation case, we can define our target density as
(13) |
where is as before (except that we no longer have returned variables from the inner query) and is the true partition function of the inner query when given input . In practice, we cannot evaluate exactly, but instead produce unbiased estimates . Using an analogous self-normalized importance sampling to the nested inference case leads to the weights
(14) |
and corresponding empirical measure
(15) |
such that we are now conducting conventional Monte Carlo estimation, but our weights are now themselves random variables for a given due to the term. However, the weights are unbiased estimates of the “true weights” such that we have proper weighting (Naesseth et al., 2015) and thus convergence at the standard Monte Carlo rate, provided the budget of the inner query remains fixed. This result also follows directly from Theorem 6 of Rainforth et al. (2017b), which further ensures no complications arise when observing multiple queries if the corresponding partition function estimates are generated independently. These results further trivially extend to the repeated nesting case by recursion, while using the idea of pseudo-marginal methods (Andrieu and Roberts, 2009), the results also immediately extend to the case where the outermost query uses MCMC based inference.
Though, as demonstrated, there are no statistical complications with nested observation (other that the need for unbiased partition function estimates), there are potentially some semantical complications if we wish to carry out more advanced tasks. For example, we might desire to observe that the output of another query is a particular value. Unfortunately, this is not possible in general because the change of variables caused by deterministic mappings within the inner program often mean that it is not possible to directly evaluate the density of the outputs – we rely on the law of the unconscious statistician instead. One thing that is possible though is to condition on the output of a subset of the sampling statements returning particular values. This process was first carried out by Rainforth et al. (2016b), where they use a code transformation to generate an new program that generates the required partition functions estimates and then use this to construct marginal maximum a posteriori estimators and PMMH samplers. The idea has since been used in the Turing PPS to construct “composable” inference algorithms like PMMH (Ge et al., 2018). Our results confirm the statistical validity of these approaches.
6 Estimates as Variables
Our final case, is that one might wish to use estimates as first class variables in another query. In our previous examples, nonlinear usage of expectations only ever manifested through inversion of the normalization constant. These methods are therefore clearly insufficient for expressing more general nested estimation problems as would be required by, for example, experimental design. Using estimates as first class variable in a program allows, in principle, for the encoding of any nested estimation problem – using these estimates in the outer query gives rise to nonlinear mappings and expectations taken with respect to the outer query become nested estimation problem. Thus, inevitably, the validity of such approaches is equivalent to that of NMC (or ONMC depending on the approach taken) more generally and so must be set up appropriately. In particular, we need to ensure that the budgets used for the first class variables estimates increase as we take more samples of the outermost query.
One can implicitly use expectation estimates as first class variables in Anglican by either calling \doqueryinside a \defdistdeclaration or in a \defnfunction passed to a query using \cljwith-primitive-procedures, a macro providing the appropriate wrappings to convert a Clojure function to an Anglican one. An example of the later is shown in Appendix B, which provides generic Anglican code for carrying out experimental design problems.
7 Conclusions
We have formalized the notion of nesting probabilistic program queries and investigated the statistical validity of different categories of nesting. We have found that current systems tend to use methods that lead to asymptotic bias for nested inference problems, but that they are consistent for nested observation. We have shown the conditions required to remedy the former case and developed a new online estimator that simplifies the process of constructing algorithms that satisfy these conditions.
Acknowledgements
I would like to thank Yee Whye Teh for feedback on an earlier draft of the work. My research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. Some of the work was undertaken while I was still at the Department of Engineering Science and was supported by a BP industrial grant.
Appendix A Case Study: Simulating a Poker Player
As a more realistic demonstration of the utility for allowing nested inference in probabilistic programs, we consider the example of simulating a poker player who reasons about another player. Anglican code for this example is given in Figure 3. In the interest of exposition, we consider the simple situation where both players are short-stacked, such that player 1’s (P1’s) only options are to fold or go all-in. Presuming that P1 has bet, the second player (P2) can then either fold or call. One could easily adapt the model to consider multiple players, additional betting options, and multiple rounds of betting (for which addition levels to the nesting might be required).
The two key functions here are the queries \cljp1-sim and \cljp2-sim, representing simulators for the two players. The simulations for each player exploit a utility function, which is a deterministic function of the two hands and actions taken, to reflect the preference for actions which give a bigger return. P1 acts before P2 and must thus not only reason about what hands P2 might have, but also the actions P2 might take. To do this, P1 utilizes the simulation for P2, which takes in as inputs an observed action from P1 and a true hand for P2, returning actions taken by P2. P2 has incomplete information – he does not know P1’s hand – and so the simulation of P2’s action requires inference, with a normalization constant that depends on P2’s hand and P1’s action, both of which are random variables in the outer query \cljp1-sim. This, therefore, corresponds to the nested inference class of nested estimation problems.
Keen-eyed readers may have noticed that this use of \conditionalin \cljp1-sim is distinct to elsewhere in the paper as we have explicitly used SMC inference with a provided number of particles for \conditional. This provides a roundabout means of controlling the computational budget for calls to \conditional, as we showed is required for convergence in Section 4. In Figure 6 we see that changing changes the probability of each player betting under the model. Increasing corresponds to P2’s calculation being carried out more carefully and so it is perhaps not surprising that it becomes less beneficial for P1 to bet with a weak hand as increases. Nonetheless, even for a large it remains beneficial for P1 to bet with less than average hands, reflecting the fact that money is already invested in the pot through the blinds and the opportunity to win through bluffing. The results for and are very similar, suggesting that they are close to the true solution.
Appendix B Experimental Design Example
An example application of using estimates as first class variables if provided by Bayesian experimental design (Chaloner and Verdinelli, 1995). Figure 7 shows generic Anglican code for carrying out Bayesian experimental design, providing a consistent means of carrying out this class of nested estimation problems. (Rainforth et al., 2017a, Figure 6) shows the convergence code equivalent to that of Figure 7 for a delay discounting model. This shows the convergence (or more specifically lack there of) in the case where is held fixed and the superior convergence achieved when exploiting the finite number of possible outputs to produce a reformulated, standard Monte Carlo, estimator. It therefore highlights both the importance of increasing the number of samples used by the inner query and exploiting our outlined special cases when possible.
Footnotes
- Even though the nested calls are made inside the conditioning predicate for these examples, the probabilistic semantics of Church means these behave as per nesting inference.
- It is important to also take the convention that the weight is zero whenever to avoid complications.
References
- C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pages 697–725, 2009.
- C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 2010.
- G. Casella and C. P. Robert. Rao-blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
- K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 1995.
- R. Cornish, F. Wood, and H. Yang. Efficient exact inference in discrete Anglican programs. 2017.
- M. F. Cusumano-Towner and V. K. Mansinghka. Using probabilistic programs as proposals. arXiv preprint arXiv:1801.03612, 2018.
- G. Fort, E. Gobet, and E. Moulines. MCMC design-based non-parametric regression for rare-event. application to nested risk computations. Monte Carlo Methods Appl, 2017.
- H. Ge, K. Xu, and Z. Ghahramani. Turing: a language for composable probabilistic inference. In AISTATS, 2018.
- N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. UAI, 2008.
- N. D. Goodman and A. Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages. 2014.
- A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. In Proceedings of the on Future of Software Engineering. ACM, 2014.
- R. Hickey. The Clojure programming language. In Proceedings of the 2008 symposium on Dynamic languages, page 1. ACM, 2008.
- L. J. Hong and S. Juneja. Estimating the mean of a non-linear function of conditional expectation. In Winter Simulation Conference, 2009.
- T. A. Le, A. G. Baydin, and F. Wood. Nested compiled inference for hierarchical reinforcement learning. In NIPS Workshop on Bayesian Deep Learning, 2016.
- V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
- T. Mantadelis and G. Janssens. Nesting probabilistic inference. arXiv preprint arXiv:1112.3785, 2011.
- I. Murray, Z. Ghahramani, and D. J. MacKay. MCMC for doubly-intractable distributions. In UAI, 2006.
- C. A. Naesseth, F. Lindsten, and T. B. Schön. Nested sequential Monte Carlo methods. In ICML, 2015.
- L. Ouyang, M. H. Tessler, D. Ly, and N. Goodman. Practical optimal experiment design with probabilistic programs. arXiv preprint arXiv:1608.05046, 2016.
- M. Plummer. Cuts in bayesian graphical models. Statistics and Computing, 25(1):37–43, 2015.
- J. G. Propp and D. B. Wilson. Exact sampling with coupled markov chains and applications to statistical mechanics. Random structures and Algorithms, 1996.
- T. Rainforth, R. Cornish, H. Yang, and F. Wood. On the pitfalls of nested Monte Carlo. NIPS Workshop on Advances in Approximate Bayesian Inference, 2016a.
- T. Rainforth, T. A. Le, J.-W. van de Meent, M. A. Osborne, and F. Wood. Bayesian Optimization for Probabilistic Programs. In NIPS, pages 280–288, 2016b.
- T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood. On the opportunities and pitfalls of nesting Monte Carlo estimators. arXiv preprint arXiv:1709.06181, 2017a.
- T. Rainforth, T. A. Le, J. W. van de Meent, M. A. Osborne, and F. Wood. Bayesian optimization for probabilistic programs. arXiv e-prints, arXiv:1707.04314, 2017b.
- D. Spiegelhalter, A. Thomas, N. Best, and W. Gilks. Bugs 0.5: Bayesian inference using Gibbs sampling manual (version ii). MRC Biostatistics Unit, Cambridge, 1996.
- A. Stuhlmüller and N. D. Goodman. A dynamic programming algorithm for inference in recursive probabilistic programs. In Statistical Relational AI workshop at UAI, 2012.
- A. Stuhlmüller and N. D. Goodman. Reasoning about reasoning by nested conditioning: Modeling theory of mind with probabilistic programs. Cognitive Systems Research, 28:80–99, 2014.
- D. Tolpin, J.-W. van de Meent, and F. Wood. Probabilistic programming in Anglican. Springer, 2015.
- D. Tolpin, J.-W. van de Meent, H. Yang, and F. Wood. Design and implementation of probabilistic programming language Anglican. In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Languages. ACM, 2016.
- F. Wood, J. W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In AISTATS, pages 2–46, 2014.