Quantifying the probable approximation error of probabilistic inference programs
This paper introduces a new technique for quantifying the approximation error of a broad class of probabilistic inference programs, including ones based on both variational and Monte Carlo approaches. The key idea is to derive a subjective bound on the symmetrized KL divergence between the distribution achieved by an approximate inference program and its true target distribution. The bound’s validity (and subjectivity) rests on the accuracy of two auxiliary probabilistic programs: (i) a “reference” inference program that defines a gold standard of accuracy and (ii) a “meta-inference” program that answers the question “what internal random choices did the original approximate inference program probably make given that it produced a particular result?” The paper includes empirical results on inference problems drawn from linear regression, Dirichlet process mixture modeling, HMMs, and Bayesian networks. The experiments show that the technique is robust to the quality of the reference inference program and that it can detect implementation bugs that are not apparent from predictive performance.
A key challenge for practitioners of probabilistic modeling is the approximation error introduced by variational and Monte Carlo inference techniques. The Kullback-Leibler (KL) divergence [cover2012elements] between the result of approximate inference — i.e. the variational approximation, or the distribution induced by one run of the sampler — and the true target distribution is typically unknown. Predictive performance on a held-out test set is sometimes used as a proxy, but this need not track posterior convergence.
This paper introduces a new technique for quantifying the approximation error of a broad class of probabilistic inference programs, including variational and Monte Carlo approaches. The key idea is to derive a “subjective” bound on the symmetrized KL divergence between the distribution achieved by an approximate inference program and its true target distribution. The bound’s validity (and subjectivity) rests on beliefs about the accuracy of auxiliary probabilistic program(s). The first is a “reference” inference program that defines a gold standard of accuracy but that might be difficult to compute. When the original approximate inference program has a tractable output probability density, this is sufficient. If the output density of the approximate inference program is not available, then the technique also depends on the accuracy of a “meta-inference” program that answers the question “what internal random choices did the approximate inference program of interest probably make, assuming that it produced a particular result that was actually produced by the reference?” In Section 3.3 we relate this technique to some recent work.
The technique is implemented as a probabilistic meta-program for the Venture probabilistic programming platform [Mansinghka2014], written in the VentureScript language. The paper includes empirical results on inference problems drawn from linear regression, Dirichlet process mixture modeling, HMMs, and Bayesian networks. The experiments show that the technique is robust to the quality of the reference inference program and that it can detect implementation bugs that are not apparent from predictive performance.
2 Estimating subjective divergences
Kullback-Leibler (KL) divergences between an inference program’s approximating distribution and its target distribution are objective model-independent measures of the approximation error. However, tractable techniques for estimating KL divergences for approximate inference are lacking. This paper defines a quantity, subjective divergence, in terms of the following elements:
Model program : Samples latent variables and data for probabilistic model .
Data : A specific dataset which induces the posterior distribution .
Approximate inference program : Samples output from , which approximates , and also returns the history of the inference program execution that generated . An approximate inference program induces a weight function .
Reference inference program : Gold standard sampler that approximates the posterior . If the reference inference program is exact, so that for all , or equivalently , we call it an oracle.
Inference output marginal density estimators and : Estimators of marginal density for inference program output such that and for all and . and denote random variables. A realized estimate of induces a realized weight estimate. In all cases, expectations without a subscript are with respect to or .
Definition (Subjective Divergence)
If an oracle reference program is used (where for all ), then
This proposition is proven in Section 3. To construct inference output marginal density estimators and , we make use of a meta-inference program , which samples inference program execution history from an approximation to the conditional distribution given inference program output , such that the ratio of densities can be efficiently computed given , , and . The baseline estimator samples and produces a single sample importance sampling estimate: . The baseline estimator obtains from the history of the inference program execution that generated , and produces a single sample harmonic mean estimate: . A procedure for estimating subjective divergence using these baseline meta-inference based estimators is shown in Algorithm 1.
We have produced a VentureScript inference programming library that implements Algorithm 1. In our applications, the weight estimate computation can be performed incrementally within the inference program and meta-inference program, obviating the need for an explicit representation of inference program execution history and the separate weight computation illustrated in Figure 1.
The subjective divergence is based on estimating the symmetrized KL divergence in order to handle the fact that the posterior density is only available in unnormalized form (the symmetrized KL can be expressed purely in terms of unnormalized densities). We use a reference sampler as a proxy for a posterior sampler (accepting subjectivity) to address the challenge of Monte Carlo estimation with respect to the posterior for the term in the symmetrized KL. For inference programs with an output density that can be computed efficiently, such as mean-field variational families, the weight estimate in Algorithm 1 can be replaced with the true weight , and the subjective divergence is equivalent to the symmetrized KL divergence when an oracle reference is used. For inference programs with a large number of internal random choices , the densities on outputs are intractable to compute, and Algorithm 1 uses meta-inference to construct marginal density estimators and such that Proposition 2 holds. Subjective divergence can be interpreted as approximately comparing samples from the inference program of interest to gold standard samples through the lens of the log-weight function .
3 Analyzing subjective divergences
Having defined subjective divergence and a procedure for estimating it, we now prove Prop. 2 using bounds on the expected log estimated weight taken under the inference program of interest () and the expected log estimated weight taken under the reference program (). The expectation under the inference program of interest is less than the log normalizing constant by at least :
(derivation based on Jensen’s inequality in Appendix C). Note that this constitutes a lower bound on the “ELBO” variational objective. The expectation under an oracle reference program is greater than the log normalizing constant by at least :
(derivation based on Jensen’s inequality in Appendix C). The subjective divergence is the difference between the expectation under the reference program (bounded in Lemma 3) and the expectation under the inference program of interest (bounded in Lemma 3). Taking the difference of the bounds cancels the terms and proves Proposition 2. Relationships between key quantities in the proof are illustrated in Figure 1(c). By Proposition 2, if an oracle reference program is available, we can estimate an upper bound on the symmetrized KL divergence by estimating a subjective divergence.
3.1 Effect of quality of reference inference program
If the reference inference program is not an oracle, it is still possible to retain the validity of subjective divergence as an upper bound, depending on the accuracy of the reference program:
3.2 Effect of quality of meta-inference program
In the setting of an oracle reference program and the baseline marginal density estimators and used in Algorithm 1, the quality of the meta-inference determines the tightness of the upper bound of Proposition 2. In particular, the gap between the true symmetrized KL divergence and the subjective divergence is the symmetrized conditional relative entropy [cover2012elements] (derivation in Appendix D)
which measures how closely the meta-inference approximates , the conditional distribution on inference execution histories given inference output . Note that if we had exact meta-inference and could compute its density, the weight estimate simplifies to , and we could remove this gap. More generally, the gap is due to the biases of the estimators for that are induced by taking the of the estimates of produced by and , which are related to the variances of and . For example, compare the variance of the baseline with the bias of the induced estimator of :
where is the Pearson chi-square divergence [nielsen2013chi]. The bias of the estimator of manifests in the second term in Equation 1. 111Improving upon the baseline inference output marginal density estimators and reducing the gap between subjective divergence and symmetric KL divergence seems a promising direction for future work. See Appendix D for details.
3.3 Related work
In [grosse2015sandwiching] the authors point out that unbiased estimators like and unbiased reciprocal estimators like estimate lower and upper bounds of the log-estimand respectively, which they use to estimate lower and upper bounds on . [grosse2015sandwiching] also suggests combining stochastic upper bounds on , obtained by running reversed versions of sequential Monte Carlo (SMC) algorithms starting with an exact sample obtained when simulating data from the model, with lower bounds on the ELBO, to upper bound KL divergences. The authors of [DBLP:conf/icml/SalimansKW15] introduce a general auxiliary variable formalism for estimating lower bounds on the ELBO of Markov chain inference, which is equivalent to estimation of our expected log estimated weight under the inference program for the baseline estimator applied to Markov chains.
We used the VentureScript implementation of Algorithm 1 to estimate subjective divergence profiles for diverse approximate inference programs applied to several probabilistic models.
In addition to applying the technique to mean-field variational inference, where the output density is available, we derived meta-inference programs for two classes of inference programs whose density is generally intractable: sequential inference utilizing a Markov chain of detailed-balance transition operators and particle filtering in state space models. For sequential inference, we use a coarse-grained representation of the inference execution history that suppresses internal random choices made within segments of the Markov chain that satisfy detailed balance with respect to a single distribution. The meta-inference program is also sequential detailed-balance inference, but with the order of the transition operators reversed. This reversed Markov chain is an instance of the formalism of [DBLP:conf/icml/SalimansKW15], was used to construct annealed importance sampling (AIS) [Neal2001], and was sampled from in [burda2014accurate] and [grosse2015sandwiching]. The weight estimate corresponds to the AIS marginal likelihood estimate. The subjective divergence for standard non-sequential MCMC can be analyzed using this construction, but results in a trivial upper bound on the KL divergence due to the failure of the approximating assumptions used to derive the meta-inference program. For particle filtering in state space models, we use the conditional SMC (CSMC) update [andrieu2010particle] and the weight estimate is the marginal likelihood estimate of the particle filter. It is intuitive that we use CSMC to answer “how might have a particle filter produced a given particle?” A special case of the particle filter is sampling importance resampling, for which the meta-inference program (shown in Figure 0(e)) places the output sample in one of particles, and samples the remaining particles from the prior. See Appendix E for derivations.
4.1 Linear regression
We first considered a small Bayesian linear regression problem, with unknown intercept and slope latent variables (model program shown in Figure 0(d)), and generated subjective divergence profiles for sampling-based and variational inference programs (shown in Figure 0(b) and Figure 0(c)) using an oracle reference. We estimated profiles for two black box mean-field [DBLP:conf/aistats/RanganathGB14] programs which differed in their choice of variational family—each family had a different fixed variance for the latents. We varied the number of iterations of stochastic gradient descent to generate the profiles, which exhibited distinct nonzero asymptotes. We also estimated profiles for two sequential inference programs that consist of alternating between observing an additional data point and running a transition operator that targets the partial posterior for with data points. One program used repeated application of Metropolis-Hastings (MH) transitions with a resimulation (prior) proposal within each and the other used applications of a random-walk MH transition. We varied the number of applications within each of the primitive MH transition operator. The profile based on resimulation MH converged more rapidly. Finally, we produced a subjective divergence profile for a likelihood-weighting sampling importance resampling (LW-SIR) inference program by varying the number of particles. LW-SIR was the only algorithm applied to this problem whose subjective divergence profile converged to zero. 222The profiles for the sequential detailed balance inference scheme converge to the sum of symmetrized KL divergences between consecutive partial posteriors . See Appendix E.2 for details.
4.2 Bayesian networks
We estimated subjective divergence profiles for approximate inference programs applied to a noisy-or Bayesian network (subset shown in Figure 3(a)). The network contained 25 latent causes, and 35 findings, with prior cause probabilities of 0.001, transmission probabilities of 0.9, and spontaneous finding activation probabilities of 0.001, with edges sampled uniformly with probability 0.7 of presence. All findings were active. We compared four sequential inference programs that all advanced through the same sequence of target distributions defined by gradually lowering the finding spontaneous activation probability from 0.99 to the true model value 0.001 across 10 equal-length steps, but applied distinct types of transition operators at each step. We compared the use of a single-site resimulation MH operator, a block resimulation MH operator, single-site Gibbs operator, and block Gibbs operator as primitive operators within each for , and varied the number of applications of each primitive operator within each to generate the profiles, shown in Figure 3(c). For the reference program we used sequential inference with four applications of block Gibbs between each target distribution step. Inference for this problem is hard for single-site Gibbs operators due to explaining away effects, and hard for resimulation-based operators due to the low probability of the data under the prior. The resimulation MH based profiles exhibited much slower convergence than those of the Gibbs operators.
4.3 Hidden Markov models
We next applied the technique to a hidden Markov model (HMM) with discrete state and observation space (40 time steps, 2 hidden states, 3 observation states), and produced subjective divergence profiles for two particle filter inference programs with prior (forward simulation) and conditional proposals. Both particle filters used independent resampling. We used exact forward-filtering backwards sampling for the reference inference program. The profiles with respect to the number of particles are shown in Figure 3(d). The conditional proposal profile exhibits faster convergence as expected. Note that for the single particle case there are no latent random choices in these the particle filters, and the subjective divergence is the symmetrized KL divergence.
4.4 Detecting an ergodicity violation in samplers for Dirichlet process mixture modeling
We estimated subjective divergence profiles (Figure 4(a)) for sequential inference programs in an uncollapsed Dirichlet process mixture model (DPMM) with data points simulated from the model program, with partial posteriors for for the sequence of target distributions. For the reference, we used a relatively trusted sequential inference program based on Venture’s built-in single-site resimulation MH implementation. We estimated subjective divergence profiles for inference based on the single-site resimulation MH operator and for inference based on a cycle operator consisting of single-site Gibbs steps for the latent cluster assignments, and resimulation MH for global parameters. The subjective divergence of the Gibbs/MH operator exhibited anomalous behavior, and degraded with additional inference, quickly becoming worse than the resimulation MH operator. This led us to identify a bug in our Gibbs/MH operator in which no inference was being performed on the within-cluster variance parameter. The profile for the corrected operator exhibited markedly faster convergence than the resimulation MH profile. For comparison, we estimated the expected log likelihood for output samples produced at the termination of these inference programs. The expected log likelihood profile (Figure 4(b)) for the Gibbs/MH operator with a bug was significantly higher (better) than the profile for resimulation MH, despite being significantly poorer than the profile in the corrected version. Note that unlike the subjective divergence profiles, the expected log likelihood profiles for the operator with a bug may not have seemed anomalous.
This paper introduced a new technique for quantifying the approximation error of a broad class of probabilistic inference programs. The key ideas are (i) to assess error relative to subjective beliefs in the quality of a reference inference program, (ii) to use symmetrized divergences, and (iii) to use a meta-inference program that finds probable executions of the original inference program if its output density cannot be directly assessed. The approach is implemented as a probabilistic meta-program in VentureScript that uses ancillary probabilistic meta-programs for the reference and meta-inference schemes.
Much more empirical and theoretical development is needed. Specific directions include better characterizing the impact of reference and meta-inference quality and identifying the contexts in which the theoretical bounds are predictably tight or loose. Applying the technique to a broad corpus of VentureScript programs seems like a useful first step. Empirically studying the behavior of subjective divergence for a broader sample of buggy inference programs also will be informative.
It also will be important to connect the approach to results from theoretical computer science, including the computability [ackerman2010computability] and complexity [freer2010probabilistic] of probabilistic inference. For example, the asymptotic scaling of probabilistic program runtime can be analyzed using the standard random access memory model [thomas2001introduction] under suitable assumptions about the implementation. This includes the model program; the inference program; the reference program; the meta-inference program; and the probabilistic meta-program implementing Algorithm 1. It should thus be possible to align the computational tractability of approximate inference of varying qualities with standard results from algorithmic and computational complexity theory, by combining such an asymptotic analysis with a careful treatment of the variances of all internal Monte Carlo estimators.
This technique opens up other new research opportunities. For example, it may be possible to predict the probable performance of approximate inference by building probabilistic models that use characteristics of problem instances to predict subjective divergences. It may also be possible to use the technique to justify inference heuristics such as [obermeyer2014scaling] and [balakrishnan2006one], and the stochastic Bayesian relaxations from [DBLP:journals/corr/MansinghkaSJPGT15], [mansinghka2013approximate]. Finally, it seems fruitful to use the technique to study the query sensitivity of approximate inference [russell_personal_communication].
Practitioners of probabilistic modeling and inference are all too familiar with the difficulties that come with dependence on approximation algorithms, especially stochastic ones. Diagnosing the convergence of sampling schemes is known to be difficult in theory [diaconis2009markov] and in practice [Cowles1996]. Many practitioners respond by restricting the class of models and queries they will consider. The definition of “tractable” is sometimes even taken to be synonymous with “admits polynomial time algorithms for exactly calculating marginal probabilities”, as in [poon2011sum]. Probabilistic programming throws these difficulties into sharp relief, by making it easy to explore an unbounded space of possible models, queries, and inference strategies. Hardly any probabilistic inference programs come with certificates that they give exact answers in polynomial time.
It is understandable that many practitioners are wary of expressive probabilistic languages. The techniques in this paper make it possible to pursue an alternative approach: use expressive languages for modeling and potentially even also stochastic inference strategies, but also build quantitative models of the time-accuracy profiles of approximate inference, in practice, from empirical data. This is an inherently subjective process, involving qualitative and quantitative assumptions at the meta-level. However, we note that probabilistic programming can potentially help manage this meta-modeling process, providing new probabilistic—or in some sense meta-probabilistic—tools for studying the probable convergence profiles of probabilistic inference programs.
The authors would like to thank Ulrich Schaechtle and Anthony Lu for testing the technique, and David Wingate, Alexey Radul, Feras Saad, and Taylor Campbell for helpful feedback and discussions. This research was supported by DARPA (PPAML program, contract number FA8750-14-2-0004), IARPA (under research contract 2015-15061000003), the Office of Naval Research (under research contract N000141310333), the Army Research Office (under agreement number W911NF-13-1-0212), and gifts from Analog Devices and Google. \printbibliography
Appendix A Basic notation
The notation is used to denote the distribution of a random variable, as well the corresponding probability density function, and we rely on the context to disambiguate between the two. In particular, the KL divergence from probability distribution to probability distribution is denoted :
where the and inside the expectation are density functions which take values as input, and indicates a random variable with distribution . Throughout, when comparing two distributions and we assume that they have equal support (). The symmetrized KL divergence between and is
In this appendix, we use the shorthand for , and for when there is no ambiguity as to the distributions represented by and .
Appendix B Deriving the subjective divergence
This section provides a pedagogical derivation of subjective divergence. Suppose we seek to estimate the KL divergence between two distributions and (in this section we do not initially assume these to be approximate inference or posterior distributions in particular). We walk through a motivating derivation of the subjective divergence as an approach to this problem.
b.1 Monte Carlo estimation
Suppose we can sample from and , and that normalized densities of and are available. Then, we can estimate either direction of KL divergence using simple Monte Carlo, e.g.:
where . The accuracy of the estimates is determined by the variance in the log weight () and .
b.2 Symmetrized KL divergence
Suppose now that only unnormalized densities and can be computed with unknown normalizing constants and , but that we can still sample from and . Then the two directions of KL divergence are:
Suppose we can accurately estimate the expectation terms for both of these quantities using simple Monte Carlo, but that estimating the terms and is more difficult.
Consider the direction . Estimating only the expectation term allows us to estimate differences in KL divergence or if the normalizing constants and are the same. The ‘evidence lower bound’ (ELBO) optimized in variational inference is such an expectation, in which often . The ELBO is used to guide a search or optimization process over a space of to minimize . However, not knowing the normalizing constant prevents us from estimating the KL divergence itself.
Note that in the symmetrized KL divergence, the terms containing the normalizing constants cancel, and we are left with:
where we define the unnormalized weight function as . Suppose we use a simple Monte Carlo estimator for each of the two expectations in the above expression of the symmetric KL divergence by sampling from and respectively, and take the difference in estimates. This can be interpreted as comparing samples from against samples from by projecting them through the log-weight function onto .
b.3 Non-oracle reference inference program
We now refine the setting to more closely match the approximate inference setting, in which it is relatively easy to sample from , and difficult to sample from . Specifically, we assume that the term is relatively easier to estimate than . This is often the case, for example, if is a posterior distribution and is the approximating distribution of a typical inference program. We consider using samples from a proxy instead of samples from , for which is more efficient to sample from than itself, but otherwise using the original weight function which is defined in terms of and . Instead of the symmetric KL divergence between and we are then estimating:
The difference between our expectation and the true symmetrized KL is:
b.4 Inference program output marginal density estimators
We now handle the setting in which the density is not available, even up to a normalizing constant, due to the presence of internal random choices involved in sampling from :
where is high-dimensional. We take to be an inference program, and we refer to as an inference execution history. Note that unlike in the main text, the dependence on the data set is omitted in the notation of this section. When and are jointly sampled from by first sampling followed by , is the history of the inference program execution that generated . Consider the symmetrized KL divergence:
We will construct a Monte Carlo estimate of the symmetrized KL divergence that uses estimators , which are potentially stochastic given , instead of the true densities :
for for and for . The expectation of the estimate is:
where the inner expectations are with respect to the distributions of the random variables conditioned on . We want the expectation of our estimate to be an upper bound on the true symmetrized KL divergence. To enforce this, we choose distinct estimators for , denoted and respectively, for use with the samples and for use with the samples such that the following two conditions hold:
This will ensure that the expectation of our estimate is greater than the symmetrized KL divergence. To achieve this, we require that:
This is equivalent to the requirement that:
There are potentially many choices for and that satisfy these conditions. To construct the baseline estimators we assume that we can efficiently compute the joint density . For the estimator we use an importance sampling estimator with importance distribution where
satisfies the unbiasedness condition of Equation 27 for . To construct the estimator we note that
We define as a harmonic mean estimator:
which satisfies the unbiased reciprocal condition of Equation 28 for . Algorithm 1 uses for both and , and obtains the sample of inference program execution history from the joint sample that generated . Note that only one such sample is immediately available for each , although we could conceivably start a Markov chain at the exact sample with as its stationary distribution to obtain more samples marginally distributed according to . Using more sophisticated versions of and is left for future work. Note that for the single-particle baseline estimators and an oracle reference, the sole determiner of the gap between the subjective divergence and the symmetrized KL is the quality of the distribution as an approximation to . We refer to as the meta-inference distribution.
Appendix C Proofs
For such that for all ,
|Factoring out the normalizing constant using :|
|Linearity of expectation:|
|The normalizing constant is a constant:|
|Linearity of expectation:|
|Conditioned on , is a constant:|