Discrete-Continuous Mixtures in Probabilistic Programming: Generalized Semantics and Inference Algorithms

Yi Wu    Siddharth Srivastava    Nicholas Hay    Simon Du    Stuart Russell

Abstract

Despite the recent successes of probabilistic programming languages (PPLs) in AI applications, PPLs offer only limited support for discrete-continuous mixture random variables. We develop the notion of measure-theoretic Bayesian networks (MTBNs), and use it to provide more general semantics for PPLs with arbitrarily many random variables defined over arbitrary measure spaces. We develop two new general sampling algorithms which are provably correct under the MTBN framework: lexicographic likelihood weighting (LLW) for general MTBNs and the lexicographic particle filter (LPF), a specialized algorithm for state space models. We further integrate MTBNs into a widely used PPL system, BLOG, and verify the effectiveness of our new inference algorithms through representative examples.

Keywords: Probabilistic Programming Language, Discrete-Continuous Mixtures, Importance Sampling

1 Introduction

As originally defined by Pearl (1988), Bayesian networks express joint distributions over finite sets of random variables as products of conditional distributions. Probabilistic programming languages (PPLs) (Koller et al., 1997; Milch et al., 2005a; Goodman et al., 2008; Wood et al., 2014b) apply the same idea to potentially infinite sets of variables with general dependency structures. Thanks to their expressive power, PPLs have been used in many real-world applications, including Captcha solving (Le et al., 2017), seismic monitoring (Moore & Russell, 2017), 3D pose estimation (Kulkarni et al., 2015), generating design suggestions (Ritchie et al., 2015), concept learning (Lake et al., 2015) and cognitive science applications (Stuhlmüller & Goodman, 2014).

A major drawback of existing PPLs is that they support only discrete and continuous random variables but not their mixtures. Yet combinations of discrete and continuous distributions are ubiquitous in practical applications: sensors with thresholded limits, e.g., thermometers, weighing scales, speedometers, and pressure gauges; or a hybrid sensor that can report either a real value or an error condition. Random variables of this kind have also been studied in many other applications across a wide range of scientific domains (Kharchenko et al., 2014; Pierson & Yau, 2015; Gao et al., 2017).

Many PPLs have a restricted syntax that forces the expressed random variables to be either discrete or continuous (Goodman & Stuhlmüller, 2014; Tran et al., 2016; Pfeffer, 2009; Carpenter et al., 2016). Even for those PPLs whose syntax supports mixtures of discrete and continuous variables (Milch et al., 2005a; Goodman, 2013; Mansinghka et al., 2014; Wood et al., 2014a), the underlying semantics implicitly assumes that the random variables are not mixtures. Moreover, the inference algorithms associated with these semantics inherit the same assumption. If one defines a discrete-continuous mixture random variable in such a PPL and directly uses its default inference algorithm, the result can be wrong.

Consider the following GPA example: a two-variable Bayes net where the nationality follows a binary distribution

and the conditional probabilities are discrete-continuous mixtures

This is a typical scenario in practice because many top students have perfect GPAs. Now suppose we observe a student with a GPA of 4.0. Where do they come from? If the student is Indian, the probability of any singleton set $\{g\}$ with $0 < g < 10$ is zero, as this range has a probability density. On the other hand, if the student is American, the set $\{4\}$ has strictly positive probability. Thus, by Bayes' theorem, $P(\text{Nationality} = \text{USA} \mid \text{GPA} = 4) = 1$, which means the student must be from the USA.

However, if we run the default Bayesian inference algorithm for this problem in existing PPLs, e.g., the standard importance sampling algorithm (Milch et al., 2005b), a sample that picks India receives a weight equal to a probability density, whereas one that picks USA receives a weight equal to a discrete probability mass. Since the algorithm does not distinguish probability density from probability mass, it will conclude that the student is very probably from India, which is far from the truth.
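The failure mode can be reproduced in a few lines of Python. This is only a sketch under assumed numbers that are not taken from the text: a prior of 0.5 per nationality, a point mass `P_PERFECT` on the perfect GPA, and the remaining probability spread uniformly below it. The names `naive_weight` and `naive_lw_posterior_usa` are hypothetical.

```python
import random

# Illustrative parameters (assumptions, not taken from the text):
# each nationality has prior 0.5; a perfect GPA has point mass P_PERFECT;
# the remaining probability is spread uniformly over (0, max GPA).
P_PERFECT = 0.01
MAX_GPA = {"USA": 4.0, "India": 10.0}


def naive_weight(nationality, gpa):
    """Naive likelihood weight: a mass if there is one, otherwise a density."""
    top = MAX_GPA[nationality]
    if gpa == top:
        return P_PERFECT                  # a probability mass
    if 0.0 < gpa < top:
        return (1.0 - P_PERFECT) / top    # a probability density
    return 0.0


def naive_lw_posterior_usa(gpa, num_samples=10_000, seed=0):
    """Estimate P(USA | GPA = gpa) while mixing masses and densities freely."""
    rng = random.Random(seed)
    total = usa = 0.0
    for _ in range(num_samples):
        nationality = rng.choice(["USA", "India"])
        w = naive_weight(nationality, gpa)
        total += w
        if nationality == "USA":
            usa += w
    return usa / total


estimate = naive_lw_posterior_usa(4.0)
print(f"naive estimate of P(USA | GPA = 4): {estimate:.3f}")
```

Under these assumed numbers the naive estimate stays far below one half, although the exact posterior probability of USA is 1.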

We can fix the GPA example by considering a density weight to be infinitely smaller than a discrete-mass weight (Nitti et al., 2016; Tolpin et al., 2016). However, the situation becomes more complicated when more than one evidence variable is involved, e.g., GPAs over multiple semesters for students who may study in both countries. Vector-valued variables also cause problems: does a point mass in three dimensions count more or less than a point mass in two dimensions? These practical issues require PPL researchers to accomplish the following two tasks.

  • Inherit all the properties of existing PPL semantics and extend them to handle general discrete-continuous mixtures. (Footnote 1: these properties are (1) random variables with infinitely, even uncountably, many parents; (2) establishment of conditional independencies implied by an infinite graph; and (3) open-universe semantics in terms of the possible worlds in the vocabulary of the model.)

  • Design provably correct inference algorithms for the extended semantics.

In this paper, we provide a general framework for the extended PPL semantics and propose new algorithms based on this framework to handle mixtures of discrete and continuous variables. Notably, our generalized framework can also be applied to other general measure spaces beyond discrete-continuous mixtures.

1.1 Main Contributions

Measure-Theoretic Bayesian Nets

Measure theory can be applied to handle discrete-continuous mixtures or even more abstract measures. In this paper, we bring the measure-theoretic machinery to Bayesian nets and propose measure-theoretic Bayesian nets (MTBNs). We give formal definitions of MTBNs and, more importantly, we prove that every MTBN represents a unique measure on the input space. This gives a sound theoretical foundation, on which we use MTBNs to provide generalized semantics for any PPL.

Inference Algorithms

Unlike purely discrete or purely continuous random variables, discrete-continuous mixtures often require a user to incorporate specialized, model-specific tricks on top of classical PPL inference algorithms. Based on MTBNs, we propose a general and provably correct inference algorithm, lexicographic likelihood weighting (LLW). In addition, we adapt LLW to the sequential Monte Carlo (SMC) framework for state space models, yielding the lexicographic particle filter (LPF).

Incorporating MTBN into an existing PPL

We incorporate MTBNs into a widely used PPL, Bayesian Logic (BLOG) (Milch et al., 2005a). With simple modifications, we define the generalized BLOG language, measure-theoretic BLOG, which formally supports arbitrary distributions, including discrete-continuous mixtures. We prove that every generalized BLOG model corresponds to a unique MTBN. Thus, all the desired theoretical properties of MTBNs carry over to measure-theoretic BLOG. We also implement the LLW and LPF algorithms in the backend of measure-theoretic BLOG and use three representative examples to show their effectiveness.

1.2 Organization

This paper is organized as follows. We first discuss related work in Section 2. In Section 3, we formally define measure-theoretic Bayesian nets and study their theoretical properties. In Section 4, we propose a general inference algorithm, lexicographic likelihood weighting (LLW), for general MTBNs, and its variant, the lexicographic particle filter (LPF), which is specialized for state space models. In Section 5, we introduce the measure-theoretic extension of BLOG and study its theoretical foundations for defining probabilistic models. In Section 6, we empirically validate the generalized BLOG system and the new inference algorithms on three representative examples. We conclude in Section 7 and defer most technical details to the appendix.

2 Related Work

The motivating GPA example has also been discussed as a special case in other PPL systems, such as those of Tolpin et al. (2016) and Nitti et al. (2016). These systems propose specialized tricks that derive the correct answer for this particular example but do not provide any general correctness guarantees. In contrast, our MTBN framework is much more general and our LLW inference algorithm is provably correct.

The closest related work to our framework is by Milch (2006), who utilizes a supportive numbering of random variables, implying that each random variable has finitely many consistent parents. In addition, that work only handles random variables with countably infinite ranges. The BLP framework presented by Kersting & De Raedt (2007) unifies logic programming with probability models, but requires each random variable to be influenced by a finite set of random variables in order to define the semantics. This amounts to requiring only finitely many ancestors of each node. Choi et al. (2010) present an algorithm for carrying out lifted inference over models with purely continuous random variables. They also require parfactors to be functions over finitely many random variables, thus limiting the set of influencing variables for each node to be finite. Gutmann et al. (2011a) also define densities over finite-dimensional vectors. In a relatively more general formulation, Gutmann et al. (2011b) define the distribution of each random variable using a definite clause, which corresponds to the limitation that each random variable (either discrete or continuous) has finitely many parents. Frameworks building on Markov networks have similar restrictions. Wang & Domingos (2008) only consider networks of finitely many random variables, which can have either discrete or continuous distributions. Singla & Domingos (2007) extend Markov logic to infinite (non-hybrid) domains, provided that each random variable has only finitely many influencing random variables.

In contrast, our approach not only allows models with arbitrarily many random variables with mixed discrete and continuous distributions, but each random variable can also have arbitrarily many parents as long as all ancestor chains are finite (but unbounded). The presented work constitutes a rigorous framework for expressing probability models with the broadest range of cardinalities (uncountably infinite parent sets) and nature of random variables (discrete, mixed, and even arbitrary measure spaces), with clear semantics in terms of first-order possible worlds and the generalization of conditional independences on such models.

Lastly, there are also works analyzing the theoretical properties of PPL semantics from measure theoretical perspectives (Shan & Ramsey, 2017; Staton, 2017; Ramsey, 2016).

3 Measure-Theoretic Bayesian Networks

In this section, we introduce measure-theoretic Bayesian nets (MTBNs) and prove that an MTBN represents a unique measure with the desired theoretical properties. We assume familiarity with measure-theoretic approaches to probability theory. Some background is included in Appx. A.

We begin with some necessary definitions of graph theory.

Definition 3.1.

A digraph $G = (V, E)$ is a pair of a set of vertices $V$, of any cardinality, and a set of directed edges $E \subseteq V \times V$. Write $u \to v$ if $(u, v) \in E$, and $u \to^* v$ if there is a path from $u$ to $v$ in $G$.

Definition 3.2.

A vertex $v \in V$ is a root vertex if there are no incoming edges to it, i.e., there is no $u \in V$ such that $u \to v$. Let $\mathrm{Pa}(v) = \{u \in V : u \to v\}$ denote the set of parents of a vertex $v$, and $\mathrm{Nd}(v)$ denote its set of non-descendants.

Definition 3.3.

A well-founded digraph $(V, E)$ is one with no countably infinite ancestor chain $v_0 \leftarrow v_1 \leftarrow v_2 \leftarrow \cdots$.

This is the natural generalization of a finite directed acyclic graph to the infinite case. Now we are ready to give the key definition of this paper.
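For a finite digraph, having no infinite ancestor chain is the same as being acyclic, so well-foundedness can be checked with a standard topological sort. The sketch below (Kahn's algorithm) only covers this finite special case; the function names and the `(parent, child)` edge encoding are assumptions made for illustration.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return a parents-first ordering of a finite digraph, or None if it has a cycle.

    `vertices` is a list and `edges` a collection of (parent, child) pairs.
    """
    children = {v: [] for v in vertices}
    num_parents = {v: 0 for v in vertices}
    for parent, child in edges:
        children[parent].append(child)
        num_parents[child] += 1

    ready = deque(v for v in vertices if num_parents[v] == 0)  # root vertices
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for c in children[v]:
            num_parents[c] -= 1
            if num_parents[c] == 0:
                ready.append(c)
    # Vertices on a cycle never reach zero remaining parents.
    return order if len(order) == len(vertices) else None


def is_well_founded(vertices, edges):
    """Finite case only: well-founded if and only if acyclic."""
    return topological_order(vertices, edges) is not None


print(is_well_founded(["Nationality", "GPA"], [("Nationality", "GPA")]))
```

The ordering returned by `topological_order` is also the order in which a forward sampler would visit the variables.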

Definition 3.4.

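A measure-theoretic Bayesian network $M = (V, E, \{\mathcal{X}_v\}_{v \in V}, \{K_v\}_{v \in V})$ consists of (a) a well-founded digraph $(V, E)$ of any cardinality, (b) an arbitrary measurable space $\mathcal{X}_v$ for each $v \in V$, and (c) a probability kernel $K_v$ from $\prod_{u \in \mathrm{Pa}(v)} \mathcal{X}_u$ to $\mathcal{X}_v$ for each $v \in V$.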

By definition, MTBNs allow us to define very general and abstract models with the following two major benefits:

  1. We can define random variables with infinitely (even uncountably) many parents because an MTBN is defined on a well-founded digraph.

  2. We can define random variables in arbitrary measure spaces (with $\mathbb{R}^n$ as one case) distributed according to any measure (including discrete, continuous and mixed).

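Next, we relate an MTBN to a probability measure. Fix an MTBN $M = (V, E, \{\mathcal{X}_v\}_{v \in V}, \{K_v\}_{v \in V})$. For $U \subseteq V$ let $\mathcal{X}_U = \prod_{u \in U} \mathcal{X}_u$ be the product measurable space over variables $u \in U$. With this notation, $K_v$ is a kernel from $\mathcal{X}_{\mathrm{Pa}(v)}$ to $\mathcal{X}_v$. Whenever $W \subseteq U$ let $\pi^U_W : \mathcal{X}_U \to \mathcal{X}_W$ denote the projection map. Let $\mathcal{X}_V$ be our base measurable space upon which we will consider different probability measures $\mu$. Let $X_v$ for $v \in V$ denote both the underlying set of $\mathcal{X}_v$ and the random variable given by the projection $\pi^V_{\{v\}}$, and $X_U$ for $U \subseteq V$ the underlying space of $\mathcal{X}_U$ and the random variable given by the projection $\pi^V_U$.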

Definition 3.5.

An MTBN $M$ represents a measure $\mu$ on $\mathcal{X}_V$ if for all $v \in V$:

  • $X_v$ is conditionally independent of its non-descendants $X_{\mathrm{Nd}(v)}$ given its parents $X_{\mathrm{Pa}(v)}$.

  • $K_v(X_{\mathrm{Pa}(v)}, A) = P(X_v \in A \mid X_{\mathrm{Pa}(v)})$ holds almost surely for any measurable set $A$ in $\mathcal{X}_v$, i.e., $K_v$ is a version of the conditional distribution of $X_v$ given its parents.

Def. 3.5 captures the generalization of the local properties of Bayes nets: conditional independence and conditional distributions defined by parent-child relationships. Here we implicitly assume that the conditional probability exists and is unique. This is a mild condition because it holds as long as the probability space is regular (Kallenberg, 2002).

The next theorem shows our definition of MTBN is proper.

Theorem 3.6.

An MTBN $M$ represents a unique measure $\mu$ on $\mathcal{X}_V$.

Theorem 3.6 lays the foundation of MTBNs. Its proof requires a series of intermediate results. We first define a projective family of measures, which gives a way to recursively construct our measure $\mu$. We then define a notion of consistency such that every consistent projective family constructs a measure that $M$ represents. Lastly, we give an explicit characterization of the unique consistent projective family, and thus of the unique measure that $M$ represents. The full proof is in Appx. B.

4 Generalized Inference Algorithms

We introduce the lexicographic likelihood weighting (LLW) algorithm for provably correct inference on MTBNs. We also present the lexicographic particle filter (LPF) for state space models, obtained by adapting LLW to the SMC framework.

4.1 Lexicographic likelihood weighting

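Suppose we have an MTBN with finitely many random variables $X_1, \dots, X_N$, and that, without loss of generality, we observe real-valued random variables $X_1, \dots, X_M$ for $M < N$ as evidence. Suppose the distribution of $X_i$ given its parents $X_{\mathrm{Pa}(i)}$ is a mixture between a density $f_i(x_i \mid x_{\mathrm{Pa}(i)})$ with respect to the Lebesgue measure and a discrete distribution $F_i(x_i \mid x_{\mathrm{Pa}(i)})$, i.e., for any $\epsilon > 0$, we have

$$P(X_i \in [x_i - \epsilon, x_i + \epsilon] \mid x_{\mathrm{Pa}(i)}) = \sum_{x \in [x_i - \epsilon,\, x_i + \epsilon]} F_i(x \mid x_{\mathrm{Pa}(i)}) + \int_{x_i - \epsilon}^{x_i + \epsilon} f_i(x \mid x_{\mathrm{Pa}(i)})\, dx.$$

Note this implies $F_i(x \mid x_{\mathrm{Pa}(i)})$ is nonzero for at most countably many values $x$. If $F_i$ is nonzero for finitely many points, it can be represented by a list of those points and their values.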
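A mixture with finitely many atoms can be stored as a table of atoms plus a density function. The Python sketch below is one possible representation, not the one used in any particular PPL; the class name `MixedDistribution`, its methods, and the numbers in the usage example are all hypothetical.

```python
class MixedDistribution:
    """A real-valued distribution with finitely many atoms plus a density part.

    `atoms` maps points to their probability masses; `density` is the
    sub-probability density of the continuous part, so the total atom mass
    plus the integral of `density` should equal 1.
    """

    def __init__(self, atoms, density):
        self.atoms = dict(atoms)
        self.density = density

    def mass(self, x):
        """Discrete probability of the single point x (0.0 if x is not an atom)."""
        return self.atoms.get(x, 0.0)

    def pdf(self, x):
        """Value at x of the density of the continuous part."""
        return self.density(x)

    def interval_prob(self, lo, hi, steps=10_000):
        """P(lo <= X <= hi): atoms in the interval plus a midpoint-rule integral."""
        discrete = sum(p for x, p in self.atoms.items() if lo <= x <= hi)
        h = (hi - lo) / steps
        continuous = sum(self.density(lo + (i + 0.5) * h) for i in range(steps)) * h
        return discrete + continuous


# Assumed example: mass 0.01 at 4.0, the rest uniform on (0, 4).
gpa_usa = MixedDistribution({4.0: 0.01}, lambda x: 0.99 / 4 if 0 < x < 4 else 0.0)
print(gpa_usa.mass(4.0), gpa_usa.pdf(3.0), gpa_usa.interval_prob(3.9, 4.1))
```

The `interval_prob` method mirrors the interval identity stated above: a sum over atoms in the interval plus an integral of the density.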

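Lexicographic likelihood weighting (LLW) extends classical likelihood weighting (Milch et al., 2005b) to this setting. It visits each node of the graph in topological order, sampling those variables that are not observed, and accumulating a weight for those that are observed. In particular, at an evidence variable $X_i$ we update a tuple $(d, w)$ of the number of densities and a weight, initially $(0, 1)$, by:

$$(d, w) \leftarrow \begin{cases} (d,\; w \cdot F_i(x_i \mid x_{\mathrm{Pa}(i)})) & \text{if } F_i(x_i \mid x_{\mathrm{Pa}(i)}) > 0, \\ (d + 1,\; w \cdot f_i(x_i \mid x_{\mathrm{Pa}(i)})) & \text{otherwise.} \end{cases} \qquad (1)$$

Finally, having $K$ samples $x^{(1)}, \dots, x^{(K)}$ produced by this process and accordingly a tuple $(d^{(k)}, w^{(k)})$ for each sample $x^{(k)}$, let $d^* = \min_k d^{(k)}$ and estimate $\mathbb{E}[h(X) \mid X_{1:M}]$, for a query function $h$, by

$$\frac{\sum_{k : d^{(k)} = d^*} w^{(k)}\, h(x^{(k)})}{\sum_{k : d^{(k)} = d^*} w^{(k)}}. \qquad (2)$$

The algorithm is summarized in Alg. 1. The next theorem shows this procedure is consistent.

Theorem 4.1.

LLW is consistent: (2) converges almost surely to $\mathbb{E}[h(X) \mid X_{1:M}]$.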

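  Require: densities $f$, masses $F$, evidences $X_{1:M}$, and $K$.
  for $k = 1, \dots, K$ do
    $x^{(k)} \leftarrow$ sample all the ancestors of $X_{1:M}$ from prior
    compute $(d^{(k)}, w^{(k)})$ by Eq. (1)
  end for
  $d^* \leftarrow \min_k d^{(k)}$
  Return the estimate given by Eq. (2)
Algorithm 1 Lexicographic Likelihood Weighting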
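The steps of Alg. 1 can be sketched in Python on a GPA-style model. Everything numeric here is an assumption made for illustration (prior 0.5 per nationality, point mass `P_PERFECT` on the perfect GPA, uniform density below it), and ignoring zero-weight samples when taking the minimum density count is a safeguard of this sketch rather than something stated in the text.

```python
import random

P_PERFECT = 0.01                       # assumed point mass on a perfect GPA
MAX_GPA = {"USA": 4.0, "India": 10.0}  # assumed GPA scales


def mass(nationality, gpa):
    return P_PERFECT if gpa == MAX_GPA[nationality] else 0.0


def density(nationality, gpa):
    top = MAX_GPA[nationality]
    return (1.0 - P_PERFECT) / top if 0.0 < gpa < top else 0.0


def llw_update(d, w, m, p):
    """One evidence update: multiply by the mass if positive, else count a density."""
    if m > 0.0:
        return d, w * m
    return d + 1, w * p


def llw_estimate(samples):
    """samples: list of (d, w, value). Keep the minimal density count, then normalize."""
    positive = [s for s in samples if s[1] > 0.0]  # sketch-level safeguard
    if not positive:
        raise ValueError("all samples have zero weight")
    d_star = min(d for d, _, _ in positive)
    kept = [(w, value) for d, w, value in positive if d == d_star]
    return sum(w * value for w, value in kept) / sum(w for w, _ in kept)


def llw_posterior_usa(observed_gpas, num_samples=10_000, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        nationality = rng.choice(["USA", "India"])  # sample the unobserved parent
        d, w = 0, 1.0
        for gpa in observed_gpas:                   # accumulate evidence
            d, w = llw_update(d, w, mass(nationality, gpa), density(nationality, gpa))
        samples.append((d, w, 1.0 if nationality == "USA" else 0.0))
    return llw_estimate(samples)


print(llw_posterior_usa([4.0]))  # only mass-weighted (USA) samples survive
print(llw_posterior_usa([3.5]))  # all samples carry one density; ordinary weighting
```

When every sample has the same density count, the procedure reduces to ordinary likelihood weighting, which is why the second call returns a value strictly between 0 and 1.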

In order to prove Theorem 4.1, the main technique we adopt is to use a more restricted algorithm, the Iterative Refinement Likelihood Weighting (IRLW) as a reference.

4.1.1 Iterative refinement likelihood weighting

Suppose we want to approximate the posterior distribution of an $\mathcal{X}$-valued random variable $X$ conditional on a $\mathcal{Y}$-valued random variable $Y$, for arbitrary measure spaces $\mathcal{X}$ and $\mathcal{Y}$. In general, there is no notion of a probability density of $Y$ given $X$ with which to weight samples. If, however, we could make a discrete approximation $Y'$ of $Y$, then we could weight samples by the probability $P(Y' = y' \mid X)$. If we increase the accuracy of the approximation with the number of samples, this should converge in the limit. We show this is possible if we are careful about how we approximate:

Definition 4.2.

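An approximation scheme for a measurable space $\mathcal{Y}$ consists of a measurable space $\mathcal{A}$ and measurable approximation functions $\alpha_i : \mathcal{Y} \to \mathcal{A}$ for $i = 1, 2, \dots$ and $\alpha_i^j : \mathcal{A} \to \mathcal{A}$ for $i < j$ such that $\alpha_i^j \circ \alpha_j = \alpha_i$ and $y$ can be measurably recovered from the subsequence $\alpha_i(y), \alpha_{i+1}(y), \dots$ for any $i > 0$.

When $Y$ is a real-valued variable we will use the approximation scheme $\alpha_n(y) = 2^{-n} \lceil 2^n y \rceil$, where $\lceil r \rceil$ denotes the ceiling of $r$, i.e., the smallest integer no smaller than it. Observe in this case that $P(\alpha_n(Y) = \alpha_n(y)) = P(\alpha_n(y) - 2^{-n} < Y \le \alpha_n(y))$, which we can compute from the CDF of $Y$.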
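A ceiling-based scheme on the reals is easy to experiment with. The sketch below assumes a dyadic grid of spacing $2^{-n}$ at level $n$ (an assumption of this example) and computes the probability of a grid cell from a user-supplied CDF; the names `alpha` and `cell_probability` are hypothetical.

```python
import math


def alpha(n, y):
    """Round y up to the dyadic grid of spacing 2**-n."""
    scale = 2.0 ** n
    return math.ceil(y * scale) / scale


def cell_probability(cdf, n, y):
    """P(alpha_n(Y) = alpha_n(y)) = CDF(a) - CDF(a - 2**-n), where a = alpha_n(y)."""
    a = alpha(n, y)
    return cdf(a) - cdf(a - 2.0 ** -n)


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1], used only as a test case."""
    return min(1.0, max(0.0, x))


print(alpha(3, 0.3))                          # 0.3 rounded up to a multiple of 1/8
print(alpha(2, alpha(5, 0.3)), alpha(2, 0.3)) # a coarse level is recoverable from a finer one
print(cell_probability(uniform_cdf, 3, 0.3))  # probability of the cell (0.25, 0.375]
```

The second line illustrates the compatibility requirement of Def. 4.2: applying the coarse rounding to a finer approximation gives the same value as rounding the original point.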

Lemma 4.3.

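If $X, Y$ are real-valued random variables with $\mathbb{E}|X| < \infty$, then $\lim_{i \to \infty} \mathbb{E}[X \mid \alpha_i(Y)] = \mathbb{E}[X \mid Y]$.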

Proof.

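Let $\mathcal{G}_i = \sigma(\alpha_i(Y))$ be the sigma-algebra generated by $\alpha_i(Y)$. Whenever $i < j$ we have $\alpha_i(Y) = \alpha_i^j(\alpha_j(Y))$ and so $\mathcal{G}_i \subseteq \mathcal{G}_j$. This means $\mathbb{E}[X \mid \mathcal{G}_i]$ is a martingale, so we can use martingale convergence results. In particular, since $\mathbb{E}|X| < \infty$,

$$\mathbb{E}[X \mid \mathcal{G}_i] \to \mathbb{E}[X \mid \mathcal{G}_\infty] \quad \text{almost surely and in } L^1,$$

where $\mathcal{G}_\infty$ is the sigma-algebra generated by $\bigcup_i \mathcal{G}_i$ (see Theorem 7.23 in (Kallenberg, 2002)).

$Y$ is a measurable function of the sequence $(\alpha_1(Y), \alpha_2(Y), \dots)$, as $Y = \lim_{i \to \infty} \alpha_i(Y)$, and so $\sigma(Y) \subseteq \mathcal{G}_\infty$. By definition the sequence $(\alpha_1(Y), \alpha_2(Y), \dots)$ is a measurable function of $Y$, and so $\mathcal{G}_\infty \subseteq \sigma(Y)$, and so $\mathbb{E}[X \mid \mathcal{G}_\infty] = \mathbb{E}[X \mid Y]$, giving our result. ∎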

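Iterative refinement likelihood weighting (IRLW) samples $x^{(1)}, \dots, x^{(K)}$ from the prior and, for a query function $h$ and observed value $y$, evaluates:

$$\frac{\sum_{k=1}^{K} P(\alpha_K(Y) = \alpha_K(y) \mid X = x^{(k)})\, h(x^{(k)})}{\sum_{k=1}^{K} P(\alpha_K(Y) = \alpha_K(y) \mid X = x^{(k)})}. \qquad (3)$$

Using Lemma 4.3, G.12, and G.13, we can show IRLW is consistent.

Theorem 4.4.

IRLW is consistent: (3) converges almost surely to $\mathbb{E}[h(X) \mid Y = y]$.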
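The effect of refining the approximation can be seen on a GPA-style model. The Python sketch below weights prior samples by the probability of a shrinking cell around the observation, computed from the conditional CDF. The model numbers, the dyadic cell width, and the choice to pass the refinement level `n` separately from the sample count are all assumptions of this illustration.

```python
import math
import random

P_PERFECT = 0.01                       # assumed point mass on a perfect GPA
MAX_GPA = {"USA": 4.0, "India": 10.0}  # assumed GPA scales


def gpa_cdf(nationality, x):
    """CDF of the assumed mixture: uniform part on (0, top), atom at top."""
    top = MAX_GPA[nationality]
    if x < 0.0:
        return 0.0
    if x >= top:
        return 1.0
    return (1.0 - P_PERFECT) * x / top


def alpha(n, y):
    """Round y up to the dyadic grid of spacing 2**-n."""
    scale = 2.0 ** n
    return math.ceil(y * scale) / scale


def irlw_weight(nationality, n, y):
    """Probability that the GPA falls in the level-n cell containing y."""
    a = alpha(n, y)
    return gpa_cdf(nationality, a) - gpa_cdf(nationality, a - 2.0 ** -n)


def irlw_posterior_usa(y, n, num_samples=10_000, seed=0):
    rng = random.Random(seed)
    total = usa = 0.0
    for _ in range(num_samples):
        nationality = rng.choice(["USA", "India"])  # sample from the prior
        w = irlw_weight(nationality, n, y)
        total += w
        if nationality == "USA":
            usa += w
    return usa / total


for level in (0, 5, 10, 20, 30):
    print(level, round(irlw_posterior_usa(4.0, level), 6))
```

As the cell shrinks, the weight of an India sample vanishes while the weight of a USA sample tends to the point mass, so the estimate approaches 1.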

4.1.2 Proof of Theorem 4.1

Now we are ready to prove Theorem 4.1.

Proof of Theorem 4.1.

We only prove the case where the evidence variables are leaves. It is straightforward to extend the proof to the case where the evidence variables are non-leaf nodes. Let $x$ be a sample produced by the algorithm with number of densities $d$ and weight $w$. With $I_\epsilon$ a $2\epsilon$-cube around $x_{1:M}$ we have

$$\lim_{\epsilon \to 0} \frac{P(X_{1:M} \in I_\epsilon \mid x)}{(2\epsilon)^d} = w.$$

Using $I_\epsilon$ as an approximation scheme by Def. 4.2, the numerator in the above limit is the weight used by IRLW. But given the above limit, using $(2\epsilon)^d w$ as the weight will give the same result in the limit. Then if we have $K$ samples, in the limit of $\epsilon \to 0$ only those samples with minimal $d$ will contribute to the estimation, and up to normalization they will contribute weight $w$ to the estimation. ∎

4.2 Lexicographic particle filter

Likelihood-weighting-based algorithms suffer from the curse of dimensionality. One important class of high-dimensional models is state space models. A state space model (SSM) consists of latent states $\{X_t\}_{0 \le t \le T}$ and observations $\{Y_t\}_{0 \le t \le T}$ with a special dependency structure where $\mathrm{Pa}(Y_t) = X_t$ and $\mathrm{Pa}(X_t) = X_{t-1}$ for $0 < t \le T$.

Sequential Monte Carlo (SMC) (Doucet et al., 2001), i.e., the particle filter, is a widely adopted class of methods for inference on SSMs. Given the observed variables $\{Y_i\}_{0 \le i \le t}$, the posterior distribution $P(X_t \mid Y_{0:t})$ is approximated by a set of $K$ particles, where each particle $x_t^{(k)}$ represents a sample of $\{X_i\}_{0 \le i \le t}$. Particles are propagated forward through the transition model and resampled at each time step $t$ according to the weight of each particle, which is defined by the likelihood of the observation $Y_t$.

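In the MTBN setting, the distribution of $Y_t$ given its parent $X_t$ can be a mixture of a density $f_t(y_t \mid x_t)$ and a discrete distribution $F_t(y_t \mid x_t)$. Hence, the resampling step in the particle filter should be modified accordingly: following the idea from LLW, when computing the weight of a particle, we enumerate all the observations $y_{t,i}$ at time step $t$ and again update a tuple $(d, w)$, initially $(0, 1)$, by

$$(d, w) \leftarrow \begin{cases} (d,\; w \cdot F_t(y_{t,i} \mid x_t)) & \text{if } F_t(y_{t,i} \mid x_t) > 0, \\ (d + 1,\; w \cdot f_t(y_{t,i} \mid x_t)) & \text{otherwise.} \end{cases} \qquad (4)$$

(Footnote 2: there can be multiple variables observed at each time step; here the notation $y_t$ denotes $\{y_{t,i}\}_i$ for conciseness.)

We discard all those particles with a non-minimum $d$ value and then perform the normal resampling step. We call this algorithm the lexicographic particle filter (LPF), which is summarized in Alg. 2.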

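  Require: densities $f$, masses $F$, evidences $Y_{0:T}$, and $K$
  for $t = 0, \dots, T$ do
    for $k = 1, \dots, K$ do
      $x_t^{(k)} \leftarrow$ sample from transition
      compute $(d^{(k)}, w^{(k)})$ by Eq. 4
    end for
    $d^* \leftarrow \min_k d^{(k)}$
    $w^{(k)} \leftarrow 0$ for every $k$ with $d^{(k)} > d^*$
    Output $\sum_k w^{(k)} h(x_t^{(k)}) / \sum_k w^{(k)}$
    resample particles according to $w^{(k)}$
  end for
Algorithm 2 Lexicographic Particle Filter (LPF)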
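Alg. 2 can be sketched in Python on a toy one-dimensional model. The model is entirely hypothetical and chosen only to exercise the lexicographic step: a Gaussian random-walk state, and a saturating sensor that reports exactly `LIMIT` with probability `P_SATURATED` when the state exceeds `LIMIT` and otherwise reports a noisy reading. There is one observation per time step, so the tuple update is applied once per particle.

```python
import math
import random

LIMIT = 2.0          # hypothetical saturation threshold of the sensor
P_SATURATED = 0.999  # assumed mass on reading exactly LIMIT when out of range
NOISE_STD = 0.1      # assumed observation noise


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def obs_mass(y, x):
    """Discrete part of the observation model."""
    return P_SATURATED if x > LIMIT and y == LIMIT else 0.0


def obs_density(y, x):
    """Continuous part of the observation model."""
    if x > LIMIT:
        return (1.0 - P_SATURATED) * normal_pdf(y, LIMIT, NOISE_STD)
    return normal_pdf(y, x, NOISE_STD)


def lexicographic_particle_filter(observations, num_particles=2000, seed=0):
    """Return the filtered posterior mean of the state at every time step."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(num_particles)]  # prior on x_0
    filtered_means = []
    for t, y in enumerate(observations):
        if t > 0:  # propagate through the assumed random-walk transition
            particles = [rng.gauss(x, 1.0) for x in particles]
        tuples = []
        for x in particles:
            m = obs_mass(y, x)
            tuples.append((0, m) if m > 0.0 else (1, obs_density(y, x)))
        # Keep only particles with the minimal number of densities.
        d_star = min(d for d, w in tuples if w > 0.0)
        weights = [w if d == d_star else 0.0 for d, w in tuples]
        total = sum(weights)
        filtered_means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        particles = rng.choices(particles, weights=weights, k=num_particles)
    return filtered_means


print(lexicographic_particle_filter([LIMIT, LIMIT, LIMIT]))
print(lexicographic_particle_filter([0.5, 0.7]))
```

With saturated readings, only out-of-range particles survive each step, so every filtered mean lies above `LIMIT`; with ordinary readings all particles share the same density count and the filter behaves like a standard particle filter.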

The following theorem guarantees the correctness of LPF. Its proof follows from the analysis of LLW and the classical argument that extends importance sampling to particle filtering.

Theorem 4.5.

LPF is consistent: the outputs of Alg. 2 converge almost surely to $\{\mathbb{E}[h(X_t) \mid Y_{0:t}]\}_{0 \le t \le T}$ for a query function $h$.

5 Generalized Probabilistic Programming Languages

In Section 3 and Section 4 we provided the theoretical foundation of MTBNs and general inference algorithms. This section describes how to incorporate MTBNs into a practical PPL. Here our focus is a widely used open-universe PPL, BLOG (Milch, 2006). We define the generalized BLOG language, measure-theoretic BLOG, and prove that every well-formed measure-theoretic BLOG model corresponds to a unique MTBN. Note that our approach also applies to other PPLs, since BLOG has been shown to have semantics equivalent to those of other PPLs (Wu et al., 2014; McAllester et al., 2008).

We begin with a brief description of the core syntax of BLOG, with particular emphasis on (1) the most important existing syntax for open-universe semantics, i.e., the number statement (the specialized syntax in BLOG for expressing models with an infinite number of variables), and (2) a newly introduced syntax to accommodate MTBNs, i.e., the Mix distribution. Further description of BLOG's syntax can be found in Li & Russell (2013).

5.1 Syntax of measure-theoretic BLOG

1 type Applicant, Country;
2 distinct Country NewZealand, India, USA;
3 #Applicant(Nationality = c) ~
4   if (c == USA) then Poisson(50)
5   else Poisson(5);
6 origin Country Nationality(Applicant);
7 random Real GPA(Applicant s) ~
8   if Nationality(s) == USA then
9     Mix({ TruncatedGauss(3, 1, 0, 4) -> 0.9998,
10          4 -> 0.0001, 0 -> 0.0001})
11  else Mix({ TruncatedGauss(5, 4, 0, 10) -> 0.989,
12          10 -> 0.009, 0 -> 0.002});
13 random Applicant David ~
14   UniformChoice({a for Applicant a});
15 obs GPA(David) = 4;
16 query Nationality(David) = USA;
Figure 1: BLOG code for the GPA example.

Fig. 1 shows a BLOG model with measure-theoretic extensions for a multi-student GPA example. Line 1 declares two types, Applicant and Country. Line 2 defines three distinct countries, New Zealand, India and USA, with the keyword distinct. Lines 3 to 5 define a number statement, which states that the number of US applicants follows a Poisson distribution with a higher mean than the number of applicants from New Zealand or India. Line 6 defines an origin function, which maps the object being generated to the arguments that were used in the number statement responsible for generating it. Here Nationality maps applicants to their nationalities. Lines 7 and 13 define two random variables with the keyword random. Lines 7 to 12 state that the GPA of an applicant is distributed as a mixture of weighted discrete and continuous distributions. For US applicants, the range of values follows a truncated Gaussian with bounds 0 and 4 (line 9). The probability mass outside the range is attributed to the corresponding bounds, each of which receives probability 0.0001 (line 10). GPA distributions for other countries are specified similarly. Line 13 defines a random applicant David. Line 15 states that David's GPA is observed to be 4, and line 16 queries whether David is from the USA.

Number Statement (line 3 to 5)
Figure 2: Syntax of number statements

Fig. 2 shows the syntax of a number statement for . In this specification, are origin functions (discussed below); are tuples of arguments drawn from ; are first-order formulas with free variables ; are tuples of expressions over a subset of ; and specify kernels where is the type of the expression .

The arguments provided in a number statement allow one to utilize information about the rest of the model (and possibly other generated objects) while describing the number of objects that should be generated for each type. These assignments can be recovered using the origin functions , each of which is declared as:

where is the type of the argument in the number statement of where was used. The value of the variable used in the number statement that generated , an element of the universe, is given by . Line 6 in Fig. 1 is an example of origin function.

Mixture Distribution (line 9 to 12)

In measure-theoretic BLOG, we introduce a new distribution, the mixture distribution (e.g., lines 9-10 in Fig. 1). A mixture distribution is specified as:

Mix({ $d_1$ -> $w_1$, ..., $d_k$ -> $w_k$ })

where the $d_i$ are arbitrary distributions, and the $w_i$ are arbitrary real-valued functions that sum to 1 for every possible assignment to their arguments: $\sum_{i=1}^{k} w_i = 1$. Note that in our implementation of measure-theoretic BLOG, we only allow the Mix distribution to express a mixture of densities and masses, which simplifies the system design, although it is still possible to express the same semantics without Mix.

5.2 Semantics of measure-theoretic BLOG

In this section we study semantics of measure-theoretic BLOG and its theoretical properties. Every BLOG model implicitly defines a first-order vocabulary consisting of the set of functions and types mentioned in the model. BLOG’s semantics are based on the standard, open-universe semantics of first-order logic. We first define the set of all possible elements that may be generated for a BLOG model.

Definition 5.1.

The set of possible elements for a BLOG model with types is , where

  • , is a distinct constant in

  • , where is a number statement of type , is a tuple of elements of the type of from ,

Def. 5.1 allows us to define the set of random variables corresponding to a BLOG model.

Definition 5.2.

The set of basic random variables for a BLOG model , , consists of:

  • for each number statement , a number variable over the standard measurable space , where is of the type of .

  • for each function and tuple from of the type of , a function application variable with the measurable space , where is the measurable space corresponding to , the return type of .

We now define the space of consistent assignments to random variables.

Definition 5.3.

An instantiation of the basic RVs defined by a BLOG model is consistent if and only if:

  • For every element used in an assignment of the form or , ;

  • For every fixed function symbol with the interpretation , ; and

  • For every element , generated by the number statement , with origin functions , for every , . That is, origin functions give correct inverse maps.

Lemma 5.4.

Every consistent assignment to the basic RVs for defines a unique possible world in the vocabulary of .

The proof of Lemma 5.4 is in Appx. F. In the following definition, we use the notation to denote a substitution of every occurrence of the variable with in the expression . For any BLOG model , let ; for each , is the measurable space corresponding to . Let consist of the following edges for every number statement or function application statement of the form :

  • The edge if is a function symbol in such that appears in , and either or an occurrence of in uses quantified variables , is a tuple of elements of the type of and .

  • The edge , for element .

Note that the first set of edges defined above may include infinitely many parents for . Let the dependency statement in the BLOG model corresponding to a number or function variable be . Let be the set of expressions used in . Each such statement then defines, in a straightforward manner, a kernel . In order to ensure consistent assignments, we include a special value for each in , and require that whenever violates the first condition of consistent assignments (Def. 5.3). In other words, all the local kernels are locally consistent: variables involving an object get a non-null assignment only if the assignment to its number statement represents the generation of at least objects (). Each kernel of the form can be transformed into a kernel from its parent vertices (representing basic random variables) by composing the kernels determining the truth value of each expression in terms of the basic random variables, with the kernel . Let .

Definition 5.5.

The network for a BLOG model is defined using , the set of measurable spaces and the kernels for each vertex given by .

By Thm. 3.6, we have the main result of this section:

Theorem 5.6.

If a BLOG model ’s network is a well-founded digraph, then represents a unique measure on .

This theorem provides the theoretical foundation of the generalized BLOG language.

1 fixed Real sigma = 1.0; // stddev of observation
2 random Real FakeCoinDiff ~
3  TruncatedGaussian(0.5, 1, 0.1, 1);
4 random Bool hasFakeCoin ~ BooleanDistrib(0.5);
5 random Real obsDiff ~ if hasFakeCoin
6   then Gaussian(FakeCoinDiff, sigma*sigma)
7   else Mix({ 0 -> 1.0 });
8 obs obsDiff = 0;
9 query hasFakeCoin;
Figure 3: BLOG code for the Scale example

6 Experiment Results

Figure 4: Experiment results on (a) the GPA model, (b) the noisy scale model and (c) the aircraft-tracking model.

We implement the measure-theoretic extension of BLOG and evaluate our inference algorithms on three models where naive algorithms fail: (1) the GPA model (GPA); (2) the noisy scale model (Scale); and (3) an SSM, the aircraft tracking model (Aircraft-Tracking). The implementation is based on BLOG's C++ compiler (Wu et al., 2016).

GPA model: Fig. 1 contains the BLOG code for the GPA example as explained in Sec. 5. Since the GPA of David is exactly 4, Bayes' rule implies that David must be from the USA. We evaluate LLW and the naive LW on this model in Fig. 4(a), where the naive LW produces a completely wrong answer.

Scale model: In the noisy scale model (Fig. 3), we have an even number of coins and there might be a fake coin among them (Line 4). The fake coin will be slightly heavier than a normal coin (Lines 2-3). We divide the coins into two halves and place them onto a noisy scale. When there is no fake coin, the scale always balances (Line 7). When there is a fake coin, the scale will noisily reflect the weight difference with standard deviation $\sigma$ (sigma in Line 6). Now we observe that the scale is balanced (Line 8) and we would like to infer whether a fake coin exists. We again compare LLW against the naive LW with different choices of the parameter $\sigma$ in Fig. 4(b). Since the scale is precisely balanced, there must not be a fake coin. LLW always produces the correct answer, while for the naive LW the result is incorrect and depends heavily on the parameter $\sigma$: as $\sigma$ increases, the output approaches the true probability.
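The dependence of the naive estimate on the noise level can be reproduced with a short Python sketch of the Scale model. It assumes that `TruncatedGaussian(0.5, 1, 0.1, 1)` means mean 0.5, variance 1, and bounds 0.1 and 1, and that the second argument of `Gaussian` is a variance; both readings are assumptions about the listing. The helper names are hypothetical, and the truncated Gaussian is drawn by simple rejection sampling.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def sample_truncated_gauss(rng, mean, std, lo, hi):
    """Rejection sampling from a Gaussian restricted to [lo, hi]."""
    while True:
        x = rng.gauss(mean, std)
        if lo <= x <= hi:
            return x


def scale_posteriors(sigma, num_samples=20_000, seed=0):
    """Return (naive LW, LLW) estimates of P(hasFakeCoin | obsDiff = 0)."""
    rng = random.Random(seed)
    samples = []  # tuples (number of densities, weight, indicator of a fake coin)
    for _ in range(num_samples):
        if rng.random() < 0.5:  # hasFakeCoin
            diff = sample_truncated_gauss(rng, 0.5, 1.0, 0.1, 1.0)
            samples.append((1, normal_pdf(0.0, diff, sigma), 1.0))  # density weight
        else:
            samples.append((0, 1.0, 0.0))  # point mass at 0 with probability 1
    naive = sum(w * f for _, w, f in samples) / sum(w for _, w, _ in samples)
    d_star = min(d for d, _, _ in samples)
    kept = [(w, f) for d, w, f in samples if d == d_star]
    llw = sum(w * f for w, f in kept) / sum(w for w, _ in kept)
    return naive, llw


for sigma in (1.0, 4.0, 16.0):
    naive, llw = scale_posteriors(sigma)
    print(f"sigma={sigma}: naive={naive:.3f}, LLW={llw:.3f}")
```

In this sketch the lexicographic estimate is 0 for every noise level, while the naive estimate is positive and shrinks as the noise level grows over the values tried here.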

Aircraft-Tracking model: Fig. 5 shows a simplified BLOG model for the aircraft tracking example. In this state space model, we have six radar points (Line 1) and a single aircraft to track. Both the radars and the aircraft are considered as points on a 2D plane. The prior of the aircraft movement is a Gaussian process (Lines 3 to 6). Each radar r has an effective range radius(r): if the aircraft is within the range, the radar noisily measures the distance from the aircraft to its own location (Line 13); if the aircraft is out of range, the radar almost surely outputs just its radius (Lines 10 to 11). Now we observe the measurements from all the radar points over a sequence of time steps and we want to infer the location of the aircraft. With the measure-theoretic extension, the generalized BLOG program provides more expressive power for modelling truncated sensors: if a radar outputs exactly its radius, we can infer with certainty that the aircraft must be out of the effective range of this radar. However, this information cannot be captured by the original BLOG PPL. To illustrate this case, we manually generate a synthetic dataset (the full code with detailed data is at https://goo.gl/f7qLwy) and evaluate LPF against the naive particle filter with different numbers of particles in Fig. 4(c). We take the mean of the samples from all the particles as the predicted aircraft location. Since we know the ground truth, we measure the average mean square error between the true location and the prediction. LPF accurately predicts the true locations, while the naive PF converges to wrong results.

1  type t_radar; distinct t_radar R[6];
2  // model aircraft movement
3  random Real X(Timestep t) ~ if t == @0
4    then Gaussian(2, 1) else Gaussian(X(prev(t)), 4);
5  random Real Y(Timestep t) ~ if t == @0
6    then Gaussian(-1, 1) else Gaussian(Y(prev(t)), 4);
7  // observation model of radars
8  random Real obs_dist(Timestep t, t_radar r) ~
9    if dist(X(t),Y(t),r) > radius(r) then
10     mixed({radius(r)->0.999,
11            TruncatedGauss(radius(r),0.01,0,radius(r))->0.001})
12   else
13     TruncatedGauss(dist(X(t),Y(t),r),0.01,0,radius(r));
14 // observation and query
15 obs obs_dist(@0, R[0]) = ...;
16 ... // evidence numbers omitted
17 query X(t) for Timestep t;
18 query Y(t) for Timestep t;
Figure 5: BLOG code for the Aircraft-Tracking example
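The role of the mixed observation model in Fig. 5 can be illustrated with a small Python sketch of a lexicographic weight for a single particle. Several details here are assumptions rather than facts from the paper: the second argument of TruncatedGauss is treated as a variance, the function names are hypothetical, and per-radar weights are combined by adding density counts and multiplying values.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_gauss_pdf(x, mean, var, lower, upper):
    """Density of a Gaussian(mean, var) truncated to [lower, upper]."""
    if x < lower or x > upper:
        return 0.0
    std = math.sqrt(var)
    mass = normal_cdf((upper - mean) / std) - normal_cdf((lower - mean) / std)
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi) * mass)


def radar_weight(true_dist, radius, obs):
    """Lexicographic weight (density count, value) of one radar reading."""
    if true_dist > radius:
        if obs == radius:
            # Point mass of the mixture: a probability, so no density factor.
            return (0, 0.999)
        return (1, 0.001 * truncated_gauss_pdf(obs, radius, 0.01, 0.0, radius))
    # In range: the reading is a noisy, truncated distance measurement.
    return (1, truncated_gauss_pdf(obs, true_dist, 0.01, 0.0, radius))


def particle_weight(readings):
    """Combine per-radar weights: density counts add, values multiply."""
    count, value = 0, 1.0
    for true_dist, radius, obs in readings:
        d, w = radar_weight(true_dist, radius, obs)
        count += d
        value *= w
    return (count, value)
```

When a radar reports exactly its radius, a particle that places the aircraft out of range receives a weight with fewer density factors than a particle that places it in range, so under lexicographic comparison the out-of-range hypothesis dominates regardless of the numeric values.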

7 Conclusion

In this paper, we proposed measure-theoretic Bayesian networks (MTBNs), a general framework that extends existing PPL semantics to support random variables over arbitrary measure spaces, together with provably correct inference algorithms for handling discrete-continuous mixtures in this framework. We also incorporated MTBNs into a widely used PPL, BLOG, through a simple syntax extension and implemented the algorithms in the generalized BLOG system, which makes the PPL practical for a much larger domain of applications.

We also believe that, together with the foundational inference algorithms, our rigorous framework will facilitate the development of powerful techniques for probabilistic reasoning in practical applications across a much wider range of scientific areas.

Acknowledgment

This work is supported by the DARPA PPAML program, contract FA8750-14-C-0011. Simon Du is funded by NSF grant IIS1563887, AFRL grant FA8750-17-2-0212 and DARPA D17AP00001.

References

  • Carpenter et al. (2016) Carpenter, Bob, Gelman, Andrew, Hoffman, Matt, Lee, Daniel, Goodrich, Ben, Betancourt, Michael, Brubaker, Michael A, Guo, Jiqiang, Li, Peter, Riddell, Allen, et al. Stan: A probabilistic programming language. Journal of Statistical Software, 20(2):1–37, 2016.
  • Choi et al. (2010) Choi, Jaesik, Amir, Eyal, and Hill, David J. Lifted inference for relational continuous models. In UAI, volume 10, pp. 126–134, 2010.
  • Doucet et al. (2001) Doucet, Arnaud, De Freitas, Nando, and Gordon, Neil. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice, pp. 3–14. Springer, 2001.
  • Durrett (2013) Durrett, Rick. Probability: Theory and Examples. Cambridge University Press, 2013.
  • Gao et al. (2017) Gao, Weihao, Kannan, Sreeram, Oh, Sewoong, and Viswanath, Pramod. Estimating mutual information for discrete-continuous mixtures. In Advances in Neural Information Processing Systems, pp. 5988–5999, 2017.
  • Goodman (2013) Goodman, Noah D. The principles and practice of probabilistic programming. In ACM SIGPLAN Notices, volume 48, pp. 399–402. ACM, 2013.
  • Goodman & Stuhlmüller (2014) Goodman, Noah D and Stuhlmüller, Andreas. The Design and Implementation of Probabilistic Programming Languages. http://dippl.org, 2014. Accessed: 2018-6-5.
  • Goodman et al. (2008) Goodman, Noah D, Mansinghka, Vikash K, Roy, Daniel M, Bonawitz, Keith, and Tenenbaum, Joshua B. Church: A language for generative models. In UAI-08, 2008.
  • Gutmann et al. (2011a) Gutmann, Bernd, Jaeger, Manfred, and De Raedt, Luc. Extending ProbLog with continuous distributions. In Inductive Logic Programming, pp. 76–91. Springer, 2011a.
  • Gutmann et al. (2011b) Gutmann, Bernd, Thon, Ingo, Kimmig, Angelika, Bruynooghe, Maurice, and De Raedt, Luc. The magic of logical inference in probabilistic programming. Theory and Practice of Logic Programming, 11(4-5):663–680, 2011b.
  • Jech (2003) Jech, Thomas. Set theory. Springer, 2003.
  • Kallenberg (2002) Kallenberg, Olav. Foundations of Modern Probability. Springer, 2002.
  • Kersting & De Raedt (2007) Kersting, Kristian and De Raedt, Luc. Bayesian logic programming: Theory and tool. Statistical Relational Learning, pp. 291, 2007.
  • Kharchenko et al. (2014) Kharchenko, Peter V, Silberstein, Lev, and Scadden, David T. Bayesian approach to single-cell differential expression analysis. Nature methods, 11(7):740, 2014.
  • Koller et al. (1997) Koller, Daphne, McAllester, David, and Pfeffer, Avi. Effective Bayesian inference for stochastic programs. In AAAI-97, 1997.
  • Kulkarni et al. (2015) Kulkarni, Tejas D, Kohli, Pushmeet, Tenenbaum, Joshua B, and Mansinghka, Vikash. Picture: A probabilistic programming language for scene perception. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4390–4399, 2015.
  • Lake et al. (2015) Lake, Brenden M, Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • Le et al. (2017) Le, Tuan Anh, Baydin, Atilim Gunes, and Wood, Frank. Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, pp. 1338–1348, 2017.
  • Li & Russell (2013) Li, Lei and Russell, Stuart J. The BLOG language reference. Technical Report UCB/EECS-2013-51, EECS Department, University of California, Berkeley, 2013.
  • Mansinghka et al. (2014) Mansinghka, Vikash, Selsam, Daniel, and Perov, Yura. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
  • McAllester et al. (2008) McAllester, David, Milch, Brian, and Goodman, Noah D. Random-world semantics and syntactic independence for expressive languages. Technical report, 2008.
  • Milch et al. (2005a) Milch, Brian, Marthi, Bhaskara, Russell, Stuart J., Sontag, David, Ong, Daniel L., and Kolobov, Andrey. BLOG: Probabilistic models with unknown objects. In Proc. of IJCAI, pp. 1352–1359, 2005a.
  • Milch et al. (2005b) Milch, Brian, Marthi, Bhaskara, Sontag, David, Russell, Stuart, Ong, Daniel L., and Kolobov, Andrey. Approximate inference for infinite contingent Bayesian networks. In Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005b. URL http://www.gatsby.ucl.ac.uk/aistats/AIabst.htm.
  • Milch (2006) Milch, Brian Christopher. Probabilistic models with unknown objects. PhD thesis, University of California at Berkeley, Berkeley, CA, USA, 2006.
  • Moore & Russell (2017) Moore, David and Russell, Stuart. Signal-based Bayesian seismic monitoring. In Artificial Intelligence and Statistics, pp. 1293–1301, 2017.
  • Nitti et al. (2016) Nitti, Davide, De Laet, Tinne, and De Raedt, Luc. Probabilistic logic programming for hybrid relational domains. Machine Learning, 103(3):407–449, 2016.
  • Pearl (1988) Pearl, Judea. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 1988.
  • Pfeffer (2009) Pfeffer, Avi. Figaro: An object-oriented probabilistic programming language. Charles River Analytics Technical Report, 137:96, 2009.
  • Pierson & Yau (2015) Pierson, Emma and Yau, Christopher. Zifa: Dimensionality reduction for zero-inflated single-cell gene expression analysis. Genome biology, 16(1):241, 2015.
  • Ramsey (2016) Ramsey, Norman. All you need is the monad.. what monad was that again. In PPS Workshop, 2016.
  • Ritchie et al. (2015) Ritchie, Daniel, Lin, Sharon, Goodman, Noah D, and Hanrahan, Pat. Generating design suggestions under tight constraints with gradient-based probabilistic programming. In Computer Graphics Forum, volume 34, pp. 515–526. Wiley Online Library, 2015.
  • Shan & Ramsey (2017) Shan, Chung-chieh and Ramsey, Norman. Exact Bayesian inference by symbolic disintegration. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, pp. 130–144. ACM, 2017.
  • Singla & Domingos (2007) Singla, Parag and Domingos, Pedro. Markov logic in infinite domains. In Proc. UAI-07, 2007.
  • Staton (2017) Staton, Sam. Commutative semantics for probabilistic programming. In European Symposium on Programming, pp. 855–879. Springer, 2017.
  • Stuhlmüller & Goodman (2014) Stuhlmüller, Andreas and Goodman, Noah D. Reasoning about reasoning by nested conditioning: Modeling theory of mind with probabilistic programs. Cognitive Systems Research, 28:80–99, 2014.
  • Tolpin et al. (2016) Tolpin, David, van de Meent, Jan Willem, Yang, Hongseok, and Wood, Frank. Design and implementation of probabilistic programming language Anglican. arXiv preprint arXiv:1608.05263, 2016. URL https://github.com/probprog/anglican-examples/blob/master/worksheets/indian-gpa.clj.
  • Tran et al. (2016) Tran, Dustin, Kucukelbir, Alp, Dieng, Adji B., Rudolph, Maja, Liang, Dawen, and Blei, David M. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.
  • Wang & Domingos (2008) Wang, Jue and Domingos, Pedro. Hybrid Markov logic networks. In AAAI, volume 8, pp. 1106–1111, 2008.
  • Wood et al. (2014a) Wood, Frank, van de Meent, Jan Willem, and Mansinghka, Vikash. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics, pp. 1024–1032, 2014a.
  • Wood et al. (2014b) Wood, Frank, van de Meent, Jan Willem, and Mansinghka, Vikash. A new approach to probabilistic programming inference. In Proceedings of the 17th International conference on Artificial Intelligence and Statistics, pp. 1024–1032, 2014b.
  • Wu et al. (2014) Wu, Yi, Li, Lei, and Russell, Stuart. BFiT: From possible-world semantics to random-evaluation semantics in open universe. 3rd NIPS Workshop on Probabilistic Programming, 2014.
  • Wu et al. (2016) Wu, Yi, Li, Lei, Russell, Stuart, and Bodik, Rastislav. Swift: Compiled inference for probabilistic programming languages. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2016.

Appendix A Background on Measure-theoretical Probability Theory

We assume familiarity with measure-theoretic approaches to probability theory, but provide the fundamental definitions. The standard Borel $\sigma$-algebra is assumed throughout. See (Durrett, 2013) and (Kallenberg, 2002) for an introduction and further details.

A measurable space $(X, \mathcal{X})$ (space, for short) is an underlying set $X$ paired with a $\sigma$-algebra $\mathcal{X} \subseteq 2^X$ of measurable subsets of $X$, i.e., a family of subsets containing the underlying set $X$ which is closed under complements and countable unions. We’ll denote the measurable space simply by $X$ where no ambiguity results. A function $f : X \to Y$ between measurable spaces is measurable if measurable sets pull back to measurable sets: $f^{-1}(B) \in \mathcal{X}$ for all $B \in \mathcal{Y}$. A measure $\mu$ on a measurable space $X$ is a function $\mu : \mathcal{X} \to [0, \infty]$ which satisfies countable additivity: $\mu(\bigcup_i A_i) = \sum_i \mu(A_i)$ for any countable sequence of disjoint measurable sets $A_1, A_2, \ldots \in \mathcal{X}$. $P_\mu[\cdot]$ denotes the probability of a statement under the base measure $\mu$, and similarly for conditional probabilities. A probability kernel is the measure-theoretic generalization of a conditional distribution. It is commonly used to construct measures over a product space, analogously to how conditional distributions are used to define joint distributions in the chain rule.

Definition A.1.

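A probability kernel $K$ from one measurable space $X$ to another $Y$ is a function $K : X \times \mathcal{Y} \to [0, 1]$ such that (a) for every $x \in X$, $K(x, \cdot)$ is a probability measure over $Y$, and (b) for every $B \in \mathcal{Y}$, $K(\cdot, B)$ is a measurable function from $X$ to $[0, 1]$.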
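As an illustration of this definition (an example constructed here, not taken from the paper), consider a kernel from the real line to itself that mixes a point mass with a Gaussian density. The constants $p \in [0, 1]$ and $c \in \mathbb{R}$ are hypothetical parameters chosen for the example:

```latex
K(x, B) \;=\; p \,\mathbf{1}[c \in B]
\;+\; (1 - p) \int_B \frac{1}{\sqrt{2\pi}}
\exp\!\Big(-\frac{(y - x)^2}{2}\Big)\, dy,
\qquad x \in \mathbb{R},\; B \text{ Borel}.
```

For each fixed $x$, $K(x, \cdot)$ is a probability measure with an atom at $c$ and a continuous part centered at $x$, and for each fixed Borel set $B$, $K(\cdot, B)$ is a measurable function of $x$, so both conditions of the definition hold. Such a kernel has no density with respect to Lebesgue measure alone, which is the kind of discrete-continuous mixture the paper targets.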

Given an arbitrary index set $T$ and spaces $X_t$ for each index $t \in T$, the product space $X_T = \prod_{t \in T} X_t$ is the space with underlying set the Cartesian product of the underlying sets, adorned with the smallest $\sigma$-algebra such that the projection functions $\pi_t : X_T \to X_t$ are measurable.

Appendix B MTBNs Represent Unique Measures

We prove here Theorem 3.6. Its proof requires a series of intermediate results. We first define a projective family of measures, which gives a way to recursively construct the desired measure. We then define a notion of consistency such that every consistent projective family constructs a measure that the MTBN represents. We end by giving an explicit characterization of the unique consistent projective family, and thus of the unique measure the MTBN represents. The appendix contains additional technical material required in the proofs.

Intuitively, the main objective of this section is to show that an MTBN defines a unique measure that “factorizes” according to the network, as an extension to the corresponding result for Bayes Nets.

B.1 Consistent projective family of measures

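Let $K$ be a kernel from $X$ to $Y$ and $L$ a kernel from $Y$ to $Z$. Their composition $K \circ L$ (note the ordering!) is a kernel from $X$ to $Z$ defined for $x \in X$, $C \in \mathcal{Z}$ by:

$$(K \circ L)(x, C) = \int_Y K(x, dy)\, L(y, C) \qquad (5)$$

To allow uniform notation, we will treat measurable functions and measures as special cases of kernels. A measurable function $f : X \to Y$ corresponds to the kernel $K_f$ from $X$ to $Y$ given by $K_f(x, B) = \mathbf{1}[f(x) \in B]$ for $x \in X$ and $B \in \mathcal{Y}$. A measure $\mu$ on a space $X$ is a kernel $K_\mu$ from $\mathbf{1} = \{\ast\}$, the one element measure space, to $X$ given by $K_\mu(\ast, A) = \mu(A)$ for $A \in \mathcal{X}$. Where this yields no confusion, we use $f$ and $\mu$ in place of $K_f$ and $K_\mu$. (5) simplifies if the kernels are measures or functions. Let $\mu$ be a measure on $X$, $K$ be a kernel from $X$ to $Y$, $f$ be a measurable function from $X$ to $Y$, and $g$ be a measurable function from $Y$ to $Z$. Then $\mu \circ f$ is a measure on $Y$ and $K \circ g$ is a kernel from $X$ to $Z$ with: $(\mu \circ f)(B) = \mu(f^{-1}(B))$, and $(K \circ g)(x, C) = K(x, g^{-1}(C))$.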
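On finite spaces these constructions become concrete, which can make the ordering convention easier to check. The following Python sketch is an illustration under assumptions, not part of the paper: a kernel is represented as a row-stochastic matrix (a list of rows), composition is the matrix product with the first kernel on the left, and all function names are hypothetical.

```python
def compose(k, l):
    """Compose kernel k (X -> Y) with kernel l (Y -> Z).

    Entry [x][z] is the sum over y of k[x][y] * l[y][z], the finite
    analogue of integrating l(y, C) against k(x, dy).
    """
    return [
        [sum(k[x][y] * l[y][z] for y in range(len(l))) for z in range(len(l[0]))]
        for x in range(len(k))
    ]


def function_kernel(f, n_in, n_out):
    """Deterministic kernel of a function f from {0..n_in-1} to {0..n_out-1}."""
    return [[1.0 if f(x) == y else 0.0 for y in range(n_out)] for x in range(n_in)]


def measure_kernel(mu):
    """A measure viewed as a kernel from the one-element space: a single row."""
    return [list(mu)]
```

For example, composing `measure_kernel(mu)` with `function_kernel(f, ...)` yields the pushforward of `mu` under `f`, the finite counterpart of a measure composed with a measurable function.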

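Let $\mathcal{U}$ denote the class of upwardly closed sets: subsets of $V$ containing all their elements’ parents.

Definition B.1.

A projective family of measures is a family $\{\mu_U : U \in \mathcal{U}\}$ consisting of a measure $\mu_U$ on $X_U$ for every $U \in \mathcal{U}$ such that whenever $W \subseteq U$ we have $\mu_W = \mu_U \circ \pi^U_W$, where $\pi^U_W : X_U \to X_W$ is the projection, i.e., for all $A \in \mathcal{X}_W$, $\mu_W(A) = \mu_U((\pi^U_W)^{-1}(A))$.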

Def. B.1 captures the measure-theoretic version of the probability of a subset of variables being equal to the marginals obtained while “summing out” the probabilities of the other variables in a joint distribution.

Definition B.2.

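Let $\mu$ be a measure on a measure space $X$, and $K$ a kernel from $X$ to a measure space $Y$. Then $\mu \otimes K$ is the measure on $X \times Y$ defined for $A \in \mathcal{X} \otimes \mathcal{Y}$ by: $(\mu \otimes K)(A) = \int \mu(dx) \int K(x, dy)\, \mathbf{1}[(x, y) \in A]$.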

Def. B.2 defines the operation of composing a conditional probability with a prior on a parent, to obtain the corresponding joint distribution.
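In the finite case this operation is the familiar product of a prior with a conditional probability table, and marginalizing the result recovers the prior. A minimal Python sketch, with hypothetical function names and a kernel stored as a list of rows:

```python
def joint(mu, k):
    """Joint table on X x Y: entry [x][y] is mu[x] * k[x][y]."""
    return [[mu[x] * k[x][y] for y in range(len(k[x]))] for x in range(len(mu))]


def marginal_x(table):
    """Sum out the second coordinate of a joint table."""
    return [sum(row) for row in table]


def marginal_y(table):
    """Sum out the first coordinate of a joint table."""
    return [sum(row[y] for row in table) for y in range(len(table[0]))]
```

Summing out the child variable returns the parent's measure unchanged, which is the finite version of the marginalization condition that a projective family must satisfy.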

Definition B.3.

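Let $K_i$ for $i \in I$ be kernels from $X$ to $Y_i$. Denote by $\prod_{i \in I} K_i$ the kernel from $X$ to $\prod_{i \in I} Y_i$ defined for each $x \in X$ by the infinite product of measures: $(\prod_{i \in I} K_i)(x, \cdot) = \bigotimes_{i \in I} K_i(x, \cdot)$.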

See (Kallenberg, 2002) 1.27 and 6.18 for definition and existence of infinite products of measures. Def. B.3 captures the kernel representation for taking the equivalent of products of conditional distributions of a set of variables with a common set of parents.

Definition B.4.

A projective family is consistent with if for any such that and , then: .

Consistency in Def. B.4 captures the global condition that we would like to see in a generalization of a Bayes network. Namely, the distribution of any set of parent-closed random variables should “factorize” according to the network.

A projective family is consistent with the MTBN exactly when the MTBN represents the corresponding measure:

Lemma B.5.

Let be a measure on , and define the projective family by . This projective family is consistent with iff represents .

Proof.

First we’ll relate consistency (Def. B.4) with conditional expectation and distribution properties of random variables. Take any such that and and observe that the following are equivalent:

  • is a version of the conditional distribution of given ,

  • is a version of the conditional distribution of given for all , and are mutually independent conditional on .

The forward direction is straightforward. For the converse we use the fact that conditional independence of families of random variables holds if it holds for all finite subsets, establishing that by chaining conditional independence (see (Kallenberg, 2002) p109 and 6.8).

Lemma B.5 shows that Def. B.4 follows iff an MTBN represents the joint distribution – in other words, it follows iff the local Markov property holds.

B.2 There exists a unique consistent family

Each vertex is assigned the unique minimal ordinal such that whenever (see (Jech, 2003) for an introduction to ordinals). For any denote by the restriction of to vertices of depth less than . Defining , the least strict upper bound on depth, we have that for all . In the following, fix a limit ordinal .
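For a finite network the ordinals involved are just natural numbers, and the depth assignment can be computed directly. The sketch below is an illustration for that finite case only; the `parents` dictionary representation and the function names are assumptions made for the example.

```python
def depths(parents):
    """Map each vertex to its depth: 0 for roots, else 1 + the largest parent depth.

    `parents` maps every vertex to the list of its parents in a finite DAG.
    """
    memo = {}

    def depth(v):
        if v not in memo:
            memo[v] = 1 + max((depth(u) for u in parents[v]), default=-1)
        return memo[v]

    for v in parents:
        depth(v)
    return memo


def restrict(vertices, depth_of, bound):
    """Vertices whose depth is strictly less than `bound`."""
    return {v for v in vertices if depth_of[v] < bound}
```

Restricting to depths below the least strict upper bound (one more than the largest depth in the finite case) returns the whole vertex set, and smaller bounds give the parent-closed prefixes over which the measure is built up inductively.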

Definition B.6.

is a projective sequence of measures on if whenever we have .

Def. B.6 generalizes the notion of subset relationships and the marginalization operations that hold between supersets and subsets to the case of infinite dependency chains.

Definition B.7.

The limit of a projective sequence of measures is the unique measure on such that for all .

Definition B.8.

Given any , inductively define a measure on by

stabilizes for to define a measure on .

The above definition is coherent as can be inductively shown to be a projective sequence. Lemma B.9 and B.10 allow us to show in Theorem B.11 that is the unique consistent projective family of measures.

Lemma B.9.

If for , then for all :

Proof is in Appx. C.

Lemma B.10.

If where , and if , then , , and

Proof is in Appx. D.

Using the above, the following shows that MTBNs satisfy properties (1-3) from footnote 1 in the introduction:

Theorem B.11.

is the unique projective family of measures consistent with .

Proof is in Appx. E.

Intuitively, by Lemma B.9 and Lemma B.10, we assert that consistency holds for any ordinal-bounded (prefix in terms of parent ordering) sub-network. Then the main result, Thm. B.11, follows by setting this bound appropriately. Finally Lemma B.5 and Theorem B.11 lead to Theorem 3.6.

Appendix C Proof for Lemma B.9

Proof.

Proof by induction. Trivially true for , so suppose this holds for , and consider . Then: