Semantics for probabilistic programming: higher-order functions, continuous distributions, and soft constraints
We study the semantic foundation of expressive probabilistic programming languages, that support higher-order functions, continuous distributions, and soft constraints (such as Anglican, Church, and Venture). We define a metalanguage (an idealised version of Anglican) for probabilistic computation with the above features, develop both operational and denotational semantics, and prove soundness, adequacy, and termination. This involves measure theory, stochastic labelled transition systems, and functor categories, but admits intuitive computational readings, one of which views sampled random variables as dynamically allocated read-only variables. We apply our semantics to validate nontrivial equations underlying the correctness of certain compiler optimisations and inference algorithms such as sequential Monte Carlo simulation. The language enables defining probability distributions on higher-order functions, and we study their properties.
CONF ’yyMonth d–d, 20yy, City, ST, Country \copyrightyear20yy \copyrightdata978-1-nnnn-nnnn-n/yy/mm \copyrightdoinnnnnnn.nnnnnnn
Sam Staton and Hongseok Yang and Frank Wood University of Oxford \authorinfoChris Heunen University of Edinburgh \authorinfoOhad Kammar University of Cambridge
Probabilistic programming is the idea to use programs to specify probabilistic models; probabilistic programming languages blend programming constructs with probabilistic primitives. This helps scientists express complicated models succinctly. Moreover, such languages come with generic inference algorithms, relieving the programmers of the nontrivial task of (algorithmically) answering queries about their probabilistic models. This is useful in e.g. machine learning.
Several higher-order probabilistic programming languages have recently attracted a substantial user base. Some languages (such as Infer.net Minka et al. , PyMC Patil et al. , and Stan Stan Development Team ) are less expressive but provide powerful inference algorithms, while others (such as Anglican Wood et al. , Church Goodman et al. , and Venture Mansinghka et al. ) have less efficient inference algorithms but more expressive power. We consider the more expressive languages, that support higher-order functions, continuous distributions, and soft constraints. More precisely, we consider a programming language (§3) with higher-order functions (§6) as well as the following probabilistic primitives.
The command draws a sample from a distribution described by , which may range over the real numbers.
- Soft constraints
The command puts a score (a positive real number) on the current execution trace. This is typically used to record that some particular datum was observed as being drawn from a particular distribution; the score describes how surprising the observation is.
The command runs a simulation algorithm over the program fragment . This takes the scores into account and returns a new, normalized probability distribution. The argument to might be a primitive distribution, or a distribution defined by normalizing another program. This is called a nested query, by analogy with database programming.
Here is a simple example of a program. We write for the Gaussian probability distribution whose density function is .
Line 2 samples from a prior Gaussian distribution. The soft constraint on Line 3 expresses the likelihood of the observed data, , coming from a Gaussian given the prior . Line 4 says that what we are actually interested in is a boolean random variable over the sample space. Line 1 calculates a posterior distribution for the return value, using the prior and the likelihood. In this example we can precisely calculate that the posterior distribution on has .
Languages like this currently lack formal exact semantics. The aim of this paper is to provide just such a foundation as a basis for formal reasoning. Most expressive probabilistic programming languages are explained in terms of their Monte Carlo simulation algorithms. The simplest such algorithm, using importance and rejection sampling, is the de facto semantics against which other algorithms are ‘proved approximately correct’. Such ‘semantics’ are hard to handle and extend.
We provide two styles of semantics, operational and denotational. For first-order probabilistic programs, the denotational semantics is straightforward: types are interpreted as measurable spaces, and terms are interpreted as measurable functions (§4). Operational semantics is more complicated. For discrete distributions, an operational semantics might be a probabilistic transition system, but for continuous distributions, it must be a stochastic relation (labelled Markov process). We resolve this by equipping the set of configurations with the structure of a measurable space (§5).
The advantage to the operational semantics is that it is easily extended to higher-order programs (§7). Denotational semantics for higher-order programs poses a problem, because measurable spaces do not support the usual theory of functions: they do not form a Cartesian closed category (indeed, does not exist as a measurable space Aumann ). Earlier work dealt with this either by excluding higher-order functions or by considering only discrete distributions. We resolve this by moving from the category of measurable spaces, where standard probability theory takes place, to a functor category based on it (§8). The former embeds in the latter, so we can still interpret first-order concepts. But the functor category does have well-behaved function spaces, so we can also interpret higher-order concepts. Moreover, by lifting the monad of probability distributions Giry  to the functor category, we can also interpret continuous distributions. Finally, we can interpret observations by considering probability distributions with continuous density, irrespective of the categorical machinery (§9).
The denotational semantics is sound and adequate with respect to the operational semantics (§5.3,8.3), which means one can use the denotational model to directly check program equations while respecting computational issues. For example:
We recall basic definitions and facts of measure theory.
A -algebra on a set is a family of subsets of , called measurable (sub)sets, which contains and is closed under complements and countable unions. A measurable space is a set with a -algebra.
A measure on a measurable space is a function such that and for each sequence of disjoint measurable sets. A probability measure or probability distribution is a measure with .
In the paper we use a few important constructions for measurable spaces. The first is to make a set into a measurable space by taking the full powerset of as , yielding a discrete measurable space. When is countable, a probability distribution on is determined by its values on singleton sets, that is, by specifying a function such that .
The second construction is to combine a collection of measurable spaces by sum or product. The underlying sets are the disjoint union and product of sets. The measurable sets in the sum are for . The -algebra of the product is the smallest one containing all the subsets where equals but for a single index .
For the third, the real numbers form a measurable space under the smallest -algebra that contains the open intervals; the measurable sets are called Borel sets. Restricting to any measurable subset gives a new measurable space, such as the space of nonnegative reals and the unit interval .
The fourth construction is to make the set of all probability measures on a measurable space into a measurable space, by letting be the smallest -algebra containing the sets for all and .
Let , be measurable spaces. A function is measurable if for .
We can push forward a measure along a measurable function: if is a probability measure on and is a measurable function, then is a probability measure on .
A stochastic relation between measurable spaces and is a function such that is a probability distribution for all , and is measurable for all .
Giving a stochastic relation from to is equivalent to giving a measurable function . Stochastic relations and compose associatively to via the formula
Finally, for a predicate , we use the indicator expression to denote if holds, and otherwise.
3 A first-order language
This section presents a first-order language for expressing Bayesian probabilistic models. The language forms a first-order core of a higher-order extension in Section 6, and provides a simpler setting to illustrate key ideas. The language includes infinitary type and term constructors, constant terms for all measurable functions between measurable spaces, and constructs for specifying Bayesian probabilistic models, namely, operations for sampling distributions, scoring samples, and normalizing distributions based on scores. This highly permissive and slightly unusual syntax is not meant to be a useful programming language itself. Rather, its purpose is to serve as a semantic metalanguage to which a practical programming language compiles, and to provide a mathematical setting for studying high-level constructs for probabilistic computation.
The language has types
where ranges over countable sets. A type stands for a measurable space . For example, denotes the measurable space of reals, is the space of probability measures on , and is the (discrete) measurable space on the singleton set. The other type constructors correspond to products and sums of measurable spaces. Notice that countable sums are allowed, enabling us to express usual ground types in programming languages via standard encoding. For instance, the type for booleans is , and that for natural numbers .
We distinguish typing judgements: for deterministic terms, and for probabilistic terms (see also e.g. Levy et al. ; Park et al. ; Ramsey and Pfeffer ). In both, is a type, and is a list of variable/type pairs. Variables stand for deterministic terms.
Intuitively, a probabilistic term may have two kinds of effects: during evaluation, may sample from a probability distribution, and it may score the current execution trace (typically according to likelihood of data). Evaluating a deterministic term does not have any effects.
Sums and products
The language includes variables, and standard constructors and destructors for sum and product types.
In the rules for sums, may be infinite. In the last rule, is or . We use some standard syntactic sugar, such as and for the injections in the type , and for in that instance.
The language has constant terms for all measurable functions.
In particular, all the usual distributions are in the language, including the Dirac distribution concentrated on outcome , the Gaussian distribution with mean and standard deviation , the Bernoulli distribution with success probability , the exponential distribution with rate , and the Beta distribution with parameters .111Usually, is undefined, but we just let , and so on, to avoid worrying about this sort of error. For example, from the measurable functions , , and we can derive:
The following terms form the core of our language.
The first term samples a value from a distribution . The second multiplies the score of the current trace with . Since both of these terms express effects, they are typed under instead of .
The reader may think of the score of the current execution trace as a state, but it cannot be read, nor changed arbitrarily: it can only be multiplied by another score. The argument in is usually the density of a probability distribution at an observed data point. For instance, recall the example (1) from the Introduction:
An execution trace is scored higher when is closer to the datum .
When the argument in is , this is called a hard constraint, meaning ‘reject this trace’; otherwise it is called a soft constraint. In the discrete setting, hard constraints are more-or-less sufficient, but even then, soft constraints tend to be more efficient.
Two representative tasks of Bayesian inference are to calculate the so-called posterior distribution and model evidence from a prior distribution and likelihood. Programs built from and can be thought of as setting up a prior and a likelihood. Consider the following program:
Here the prior comes from the Bernoulli distribution, and the likelihood concerns datum coming from an exponential distribution with rate . Recall that . So there are two execution traces, returning either , with probability and score , or , with probability and score .
The product of the prior and likelihood gives an unnormalized posterior distribution on the return value: . The normalizing constant is the average score: , so the posterior is . The average score is called the model evidence. It is a measure of how well the model encoded by the program matches the observation.
Note that the sample matches the datum better, so the probability of goes up from to in the posterior.
In our language we have a term that will usually convert a probabilistic term into a deterministic value, which is its posterior distribution together with the model evidence.
If the model evidence is or , the conversion fails, and this is tracked by the ‘’.
4 Denotational semantics
This section discusses the natural denotational semantics of the first-order language. The basic idea can be traced back a long way (e.g. Kozen ) but our treatment of and appear to be novel. As described, types are interpreted as measurable spaces . A context is interpreted as the measurable space of its valuations.
Deterministic terms are interpreted as measurable functions , providing a result for each valuation of the context.
Probabilistic terms are interpreted as measurable functions , providing a probability measure on (score,result) pairs for each valuation of the context.
Informally, if is the prior sample space of the program (specified by ), and is the likelihood (specified by ), and is the return value, then a measure in is found by pushing forward along .
Sums and products
The interpretation of deterministic terms follows the usual pattern of the internal language of a distributive category (e.g. Pitts ). For instance, , and for measurable . This interpretation is actually the same as the usual set-theoretic semantics of the calculus, as one can show by induction that the induced functions are measurable.
For probabilistic terms, we proceed as follows.
which denotes a Dirac distribution, and is
As we will explain shortly, these interpretations come from treating as a commutative monad, which essentially means the following program equations hold.
The last equation justifies a useful program optimisation technique [Wood et al., 2014, §5.5]. (The last two equations require assumptions about free variables of , and , to avoid variable capture.)
We use the monad:
Here are some program equations to illustrate the semantics so far.
We interpret as follows. Each probability measure on induces an unnormalized posterior measure on : let . As long as the average score is not or , we can normalize to build a posterior probability measure on . This construction forms a natural transformation
Let . Here are some examples:
The third equation shows how infinite model evidence errors can arise when working with infinite distributions. In the last equation, the parameter of represents the probability of under . The equation expresses the so called conjugate-prior relationship between Beta and Bernoulli distributions, which has been used to optimise probabilistic programs Yang .
The interpretation of and given above arises from the fact that is a commutative monad on the category of measurable spaces and measurable functions (see also [Doberkat, 2007, §2.3.1], [scibor:haskell, , §6]). Recall that a commutative monad in general comprises an endofunctor together with natural transformations , , satisfying some laws Kock . Using this structure we interpret and following Moggi Moggi :
Concretely, we make into a monad by combining the standard commutative monad structure Giry  on and the commutative monoid with the monoid monad transformer.
(We record a subtle point about (3). It is not a monad morphism, and we we cannot form a commutative monad of all measures because the Fubini property does not hold in general.)
4.1 Sequential Monte Carlo simulation
The program equations above justify some simple program transformations. For a more sophisticated one, consider sequential Monte Carlo simulation Doucet et al. . A key idea of its application to probabilistic programs is: ‘Whenever there is a , it is good to renormalize and resample’. This increases efficiency by avoiding too many program executions with low scores [Paige and Wood, 2014, Algorithm 1].
The denotational semantics justifies the soundness of this transformation. For a term with top-level , i.e. a term of the form where and may have free:
Let us explain the right hand side. Line 1 renormalizes the program after the , and in non-exceptional execution returns the model evidence and a new normalized distribution . Line 2 immediately records the evidence as a score, and then resamples , using the resampled value in the continuation . Line 3 propagates the error of : is a deterministic term of the right type whose choice does not matter. Finally, line 4 detects an infinite evidence error, and undoes the transformation. This error does not arise in most applications of sequential Monte Carlo simulation.
5 Operational semantics
In this section we develop an operational semantics for the first-order language. There are several reasons to consider this, even though the denotational semantics is arguably straightforward. First, extension to higher-order functions is easier in operational semantics than in denotational semantics. Second, operational semantics conveys computational intuitions that are obscured in the denotational semantics. We expect these computational intuitions to play an important role in studying approximate techniques for performing posterior inference, such as sequential Monte Carlo, in the future.
Sampling from probability distributions complicates operational semantics. Sampling from a discrete distribution can immediately affect control flow. For example, in the term
the conditional depends on the result of sampling the Bernoulli distribution. The result is with probability (cf. [Borgström et al., 2013, §2.3]).
Sampling a distribution on introduces another complication. Informally, there is a transition
for every real , but any single transition has zero probability. We can assign non-zero probabilities to sets of transitions; informally:
To make this precise we need a -algebra on the set of terms, which can be done using configurations rather than individual terms. A configuration is a closure: a pair of a term with free variables and an environment giving values for those variables as elements of a measurable space. (See also [Hur et al., 2015, §3], [Borgström et al., 2015, §3].)
Sampling a distribution on exhibits both complications:
The control flow in the distinction depends on which summand is sampled, but there is potentially a continuous distribution over the return values. We handle this by instantiating the choice of summand in the syntax, but keeping the value of the summand in the environment, so that expression (4) can make a step to the closure
A type is indecomposable if it has the form or , and a context is canonical if it only involves indecomposable types.
Let . A -configuration of type is a triple comprising a canonical context , a term , and an element of the measurable space . We identify contexts that merely rename variables, such as
We call d-configurations deterministic configurations, and p-configurations probabilistic configurations; they differ only in typing. We will abbreviate configurations to when is obvious.
Values in a canonical context are well-typed deterministic terms of the form
where is a variable in . Similarly, a probabilistic term in context is called probabilistic value or p-value if for some value . Remember from Section 4 that the denotational semantics of values is simple and straightforward.
Write and for the sets of deterministic and probabilistic configurations of type , and make them into measurable spaces by declaring to be measurable if the set is measurable for all judgements .
Further partition into and based on whether a term in a configuration is a value or not:
Particularly well-behaved values are the ordered values , where each variable appears exactly once, and in the same order as in .
Consider a canonical context , a type , an ordered value , and the induced measurable function
The collection of all such functions for given is countable, and forms a coproduct diagram.
By induction on the structure of types. The key fact is that every type is a sum of products of indecomposable ones, because the category of measurable spaces is distributive, i.e. the canonical map is an isomorphism. ∎
For example, has 3 ordered values, first , second , and third , inducing a canonical measurable isomorphism .
We distinguish three kinds of evaluation contexts: is a context for a deterministic term with a hole for deterministic terms; and are contexts for probabilistic terms, the former with a hole for probabilistic terms, the latter with a hole for deterministic terms.
where , are general terms and is a value.
Using the tools developed so far, we will define a measurable function for describing the reduction of d-configurations, and a stochastic relation for describing reduction of p-configurations:
parameterized by a family of measurable ‘normalization’ functions
indexed by types .
Reduction of deterministic terms
Define a type-indexed family of relations as the least one that is closed under the following rules.
The rule for keeps the original context and the closure because they might be used in the continuation, even though they are not used in . The rules obey the following invariant.
If , then and for some and .
By induction on the structure of derivations.∎
This lemma allows us to confirm that our specification of a relation is well-formed (‘type preservation’).
The induced relation is a measurable function.
There are three things to show: that the relation is entire (‘progress’); that the relation is single-valued (‘determinacy’); and that the induced function is measurable. All three are shown by induction on the structure of terms. The case of application of measurable functions crucially uses Lemma 5.1. ∎
Reduction of probabilistic terms
Next, we define the stochastic relation for probabilistic terms, combining two standard approaches: for indecomposable types, which are uncountable, use labelled Markov processes, i.e. give a distribution on the measurable set of resulting configurations; for decomposable types (sums, products etc.), probabilistic branching is discrete and so a transition system labelled by probabilities suffices.
Let be an indexed family of measurable spaces. Suppose we are given:
a function that is only nonzero on a countable subset , and such that ;
a probability measure on for each .
This determines a probability measure on by
for a measurable subset of ,
We will use three entities to define the desired stochastic relation .
A countably supported probability distribution on the set for each . We write for the probability of .
A probability measure on the space for each and with . Write for the probability of a measurable subset .
A measurable function , representing the score of the one-step transition relation. (For one-step transitions, the score is actually deterministic.)
These three entities are defined by induction on the structure of the syntax of -typed p-configurations in Figure 1. We combine them to define a stochastic relation as follows.