Approximating conditional distributions

Alberto Chiarini ETH Zürich, Department of Mathematics Rämistrasse 101, 8092, Zurich, Switzerland alberto.chiarini@math.ethz.ch Alessandra Cipriani Department of Mathematical Sciences, University of Bath Claverton Down, Bath, BA2 7AY, United Kingdom A.Cipriani@bath.ac.uk  and  Giovanni Conforti CMAP, École Polytechnique Route de Saclay, 91128 Palaiseau Cedex, France. giovanni.conforti@polytechnique.edu
July 27, 2019
Abstract.

In this article, we discuss the basic ideas of a general procedure to adapt the Stein–Chen method to bound the distance between conditional distributions. From an integration-by-parts formula (IBPF), we derive a Stein operator whose solution can be bounded, for example, via ad hoc couplings. This method provides quantitative bounds in several examples: the filtering equation, the distance between bridges of random walks and the distance between bridges and discrete schemes approximating them. Moreover, through the coupling construction for a certain class of random walk bridges we determine samplers, whose convergence to equilibrium is computed explicitly.

The authors thank their previous academic affiliations (CEMPI Lille, the University of Aix-Marseille, the University of Leipzig, and WIAS Berlin), as well as the Max-Planck-Institut Leipzig, where part of this research was carried out. The second author is supported by the EPSRC grant EP/N004566/1.

1. Introduction

Stein’s method is a powerful tool to determine quantitative approximations of random variables in a wide variety of contexts. It was first introduced by Stein (1972) and developed by Chen (1975a, b), which is why it is often called the Stein–Chen method. Stein originally implemented it for a central limit approximation, but his idea later found a much wider range of applications. Indeed, the method has a big advantage over several other techniques, in that it can be used for approximation in terms of any distribution on any space, and moreover does not require strong independence assumptions. Enhanced with auxiliary randomization techniques such as the method of exchangeable pairs (Stein, 1986) and with the so-called generator interpretation (Barbour, 1988), the Stein–Chen method has had a tremendous impact in the field of probability theory. Its applications range from Poisson point process approximation (Barbour and Brown, 1992) to normal approximation (Bolthausen, 1984, Götze, 1991), eigenfunctions of the Laplacian on a manifold (Meckes, 2009a), logarithmic combinatorial structures (Arratia et al., 2003), diffusion approximation (Barbour, 1990), statistical mechanics (Chiarini et al., 2015, Eichelsbacher and Reinert, 2008), and Wiener chaos and Malliavin calculus (Nourdin and Peccati, 2009, 2010). For a more extensive overview of the method, we refer the reader to Barbour and Chen (2005, 2014).

In this article we are interested in comparing conditional distributions. That is, given two laws on the same probability space and an observable, we aim at bounding the distance between the corresponding conditional laws. This task may be quite demanding, even when the unconditional laws are well understood. Here, relying on some simple but quite general observations on conditioning, we propose a way of adapting Stein’s method to conditional laws. In particular, we obtain a fairly general scheme to construct a characteristic (Stein) operator for a conditional law, provided that the behavior of the unconditional law under certain information-preserving transformations is known. The final estimates, obtained with the classical Stein’s method, are quantitative. They are therefore very useful when one wants to implement simulations of stochastic processes with a precise error rate. We will see one such example concerning random walk bridges, where we characterize the law of the bridge as the invariant distribution of a stochastic process on path space. One can in principle use such dynamics, and the related estimates for convergence to equilibrium, to sample the distribution of the bridge.

To keep our paper self-contained and to better explain this procedure, let us recall some basic notions of Stein’s method.

1.1. Generalities on Stein’s method

We consider a probability metric of the form

(1.1)

where , are probability measures on a Polish space with the Borel -algebra , and is a set of real-valued functions large enough so that is indeed a metric. Natural choices for are the set of indicator functions of measurable subsets of , which gives the total variation distance, and the set of -Lipschitz functions, which defines the first-order Wasserstein–Kantorovich distance. Next, we consider a probability measure on which is completely characterized by a certain operator acting on a class of functions from to . That is,

if and only if . The operator is called the characteristic operator, or Stein operator.
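
As a concrete illustration, consider the classical example of Chen (1975a), written here in generic notation of our choosing ($\lambda>0$ a rate, $X$ the canonical random variable, $f$ a bounded test function): a law $\mu$ on $\mathbb{N}$ is the Poisson distribution of parameter $\lambda$ if and only if
$$\mathbb{E}_{\mu}\bigl[\lambda f(X+1)-X f(X)\bigr]=0 \qquad \text{for all bounded } f,$$
so that the corresponding characteristic operator is $\mathcal{A}f(x)=\lambda f(x+1)-x f(x)$.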

Now suppose that we are able to solve the following equation for any given datum :

(1.2)

and call the solution . Then, by integrating (1.2) with respect to and taking the supremum over , we obtain

(1.3)

A closer look at (1.3) tells us that we will be able to estimate the distance between and by a careful analysis of the Stein operator. Of course, this discussion is worthwhile only if the right-hand side of (1.3) is easier to bound than the left-hand side, which often turns out to be the case.
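
To spell out the passage from (1.2) to (1.3), let us use generic symbols of our own ($\mu$ for the reference measure with Stein operator $\mathcal{A}$, $\nu$ for the law to be compared, $f_h$ for the solution of (1.2) with datum $h$). In its standard form the Stein equation reads $\mathcal{A}f_h=h-\mathbb{E}_{\mu}[h]$; integrating it against $\nu$ gives
$$\mathbb{E}_{\nu}[h]-\mathbb{E}_{\mu}[h]=\mathbb{E}_{\nu}\bigl[\mathcal{A}f_h\bigr],\qquad\text{hence}\qquad d_{\mathcal{H}}(\nu,\mu)=\sup_{h\in\mathcal{H}}\bigl|\mathbb{E}_{\nu}[\mathcal{A}f_h]\bigr|,$$
which is the content of (1.3).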

Observe that the mere fact that we ask for existence of solutions to (1.2) for tells us that the operator is characterizing for . Indeed,

which implies , since otherwise as defined above would not be a metric.

Remark 1.1.

The method becomes particularly effective when both measures have a characterizing operator and one is a “perturbation” of the other. Say that is characterized by and by . Then using that we get

which tells us that and are close if their characterizing operators are close.
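
In the same generic notation as above (writing $\mathcal{A}$ for the operator characterizing the first measure $\mu$ and $\mathcal{B}$ for the one characterizing the second measure $\nu$; the symbols are ours), the computation behind this remark is
$$d_{\mathcal{H}}(\nu,\mu)=\sup_{h\in\mathcal{H}}\bigl|\mathbb{E}_{\nu}[\mathcal{A}f_h]\bigr|=\sup_{h\in\mathcal{H}}\bigl|\mathbb{E}_{\nu}\bigl[(\mathcal{A}-\mathcal{B})f_h\bigr]\bigr|,$$
where the second equality uses $\mathbb{E}_{\nu}[\mathcal{B}f_h]=0$; the distance is thus controlled by how much the two operators differ along Stein solutions.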

1.2. Outline of the method

To keep things simple, we assume in this introduction to be at most countable, that the support of is and that . However, at the price of additional technicalities, the same principles remain valid in more general setups, such as those considered in this article. In this section we do not give rigorous proofs but rather outline some general ideas, which then have to be implemented ad hoc in the cases of interest.

  1. If is such that , then -almost surely we have

    (1.4)

    Thus, if we have the Radon–Nikodym derivative for the unconditional laws, we also have it for the conditional laws upon the computation of a normalisation constant. This can be shown rigorously and we refer the reader to Pap and van Zuijlen (1996, Lemma 1) for a precise statement of (1.4). Although the normalisation constant in (1.4) may be quite hard to compute, such a computation is never required for our method to work.

  2. If is an injective transformation which preserves the information, i.e.

    then -almost surely we have

    We can rephrase this by saying that if one has a change-of-measure formula for under , i.e. for all bounded

    (1.5)

    then the same formula is valid for the conditional law

    Indeed, as is easy to see, , where is a left inverse of . (A schematic version of observations 1 and 2, in generic notation, is sketched right after this list.)
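
A minimal schematic rendering of both observations, in generic notation that is entirely ours ($\mathrm{P},\mathrm{Q}$ two laws with $\mathrm{Q}\ll\mathrm{P}$, $Y$ the observable, $\tau$ an injective map preserving the information carried by $Y$, and $\rho_\tau$ the change-of-measure density appearing in (1.5)), reads as follows:
$$\text{(1)}\qquad \frac{d\,\mathrm{Q}(\cdot\mid Y=y)}{d\,\mathrm{P}(\cdot\mid Y=y)}=\frac{1}{Z_y}\,\frac{d\mathrm{Q}}{d\mathrm{P}},\qquad Z_y=\mathbb{E}_{\mathrm{P}}\Bigl[\tfrac{d\mathrm{Q}}{d\mathrm{P}}\Bigm| Y=y\Bigr],$$
$$\text{(2)}\qquad \mathbb{E}_{\mathrm{P}}\bigl[f\circ\tau\bigr]=\mathbb{E}_{\mathrm{P}}\bigl[\rho_\tau\,f\bigr]\ \text{for all bounded }f\quad\Longrightarrow\quad \mathbb{E}_{\mathrm{P}}\bigl[f\circ\tau\mid Y\bigr]=\mathbb{E}_{\mathrm{P}}\bigl[\rho_\tau\,f\mid Y\bigr]\ \text{a.s.}$$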

Let us now see how 1 and 2 are useful for our purposes. We assume that is a family of injective transformations of such that for all the change of measure formula (1.5) is known explicitly for . For instance, one might think of the case when is the Wiener measure and is the family of translations by Cameron–Martin paths.

Then, by concatenating different formulas, it is possible to deduce (1.5) for , where

(1.6)

In the example of Brownian motion, obviously . However, there are situations where , and the elements in are those which we use for the construction of the characteristic operator. A toy example for this is given in Subsection 1.3; more elaborate examples are in Section 3.
If is the subset of transformations which preserve the observation, then 2 tells us that for all bounded and

(1.7)

If is large enough to span the whole space, in the sense that for all with there exist such that

(1.8)

then (1.7) together with the obvious requirement that is indeed a characterization of . Clearly, the smaller , the better the characterization. The construction of a characteristic operator is now straightforward and follows a kind of “randomization” procedure. That is, for fixed we consider (1.7) with , and then sum over . We arrive at

for all functions . Thus, the characteristic operator is

(1.9)

which is the generator of a continuous-time Markov chain whose dynamics are the following:

  • Once at state , the chain waits for an exponential random time of parameter and then jumps to a new state.

  • The new state is chosen according to the following law:

Once the characteristic operator has been found, it is possible to follow the classical ideas of Stein’s method to bound the distance between the conditional laws. Let us remark that the explicit description of the dynamics associated to the Markov generator turns out to be very useful in order to bound the derivatives of the solution to by means of couplings. In Section 3 in the context of random walk bridges we construct some ad hoc couplings, which may be of independent interest and, we believe, are among the novelties of this article.
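
For later reference, let us also record the generic shape of an operator of the type (1.9), matching the dynamics described in the two bullet points above; the rate notation $c_\tau$ is ours. A pure-jump generator of this kind acts on bounded $f$ as
$$\mathcal{A}f(x)=\sum_{\tau\in\mathcal{T}_Y}c_\tau(x)\bigl[f(\tau(x))-f(x)\bigr],$$
so that at state $x$ the chain waits an exponential time of parameter $c(x)=\sum_{\tau}c_\tau(x)$ and then jumps to $\tau(x)$ with probability $c_\tau(x)/c(x)$.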

The use of observation 1 is to “bootstrap” a characteristic operator for a conditional distribution provided we know one for another, typically simpler, conditional distribution. Indeed, assume the knowledge of the density and of a characteristic operator for in the form (1.9). Since satisfies a kind of product rule

(1.10)

with

then we can write, for all ,

where we used that is the reversible measure for in order to write in the third equality. Thus, the operator is a characteristic operator for . Clearly, the operator in (1.10) depends strongly on the underlying space and on the operator , and is typically easier to handle in continuous rather than in discrete spaces. For instance, when is a diffusion operator, it is well known that , so that . In Section 2 we will use the procedure just described in the context of filtering.
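
To make the diffusion case concrete, in notation of our choosing: if $\mathcal{A}$ is a diffusion generator with reversible measure $\mu$ and carré du champ $\Gamma$, and if $d\nu=\rho\,d\mu$, then reversibility gives $\mathbb{E}_{\mu}[\rho\,\mathcal{A}f]=-\mathbb{E}_{\mu}[\Gamma(\rho,f)]$, and therefore the operator
$$\mathcal{B}f:=\mathcal{A}f+\Gamma(\log\rho,f)$$
satisfies $\mathbb{E}_{\nu}[\mathcal{B}f]=0$ for all suitable $f$, i.e. it is a characteristic operator for $\nu$.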

Conditional equivalence

Let us reformulate 1 in a slightly more precise way.

  1. If is such that , and takes the form

    for some and , then -almost surely we have

In addition to what could be deduced from 1, we can see that there may be different probabilities whose conditional laws are equal. It suffices that the density is measurable with respect to the observation, i.e.

for some . This is not so surprising, since conditioning is often seen as a kind of projection. Several explicit examples of conditional equivalence are known, especially for bridges, see for instance Benjamini and Lee (1997), Clark (1991), Conforti and Léonard (2016), Fitzsimmons (1998). These considerations suggest that whatever bound is obtained for conditional probabilities, it has to be compatible with this equivalence in order to be satisfactory. That is, if it is of the form

for some metric on the space of probability measures, then the “function” has to be such that

whenever are conditionally equivalent in the sense above. A nice feature of the bounds we propose in this article is that they comply with the compatibility requirement.
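
In the generic notation used earlier, the mechanism is a one-line consequence of (1.4): if $d\mathrm{Q}/d\mathrm{P}=h(Y)$ for some measurable $h$, then the normalisation constant in (1.4) equals $\mathbb{E}_{\mathrm{P}}[h(Y)\mid Y=y]=h(y)$, so the conditional density is identically $1$ and $\mathrm{Q}(\cdot\mid Y=y)=\mathrm{P}(\cdot\mid Y=y)$.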

1.3. A toy example: Poisson conditioned on the diagonal

To illustrate the previous ideas more concretely, we shall describe the special case of a two-dimensional vector with Poisson components conditioned to be on the diagonal of . Even though the computations are quite straightforward, this example can be considered paradigmatic, since it contains the key ideas behind our method.

Finding the characteristic operator

Let and so that for this example . Let us set the observable . We are interested in . Notice that such a conditional law can be computed explicitly:

(1.11)

with the modified Bessel function of the first kind. However, the knowledge of the distribution will not be needed below. Our goal is to find a characteristic operator for the conditional probability by exploiting observation 2. For this, consider the family of injections with , for . The change-of-measure formulas (1.5) are well known, see (Chen, 1975a):

(1.12)

for every bounded function . However, neither nor is information preserving, i.e. for . This is an example where iterating the formulas (1.12) helps in producing new ones, which in turn can be used to characterize the conditional law. Indeed, in the current setup, defined in (1.6) is

and preserves the information when . The change of measure formula for under is easily derived by concatenating (1.12) for : for all bounded

Moreover, one can check that the set is connecting in the sense of (1.8). Thus, the conditional law is characterized by the change of measure formula

for bounded, and the Stein operator for it is

With a slight abuse of notation we identify with its push-forward through the map and regard it as a measure on . In this case acts on bounded and reads

(1.13)

i.e. is the generator of a birth-death chain with birth rate and death rate .
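
Since sampling conditional laws is one of the motivations mentioned in the Introduction, the following is a minimal simulation sketch of the birth-death dynamics just described. It assumes, as placeholder choices of ours, that the two Poisson parameters are called lam1 and lam2 and that the chain has birth rate lam1*lam2 and death rate n**2 at state n; it compares the long-run law of the chain with a brute-force rejection sampler for the conditioned vector, and the two estimates should agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
lam1, lam2 = 2.0, 3.0  # placeholder Poisson parameters

def conditioned_by_rejection(n_samples):
    """Brute-force sampler of X1 given X1 == X2, with X1, X2 independent Poissons."""
    out = []
    while len(out) < n_samples:
        x1 = rng.poisson(lam1, 10_000)
        x2 = rng.poisson(lam2, 10_000)
        out.extend(x1[x1 == x2].tolist())
    return np.array(out[:n_samples])

def birth_death_at_time(t_end, start=0):
    """Run the chain with birth rate lam1*lam2 and death rate n**2 up to time t_end."""
    n, t = start, 0.0
    while True:
        birth, death = lam1 * lam2, float(n) ** 2
        total = birth + death
        t += rng.exponential(1.0 / total)  # exponential waiting time at state n
        if t > t_end:
            return n
        n = n + 1 if rng.random() < birth / total else n - 1  # choose the jump

rejection = conditioned_by_rejection(20_000)
chain = np.array([birth_death_at_time(t_end=30.0) for _ in range(2_000)])
print("rejection sampler mean:", rejection.mean())
print("chain (t=30) mean     :", chain.mean())
```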

Bounding the distance

Assume we have two other parameters , that and that we wish to bound, say, the 1-Wasserstein distance . This situation falls within the framework of Remark 1.1; indeed a characteristic operator for can be obtained as we did for . Therefore we can deduce the following result.

Lemma 1.2.

For all we have

Proof.

Let be the solution of the Stein equation with input datum a 1-Lipschitz function . We can bound

where is the generator (1.13) with in place of . Hence

(1.14)

In the last line we have used the bound on the gradient of the Stein solution ; this bound can be deduced from Proposition 3.14, which we prove later on in the article with a coupling argument. ∎
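
In placeholder notation that is ours (write $\lambda_1,\lambda_2$ and $\tilde\lambda_1,\tilde\lambda_2$ for the two pairs of parameters, $\nu,\tilde\nu$ for the corresponding conditional laws, $\mathcal{A},\tilde{\mathcal{A}}$ for the two generators of the form (1.13), and $f_h$ for the Stein solution), the chain of estimates behind (1.14) is the one of Remark 1.1. The death rates of the two generators coincide, so only the birth terms survive in the difference:
$$\mathcal{W}_1(\nu,\tilde\nu)=\sup_{h}\bigl|\mathbb{E}_{\tilde\nu}[\mathcal{A}f_h]\bigr|=\sup_{h}\bigl|\mathbb{E}_{\tilde\nu}\bigl[(\mathcal{A}-\tilde{\mathcal{A}})f_h\bigr]\bigr|=\bigl|\lambda_1\lambda_2-\tilde\lambda_1\tilde\lambda_2\bigr|\,\sup_{h}\bigl|\mathbb{E}_{\tilde\nu}\bigl[f_h(\cdot+1)-f_h\bigr]\bigr|,$$
with the suprema running over 1-Lipschitz $h$; the last factor is controlled by the gradient bound from Proposition 3.14.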

Finally, let us observe that the bound obtained is compatible with what is known about conditional equivalence, as mentioned in 1. Indeed, in Conforti (2015, Example 4.3.1) it is shown that if and only if .

Structure of the paper.

The paper consists of two main parts. Section 2 is devoted to the study of the classical one-dimensional filtering problem. We present the setup and preparatory results in Subsections 2.1-2.3 and prove our main theorems in Subsections 2.4 and 2.5.

In Section 3 we are concerned with the study of bridges of random walks. We begin by considering random walks on the hypercube in Subsection 3.1. We then pass to the random walk on the Euclidean lattice in Subsection 3.2, and extend the results to homogeneous and non-homogeneous (Subsection 3.3) jump rates. We conclude by analysing the speed of convergence of a scheme approximating the continuous-time simple random walk in Subsection 3.4.

Notation

We write . We denote by , for a metric space, the space of càdlàg paths on for the topology induced by . denotes the -Wasserstein distance. When we have a piecewise-constant trajectory we use the notation . The set of smooth and bounded functions on a set is denoted by . For functions , we will use the abbreviation . The set of non-negative reals is denoted by . The set of all probability measures on a measurable space shall be denoted by . The maximum between is denoted as , and the minimum . Given two measurable spaces and , the notation denotes the push-forward of the measure through the measurable function .

2. Filtering problem

2.1. Setup and main result

The model.

In this section, we consider the filtering problem in one dimension (see for example Øksendal (2013, Chapter 6)). In this classical problem, one is interested in estimating the state of a -dimensional diffusion process, the signal, given the trajectory of another stochastic process, the observation, which is obtained by applying a random perturbation to the signal. More precisely, fix a time horizon , and denote by the set of continuous functions defined on with values in and vanishing at zero. We denote by the canonical process in , which we endow with the canonical filtration.

We consider a first system of signal and observation whose law on the space is governed by the system of equations

(2.1)

We will call such a system the linear one. We then consider a second system whose law is characterized by the SDE

(2.2)

and call it the non-linear system. Here are independent one-dimensional standard Brownian motions under resp. . In filtering one is concerned with the study of the conditional laws , defined by

where lies in a subset of such that both conditional laws are well defined. It is not hard to see that there exists a subset of full measure for the Wiener measure where can be chosen. Typical quantities of interest are the conditional mean, also known as the filter, and the conditional variance. Since explicit calculations can be done only for the linear case and a few others, it is common in applications to approximate systems such as (2.2) through linear ones such as (2.1). This chiefly allows one to “forget” the drift, which naturally complicates the control of the conditional laws.
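
For orientation, one representative instance of the pair (2.1)-(2.2) that is consistent with the discussion above, but whose symbols and coefficients are chosen by us purely for illustration ($X$ the signal, $Y$ the observation, $b$ the drift, $B,W$ the driving Brownian motions), would be
$$\text{(linear)}\qquad dX_t=dB_t,\qquad dY_t=X_t\,dt+dW_t,$$
$$\text{(non-linear)}\qquad dX_t=b(X_t)\,dt+dB_t,\qquad dY_t=X_t\,dt+dW_t,$$
so that passing from the non-linear to the linear system indeed amounts to “forgetting” the drift $b$.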

Quantifying the error in the linear approximation.

Our goal is to understand how large an error we make by neglecting the drift. Thus, for a given we aim at finding bounds for , where is the 1-Wasserstein distance associated to the supremum norm on . Although some assumptions on have to be made to provide concrete bounds, we stress that our aim is to look at cases outside the asymptotic regime where is a small perturbation. In fact, our analysis extends up to the case when grows sublinearly. Since we work with distances on the path space, our results allow us to go well beyond the one-point marginals, and they apply to a much wider class of functionals than the conditional mean. What we can say is that, under a sublinear growth assumption on , the approximation error can be given explicitly and depends on the behavior of the drift and its derivatives up to second order (Theorems 2.1-2.2).

Notation.

We shall denote by the mean of the Gaussian process , that is, and by its covariance, i.e. . When , we simply write . We define to be the centered version of , that is, . Finally, we use the constant to normalize a measure, and note that it may vary from occurrence to occurrence. It will be clear from the context that we are not referring to the observation process .

Main result.

We assume that is twice continuously differentiable and

  1. there exists a constant and such that for all

    (2.3)
  2. there exists a constant such that .

In the following results we shall distinguish between the cases and . Under the above-stated conditions we are able to prove the following bounds.

Theorem 2.1.

Let . Almost surely in the random observation

(2.4)

where

  1. ,

  2. is the maximal positive root of the polynomial

    (2.5)

    with

    (2.6)
Theorem 2.2.

Let . Almost surely in the random observation

where

  1. the constants are defined by

    (2.7)
  2. is the largest positive root of the polynomial

    with being defined by

    and

Remark 2.3 (The bound is explicit).

The bounds in Theorem 2.1 and Theorem 2.2 are given in terms of the conditional mean and covariances for the linear system, and the constants from the hypothesis. Note that the functions and can be calculated explicitly, using Hairer et al. (2005, Theorem 4.1 and Lemma 4.3). We have

Some explanation is due concerning and . For , some simple algebraic manipulations allow one to get explicit bounds as a function of and . Concerning , we observe that it is independent of and that, drawing on the large literature about maxima of centered Gaussian random variables, several bounds for it can be derived. Thus the estimate in Theorem 2.1 is completely explicit.

Remark 2.4 (A remark on the density bounds of Zeitouni (1988)).

In the vast literature on filtering, especially relevant to our work is Zeitouni (1988, Theorem 1 and the following Remark), which proves bounds for the unnormalised one-time marginal density. These may in fact be an alternative starting point to prove approximation results such as the ones we present. Although these bounds are available in a more general setting than the one considered in Theorem 2.1, to obtain a quantitative result one must deal with the normalisation constant and estimate it. Typically, good bounds for such a constant are very hard to obtain unless one works in an asymptotic regime, whereas our approach is independent of normalisations, as pointed out in the Introduction. Moreover, our approximation results cover more than the one-time marginals.

Outline of the proof

The proof is carried out by comparing a Stein operator for the linear filter with one for the non-linear filter, following Remark 1.1. Since the covariance structure and mean of the Gaussian process can be given explicitly, a Stein operator is readily obtained following Meckes (2009b). However, for the sake of completeness, we will also provide an alternative derivation of this result, following point 1. A Stein operator for can then be obtained from a Stein operator for and Girsanov's theorem, thus following 2. Once we have the Stein operators, we need to estimate their difference, which involves studying the moments of the canonical process under . Note that Stein operators for both the linear and the non-linear filter may be deduced from Hairer et al. (2005), Hairer et al. (2007); however, we will work with different characteristic operators, which naturally generalize the finite-dimensional approach of Meckes (2009b). We distinguish two cases, according to whether the exponent is larger or smaller than . This is due to the fact that for the quantity is bounded, and therefore only an estimate on the one-time marginal is needed. In the complementary case, instead, the estimate involves the whole trajectory; therefore we have to introduce a norm on the path space to evaluate the required moments.

2.2. Linear filter

For the linear case many results are already at our disposal. We think chiefly of Hairer et al. (2005), which gives formulas for the conditional mean and covariance, and characterizes as the invariant measure of an SPDE. For the sake of completeness, we would like to sketch how one can obtain the formulas for conditional means and covariances using the observations at the basis of this article. To simplify the exposition we restrict our attention to the finite-dimensional case, determining the conditional distribution of a multivariate Gaussian.

Let , , and be a Gaussian law on . We denote as the typical element of , the inner product on and the inner products on and respectively. The covariance matrix and mean of are, in block form,

Let us also define the matrix , for which we adopt the block notation as well

The following integration-by-parts formula can be seen as the “limit” as of the change of measure (1.5) for . For all directions of differentiation and test functions it holds that (Meckes, 2009b, Lemma 1 (1))

If we want to study , we look at the transformations associated to vectors of the form . Using the notation above, the integration by parts can be rewritten for one such vector as

According to the general paradigm (namely 1), this formula characterises . Upon setting , it holds that

From this we deduce that is a Gaussian with mean and inverse covariance matrix . Using standard results for inverting block matrices we obtain that the mean of is

and its covariance matrix is

The same result is derived in greater generality in Hairer et al. (2005, Lemma 4.3).
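
For the reader's convenience, we record the standard formulas this computation leads to, in block notation of our own: if $(X,Y)$ is a Gaussian vector with mean $(m_X,m_Y)$ and covariance blocks $\Sigma_{XX},\Sigma_{XY},\Sigma_{YX},\Sigma_{YY}$ (with $\Sigma_{YY}$ invertible), then the conditional law of $X$ given $Y=y$ is Gaussian with
$$\text{mean}=m_X+\Sigma_{XY}\Sigma_{YY}^{-1}(y-m_Y),\qquad\text{covariance}=\Sigma_{XX}-\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}.$$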

2.3. Non-linear filter

2.3.1. Lifting the Stein operator via densities from the linear to the non-linear filter

As we saw in the Introduction, probability ratios are preserved by conditioning, and point 1 informally states that Radon–Nikodym derivatives of conditional measures can be found easily once we know those of the unconditional laws. In the context of the linear and non-linear filters, point 1 translates into the following.

Lemma 2.5 (Girsanov theorem for filters).

The following holds for almost every :

(2.8)

where is a primitive of and .

This lemma is not an original result of this article; see for instance Zeitouni (1988, Eq. (2.5) and Eq. (2.6)). For this reason, we omit its proof.

2.3.2. Stein equation for the non-linear filter

Let . We say that a function is -Lipschitz if

Let be the set of smooth cylindrical functionals with bounded second derivative defined by

and let be the set of functions in that are also -Lipschitz. We set for any and for any ,

(2.9)

As a remark, it is immediate to see that any is twice Fréchet differentiable in and that the derivatives correspond to those in (2.9).
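
As an illustration of (2.9), in notation chosen by us: a typical smooth cylindrical functional has the form $F(\omega)=\varphi(\omega_{t_1},\dots,\omega_{t_n})$ for finitely many times $0\le t_1<\dots<t_n\le T$ and a smooth $\varphi$ with bounded second derivatives, and its first two derivatives in the directions $h,k$ read
$$DF(\omega)[h]=\sum_{i=1}^{n}\partial_i\varphi(\omega_{t_1},\dots,\omega_{t_n})\,h_{t_i},\qquad D^2F(\omega)[h,k]=\sum_{i,j=1}^{n}\partial_i\partial_j\varphi(\omega_{t_1},\dots,\omega_{t_n})\,h_{t_i}k_{t_j}.$$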

Recalling that is the mean of , we define for any the operator

(2.10)

where the expectation is taken with respect to , that is,

Lemma 2.6.

In the above setting, the following hold.

  1. satisfies the integration-by-parts formula

    (2.11)

    for all . In particular, for all .

  2. Let be such that . Then the equation

    admits as solution

    (2.12)

    Moreover, .

  3. satisfies the formula

    (2.13)

    for all , where is defined by

Proof.

Throughout the proof, fix with such that . Furthermore, set , , and .

Let us start with the proof of 1. If we define it is seen, using (2.9), that