Conjugacy properties of time-evolving Dirichlet and gamma random measures


Omiros Papaspiliopoulos
ICREA and Universitat Pompeu Fabra
Matteo Ruggiero
University of Torino and Collegio Carlo Alberto
Dario Spanò
University of Warwick
July 3, 2019

We extend classic characterisations of posterior distributions under Dirichlet process and gamma random measures priors to a dynamic framework. We consider the problem of learning, from indirect observations, two families of time-dependent processes of interest in Bayesian nonparametrics: the first is a dependent Dirichlet process driven by a Fleming–Viot model, and the data are random samples from the process state at discrete times; the second is a collection of dependent gamma random measures driven by a Dawson–Watanabe model, and the data are collected according to a Poisson point process with intensity given by the process state at discrete times. Both driving processes are diffusions taking values in the space of discrete measures whose support varies with time, and are stationary and reversible with respect to Dirichlet and gamma priors respectively. A common methodology is developed to obtain in closed form the time-marginal posteriors given past and present data. These are shown to belong to classes of finite mixtures of Dirichlet processes and gamma random measures for the two models respectively, yielding conjugacy of these classes to the type of data we consider. We provide explicit results on the parameters of the mixture components and on the mixing weights, which are time-varying and drive the mixtures towards the respective priors in the absence of further data. Explicit algorithms are provided to recursively compute the parameters of the mixtures. Our results are based on the projective properties of the signals and on certain duality properties of their projections.

Keywords: Bayesian nonparametrics, Dawson–Watanabe process, Dirichlet process, duality, Fleming–Viot process, gamma random measure.

MSC Primary: 62M05, 62M20. Secondary: 62G05, 60J60, 60G57.

1 Introduction

1.1 Motivation and main contributions

An active area of research in Bayesian nonparametrics is the construction and the statistical learning of so-called dependent processes. These aim at accommodating weaker forms of dependence than exchangeability among the data, such as partial exchangeability in the sense of de Finetti. The task is then to let the infinite-dimensional parameter, represented by a random measure, depend on a covariate, so that the generated data are exchangeable only conditionally on the same covariate value, but not overall exchangeable. This approach was inspired by MacEachern (1999; 2000) and has received considerable attention since.

In the context of this article, the most relevant strand of this literature attempts to build time evolution into standard random measures for semi-parametric time-series analysis, combining the merits of flexible exchangeable modelling afforded by random measures with those of mainstream generalised linear and time series modelling. For the case of Dirichlet processes, the reference model in Bayesian nonparametrics introduced by Ferguson (1973), the time evolution has often been built into the process by exploiting its celebrated stick-breaking representation (Sethuraman, 1994). For example, Dunson (2006) models the dependent process as an autoregression with Dirichlet distributed innovations, Caron et al. (2008) models the noise in a dynamic linear model with a Dirichlet process mixture, Caron, Davy and Doucet (2007) develops a time-varying Dirichlet mixture with reweighing and movement of atoms in the stick-breaking representation, and Rodriguez and ter Horst (2008) induces the dependence in time only via the atoms in the stick-breaking representation, by making them into a heteroskedastic random walk. See also Caron and Teh (2012); Caron et al. (2016); Griffin and Steel (2006); Gutierrez et al. (2016); Mena and Ruggiero (2016). The stick-breaking representation of the Dirichlet process has demonstrated its versatility for constructing dependent processes, but makes it hard to derive any analytical information on the posterior structure of the quantities involved. Parallel to these developments, random measures have been combined with hidden Markov time series models, either for allowing the size of the latent space to evolve in time using transitions based on a hierarchy of Dirichlet processes, e.g. Beal, Ghahramani and Rasmussen (2002), Van Gael, Saatci, Teh and Ghahramani (2008), Stepleton, Ghahramani, Gordon and Lee (2009) and Zhang, Zhu and Zhang (2014), or for building flexible emission distributions that link the latent states to data, e.g. Yau, Papaspiliopoulos, Roberts and Holmes (2011), Gassiat and Rousseau (2016).

From a probabilistic perspective, there is a fairly canonical way to build stationary processes with marginal distributions specified as random measures using stochastic differential equations. This more principled approach to building time series with given marginals has been well explored, both probabilistically and statistically, for finite-dimensional marginal distributions, either using processes with discontinuous sample paths, as in Barndorff-Nielsen and Shephard (2001) or Griffin (2011), or using diffusions, as we undertake here. The relevance of measure-valued diffusions in Bayesian nonparametrics was pioneered in Walker et al. (2007), whose construction naturally allows for separate control of the marginal distributions and the memory of the process.

The statistical models we investigate in this article, introduced in Section 2, can be seen as instances of what we call hidden Markov measures, since the models are formulated as hidden Markov models where the latent, unobserved signal is a measure-valued infinite-dimensional Markov process. The signal in the first model is the Fleming–Viot (FV) process, denoted $X = (X_t)_{t\ge0}$ and defined on some state space $\mathcal{X}$ (also called type space in population genetics), which admits the law of a Dirichlet process on $\mathcal{X}$ as marginal distribution. At times $t_0 < t_1 < \cdots$, conditionally on $X_{t_i} = x$, observations are drawn independently from $x$, i.e.,

$Y_{t_i,1}, \ldots, Y_{t_i,m_i} \mid X_{t_i} = x \ \overset{\text{iid}}{\sim}\ x.$   (1)

Hence, this statistical model is a dynamic extension of the classic Bayesian nonparametric model for unknown distributions of Ferguson (1973) and Antoniak (1974). The signal in the second model is the Dawson–Watanabe (DW) process, denoted $Z = (Z_t)_{t\ge0}$ and also defined on $\mathcal{X}$, which admits the law of a gamma random measure as marginal distribution. At times $t_0 < t_1 < \cdots$, conditionally on $Z_{t_i} = z$, the observations are a Poisson point process on $\mathcal{X}$ with random intensity $z$, i.e., for any collection of disjoint Borel sets $A_1, \ldots, A_K$,

$Y_{t_i}(A_j) \mid Z_{t_i} = z \ \overset{\text{ind}}{\sim}\ \mathrm{Po}(z(A_j)), \qquad j = 1, \ldots, K.$

Hence, this is a time-evolving Cox process and can be seen as a dynamic extension of the classic Bayesian nonparametric model for Poisson point processes of Lo (1982).

The Dirichlet and the gamma random measures, used as Bayesian nonparametric priors, have conjugacy properties to observation models of the type described above, which have been exploited both for developing theory and for building simulation algorithms for posterior and predictive inference. These properties, reviewed in Sections 2.1.1 and 2.2.1, have propelled the use of these models into mainstream statistics, and have been used directly in simpler models or to build updating equations within Markov chain Monte Carlo and variational Bayes computational algorithms in hierarchical models.

In this article, for the first time, we show that the dynamic versions of these Dirichlet and gamma models also enjoy certain conjugacy properties. First, we formulate such models as hidden Markov models where the latent signal is a measure-valued diffusion and the observations arise at discrete times according to the mechanisms described above. We then obtain that the filtering distributions, that is, the laws of the signal at each observation time conditionally on all data up to that time, are finite mixtures of Dirichlet and gamma random measures respectively. We provide a concrete posterior characterisation of these marginal distributions and explicit algorithms for the recursive computation of the parameters of these mixtures. Our results show that these families of finite mixtures are closed with respect to the Bayesian learning in this dynamic framework, and thus provide an extension of the classic posterior characterisations of Antoniak (1974) and Lo (1982) to time-evolving settings.

The techniques we use to establish the new conjugacy results are detailed in Section 4, and build upon three aspects: the characterisations of Dirichlet and gamma random measures through their projections; certain results on measure-valued diffusions related to their time-reversal; and some very recent developments in Papaspiliopoulos and Ruggiero (2014) that relate optimal filtering for finite-dimensional hidden Markov models with the notion of duality for Markov processes, reviewed in Section 4.1.

(Figure 1 diagram: the measure-valued signal and its finite-dimensional projection are each propagated in time ("time propagation" arrows), and are connected vertically by "projection" and "characterisation" arrows.)

Figure 1: Scheme of the general argument for obtaining the filtering distribution of hidden Markov models with FV and DW signals, proved in Theorems 3.2 and 3.5. In this figure, $X_t$ is the latent measure-valued signal. Given data $\mathbf{y}$, the future distribution of the signal at time $t$ is determined by taking its finite-dimensional projection onto an arbitrary partition $A_1, \ldots, A_K$ of $\mathcal{X}$, evaluating the relative propagation at time $t$, and exploiting the projective characterisation of the filtering distributions.

Figure 1 schematises, from a high-level perspective, the strategy for obtaining our results. In a nutshell, the essence of our theoretical results is that the operations of projection and propagation of measures commute. More specifically, we first exploit the characterisation of the Dirichlet and gamma random measures via their finite-dimensional distributions, which are Dirichlet and independent gamma distributions respectively. Then we exploit the fact that the dynamics of these finite-dimensional distributions induced by the measure-valued signals are the Wright–Fisher (WF) diffusion and a multivariate Cox–Ingersoll–Ross (CIR) diffusion. Then, we extend the results in Papaspiliopoulos and Ruggiero (2014) to show that filtering these finite-dimensional signals on the basis of observations generated as described above results in mixtures of Dirichlet and independent gamma distributions. Finally, we use again the characterisations of Dirichlet and gamma measures via their finite-dimensional distributions to obtain the main results in this paper: the filtering distributions in the Fleming–Viot model evolve in the family of finite mixtures of Dirichlet processes, and those in the Dawson–Watanabe model in the family of finite mixtures of gamma random measures, under the observation models considered. The validity of this argument is formally proved in Theorems 3.2 and 3.5. The resulting recursive procedures for Fleming–Viot and Dawson–Watanabe signals that describe how to compute the parameters of the mixtures at each observation time are given in Propositions 3.3 and 3.6, and the associated pseudo codes are outlined in Algorithms 1 and 2.

The paper is organised as follows. Section 1.2 briefly introduces some basic concepts on hidden Markov models. Section 1.3 provides a simple illustration of the underlying structures implied by previous results on filtering one-dimensional WF and CIR processes. These will be the reference examples throughout the paper and provide relevant intuition on our main results in terms of special cases, since the WF and CIR models are the one-dimensional projections of the infinite-dimensional families we consider here. Section 2 describes the two families of dependent random measures which are the object of this contribution, the Fleming–Viot and the Dawson–Watanabe diffusions, from a non-technical viewpoint. Connections of the dynamic models with their marginal or static sub-cases, given by Dirichlet and gamma random measures and well known in Bayesian nonparametrics, are emphasised. Section 3 presents and discusses the main results on the conjugacy properties of the two above families, given observation models as described earlier, together with the implied algorithms for recursive computation. All the technical details related to the strategy for proving the main results and to the duality structures associated with the signals are deferred to Section 4.

1.2 Hidden Markov models

Since our time-dependent Bayesian nonparametric models are formulated as hidden Markov models, we introduce here some basic related notions. A hidden Markov model (HMM) is a double sequence $((X_{t_i})_{i\ge0}, (Y_{t_i})_{i\ge0})$, where $(X_{t_i})_{i\ge0}$ is an unobserved Markov chain, called the latent signal, and the $Y_{t_i}$ are conditionally independent observations given the signal. Figure 2 provides a graphical representation of an HMM. We will assume here that the signal is the discrete-time sampling of a continuous-time Markov process with transition kernel $P_t(x, \mathrm{d}x')$. The signal parametrises the law of the observations, called the emission distribution; when this law admits a density, it will be denoted by $f_x(y)$.

Figure 2: Hidden Markov model represented as a graphical model.

Filtering an HMM optimally requires the sequential exact evaluation of the so-called filtering distributions $\nu_i$, i.e., the laws of the signal at different times given past and present observations. Denote $Y_{0:i} := (Y_{t_0}, \ldots, Y_{t_i})$ and let $\nu$ be the prior distribution for $X_{t_0}$. The exact or optimal filter is the solution of the recursion

$\nu_0 = \phi_{Y_{t_0}}(\nu), \qquad \nu_i = \phi_{Y_{t_i}}\big(\psi_{t_i - t_{i-1}}(\nu_{i-1})\big), \quad i \ge 1.$   (2)

This involves the following two operators acting on measures: the update operator, which when a density exists takes the form

$\phi_y(\nu)(\mathrm{d}x) = \frac{f_x(y)\,\nu(\mathrm{d}x)}{p_\nu(y)}, \qquad p_\nu(y) = \int_{\mathcal{X}} f_x(y)\,\nu(\mathrm{d}x),$   (3)

and the prediction operator

$\psi_t(\nu)(\mathrm{d}x') = \int_{\mathcal{X}} \nu(\mathrm{d}x)\, P_t(x, \mathrm{d}x').$   (4)

The update operation amounts to an application of Bayes' theorem to the currently available distribution, conditioning on the incoming data. The prediction operator propagates forward the current law of the signal over a time interval $t$, according to the transition kernel of the underlying continuous-time latent process. The above recursion (2) then alternates update given the incoming data and prediction of the latent signal as follows:

If $X_{t_0}$ has prior $\nu$, then $\nu_0 = \phi_{Y_{t_0}}(\nu)$ is the posterior conditional on the data observed at time $t_0$; $\nu_1 = \phi_{Y_{t_1}}(\psi_{t_1 - t_0}(\nu_0))$ is the law of the signal at time $t_1$ obtained by propagating $\nu_0$ over an interval of length $t_1 - t_0$ and conditioning on the data observed at times $t_0$ and $t_1$; and so on.
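Since the recursion and the two operators drive everything that follows, here is a minimal sketch (ours, for a finite-state signal with placeholder transition matrix and likelihood values, not the FV/DW models) of how (2)–(4) compose in practice:

```python
import numpy as np

def update(nu, likelihood):
    """Update operator (3): componentwise Bayes step nu_j * f_j(y), renormalised."""
    post = nu * likelihood
    return post / post.sum()

def predict(nu, P):
    """Prediction operator (4): propagate the current law through the kernel."""
    return nu @ P

def optimal_filter(nu_prior, likelihoods, transitions):
    """Recursion (2): nu_0 = update(prior), then alternate predict and update."""
    nu = update(nu_prior, likelihoods[0])
    out = [nu]
    for lik, P in zip(likelihoods[1:], transitions):
        nu = update(predict(nu, P), lik)
        out.append(nu)
    return out

# Toy usage: 2-state signal, observation likelihoods evaluated beforehand.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
liks = [np.array([0.7, 0.3]), np.array([0.1, 0.9])]
print(optimal_filter(np.array([0.5, 0.5]), liks, [P]))
```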

1.3 Illustration for CIR and WF signals

In order to appreciate the ideas behind the main theoretical results and the algorithms we develop in this article, we provide some intuition on the corresponding results for one-dimensional hidden Markov models based on Cox–Ingersoll–Ross (CIR) and Wright–Fisher (WF) signals. These are the one-dimensional projections of the DW and FV processes respectively, so informally we could say that a CIR process stands to a DW process as a gamma distribution stands to a gamma random measure, and a one-dimensional WF process stands to a FV process as a beta distribution stands to a Dirichlet process. The results illustrated in this section follow from Papaspiliopoulos and Ruggiero (2014) and are based on the interplay between computable filtering and duality of Markov processes, summarised later in Section 4.1. The developments in this article rely on these results, which are here extended to the infinite-dimensional case. We highlight the mechanisms underlying the explicit filters with the aid of figures, and postpone the mathematical details to Section 4.

First, let the signal be a one-dimensional Wright–Fisher diffusion on [0,1], with stationary distribution $\pi = \mathrm{Beta}(\theta_1, \theta_2)$ (see Section 2.1.2), which is also taken as the prior for the signal at time 0. The signal can be interpreted as the evolving frequency of type-1 individuals in a population of two types, whose individuals generate offspring of the parent type which may be subject to mutation. The observations are assumed to be Bernoulli with success probability given by the signal. Upon observation of $\mathbf{y}$, assuming it gives $m_1$ type-1 and $m_2$ type-2 individuals with $m_1 + m_2 = m$, the prior is updated as usual via Bayes' theorem to $\phi_{\mathbf{y}}(\pi) = \mathrm{Beta}(\theta_1 + m_1, \theta_2 + m_2)$. Here $\phi_{\mathbf{y}}$ is the update operator (3). A forward propagation of this distribution by time $t$ by means of the prediction operator (4) yields the finite mixture of Beta distributions

$\psi_t\big(\mathrm{Beta}(\theta_1 + m_1, \theta_2 + m_2)\big) = \sum_{i=0}^{m_1} \sum_{j=0}^{m_2} w_{(i,j)}(t)\, \mathrm{Beta}(\theta_1 + i,\ \theta_2 + j),$

whose mixing weights $w_{(i,j)}(t)$ depend on $t$ (see Lemma 4.1 below for their precise definition). The propagation of $\phi_{\mathbf{y}}(\pi)$ at time $t$ thus yields a mixture of Betas with $(m_1+1)(m_2+1)$ components. The Beta parameters range from $(\theta_1 + m_1, \theta_2 + m_2)$, which represents the full information provided by the collected data, to $(\theta_1, \theta_2)$, which represents null information on the data, so that the associated component coincides with the prior. It is useful to identify the indices of the mixture with the nodes of a graph, as in Figure 3-(b), where the red node represents the component with full information, and the yellow nodes the other components, including the prior identified by the origin.

Figure 3: The death process on the lattice modulates the evolution of the mixture weights in the filtering distributions of models with CIR (left) and WF (right) signals. Nodes on the graph identify mixture components in the filtering distribution. The mixture weights are assigned according to the probability that the death process, starting from the (red) node which encodes the current full information (here $y$ for the CIR and $(m_1, m_2)$ for the WF), is in a lower node after time $t$.

The time-varying mixing weights are the transition probabilities of an associated (dual) two-dimensional death process, which can be thought of as jumping to lower nodes in the graph of Figure 3-(b) at a specified rate in continuous time. The effect of these weights on the mixture is that, as time increases, the probability mass is shifted from components with parameters close to the full information $(\theta_1 + m_1, \theta_2 + m_2)$ to components which bear little to no information on the data. The mass shift reflects the progressive obsolescence of the data collected at $t_0$, as evaluated by the signal law at time $t_0 + t$ as $t$ increases.

Note that it is not obvious that (4) yields a finite mixture when $P_t$ is the transition kernel of a WF process, since $P_t$ admits an infinite series expansion (see Section 2.1.2). This has been proved rather directly in Chaleyat-Maurel and Genon-Catalot (2009), or can be obtained by combining results on optimal filtering with some duality properties of this model (see Papaspiliopoulos and Ruggiero (2014) or Section 4 here).
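Numerically, these weights are easy to approximate. The sketch below (ours; all sizes illustrative) estimates the death-process probabilities $d_{m,n}(t)$ by simulating a pure-death process with the rates $k(\theta + k - 1)/2$ appearing in Section 2.1.2, taking $\theta = \theta_1 + \theta_2$, and spreads the mass within each level hypergeometrically, as formalised later in Theorem 3.2:

```python
import numpy as np
from scipy.stats import hypergeom

# Monte Carlo sketch of the mixing weights w_{(i,j)}(t) of the WF example.
# Assumption: the dual death process jumps from k to k-1 at rate
# k(theta + k - 1)/2, and within each level n = i + j the mass is
# spread by a hypergeometric draw of the surviving multiplicities.

rng = np.random.default_rng(1)

def death_probs(m, t, theta, reps=20_000):
    """Estimate d_{m,n}(t) for n = 0..m by simulating the pure-death process."""
    counts = np.zeros(m + 1)
    for _ in range(reps):
        k, elapsed = m, 0.0
        while k > 0:
            elapsed += rng.exponential(2.0 / (k * (k + theta - 1)))
            if elapsed > t:
                break
            k -= 1
        counts[k] += 1
    return counts / reps

def wf_mixture_weights(m1, m2, t, theta):
    """Weights of the Beta(theta1 + i, theta2 + j) components, (i,j) <= (m1,m2)."""
    d = death_probs(m1 + m2, t, theta)
    w = {}
    for n in range(m1 + m2 + 1):
        for i in range(max(0, n - m2), min(m1, n) + 1):
            w[(i, n - i)] = d[n] * hypergeom.pmf(i, m1 + m2, m1, n)
    return w

weights = wf_mixture_weights(m1=3, m2=2, t=0.5, theta=2.0)
print(sum(weights.values()))   # ~1
```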

Consider now the model where the signal is a one-dimensional CIR diffusion on $\mathbb{R}_+$, with gamma stationary distribution (and prior at $t_0 = 0$) given by $\pi = \mathrm{Ga}(\alpha, \beta)$ (see Section 2.2.2). The observations are Poisson with intensity given by the current state of the signal. Before data are collected, the forward propagation of the signal distribution yields the same distribution by stationarity. Upon observation, at time $t_0$, of $m$ Poisson data points with total count $y = \sum_{i \le m} y_i$, the prior is updated via Bayes' theorem to

$\phi_{\mathbf{y}}(\pi) = \mathrm{Ga}(\alpha + y,\ \beta + m),$   (5)

yielding a jump in the measure-valued process; see Figure 4-(a). A forward propagation of $\phi_{\mathbf{y}}(\pi)$ yields the finite mixture of gamma distributions

$\psi_t\big(\mathrm{Ga}(\alpha + y, \beta + m)\big) = \sum_{i=0}^{y} w_i(t)\, \mathrm{Ga}(\alpha + i,\ \beta + \theta_t),$   (6)

whose mixing weights $w_i(t)$ also depend on $t$ (see Lemma 4.3 below for their precise definition). At time $t_0 + t$, the filtering distribution is a $(y+1)$-component mixture with the first gamma parameter ranging from full ($\alpha + y$) to null ($\alpha$) information with respect to the collected data (Figure 4-(b)). The time-dependent mixture weights are the transition probabilities of a certain associated (dual) one-dimensional death process, which can be thought of as jumping to lower nodes in the graph of Figure 3-(a) at a specified rate in continuous time.

Figure 4: Temporal evolution of the filtering distribution (solid black in right panels and marginal rightmost section of left panels) under the CIR model: (a) until the first data collection, the propagation preserves the prior/stationary distribution (red dotted in right panels); at the first data collection, the prior is updated to the posterior (blue dotted in right panels) via Bayes' theorem, determining a jump in the filtering process (left panel); (b) the forward propagation of the filtering distribution behaves as a finite mixture of gamma densities (weighted components dashed coloured in right panel); (c) in the absence of further data, the time-varying mixing weights shift mass towards the prior component, and the filtering distribution eventually returns to the stationary state.

Similarly to the WF model, the mixing weights shift mass from components whose first parameter is close to the full information, i.e. $\alpha + y$, to components which bear little to no information on the data. The time evolution of the mixing weights is depicted in Figure 5, where the cyan and blue lines are the weights of the components with full and no information on the data respectively.

Figure 5: Evolution of the mixture weights which drive the mixture distribution in Fig. 4. At the jump time 100 (the origin here), the mixture component with full posterior information (blue dotted in Fig. 4) has weight equal to 1 (cyan curve), and the other components have zero weight. As the filtering distribution is propagated forward, the weights evolve as transition probabilities of an associated death process. The mixture component equal to the prior distribution (red dotted in Fig. 4), which carries no information on the data, has weight (blue curve) that is 0 at the jump time when the posterior update occurs, and eventually goes back to 1 in the absence of further incoming observations, in turn determining the convergence of the mixture to the prior in Fig. 4.

As a result of the impact of these weights on the mixture, the latter converges, in the absence of further data, to the prior/stationary distribution as $t$ increases, as shown in Figure 4-(c). Unlike the WF case, in this model there is a second parameter, $\theta_t$ in (6), controlled by a deterministic (dual) process on $\mathbb{R}_+$ which subordinates the transitions of the death process; see Lemma 4.3. Roughly speaking, the death process on the graph controls the obsolescence of the observation counts $y$, whereas the deterministic process controls that of the sample size $m$. At the update time we have $\theta_0 = m$ as in (5), but $\theta_t$ is a deterministic, continuous and decreasing process, and in the absence of further data it converges to 0 as $t \to \infty$, restoring the prior parameter $\beta$ in the limit of (6). See Lemma 4.3 in Section 4 for the formal result for the one-dimensional CIR diffusion.
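To make this concrete, the following sketch (ours) implements the Bayes update (5) exactly and evaluates the mixture (6) with placeholder quantities: the binomial-thinning weights and exponentially decaying $\theta_t$ are stand-ins with the correct endpoints (posterior at $t = 0$, prior as $t \to \infty$), not the exact objects of Lemma 4.3.

```python
import numpy as np
from scipy.stats import binom, gamma as gamma_dist

# Sketch of the CIR filter between data collections. The update is the exact
# Bayes step (5); mixture_pdf evaluates (6) with PLACEHOLDER weights and dual
# theta_t, chosen only to reproduce the endpoints of the true evolution.

alpha, beta = 2.0, 1.0            # prior Ga(alpha, beta)
ys = np.array([3, 1, 2])          # Poisson observations at the update time
m, y = len(ys), int(ys.sum())     # sample size m and total count y
# Update (5): the filtering law jumps to Ga(alpha + y, beta + m).

def mixture_pdf(x, t, decay=1.0):
    p = np.exp(-decay * t)                    # placeholder survival probability
    w = binom.pmf(np.arange(y + 1), y, p)     # placeholder w_i(t): point mass at
                                              # i = y when t = 0, at i = 0 as t grows
    theta_t = m * p                           # placeholder dual: m -> 0
    return sum(w[i] * gamma_dist.pdf(x, a=alpha + i, scale=1.0 / (beta + theta_t))
               for i in range(y + 1))

xs = np.linspace(0.01, 10, 200)
posterior = mixture_pdf(xs, t=0.0)     # equals the Ga(alpha + y, beta + m) pdf
long_run = mixture_pdf(xs, t=10.0)     # approximately the prior Ga(alpha, beta)
```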

When more data samples are collected at different times, the update and propagation operations are alternated, resulting in jump processes for both the filtering distribution and the deterministic dual (Figure 6).

Figure 6: Evolution of the filtering distribution (a) and of the deterministic component of the dual process (b), which modulates the sample-size parameter in the mixture components, in the case of multiple data collections over time.

1.4 Preliminary notation

Although most of the notation is better introduced in the appropriate places, we collect here that which is used uniformly over the paper, to avoid recalling these objects several times throughout the text. In all subsequent sections, $\mathcal{X}$ will denote a locally compact Polish space which represents the observation space, $\mathcal{M}(\mathcal{X})$ is the associated space of finite Borel measures on $\mathcal{X}$, and $\mathcal{P}(\mathcal{X})$ its subspace of probability measures. A typical element $\alpha \in \mathcal{M}(\mathcal{X})$ will be such that

$\alpha = \theta P_0, \qquad \theta > 0, \quad P_0 \in \mathcal{P}(\mathcal{X}),$   (7)

where $\theta$ is the total mass of $\alpha$, and $P_0$ is sometimes called the centering or baseline distribution. We will assume here that $P_0$ has no atoms. Furthermore, for $\alpha$ as above, $\Pi_\alpha$ will denote the law on $\mathcal{P}(\mathcal{X})$ of a Dirichlet process, and $\Gamma^\alpha_\beta$ that on $\mathcal{M}(\mathcal{X})$ of a gamma random measure, with $\beta > 0$. These will be recalled formally in Sections 2.1.1 and 2.2.1.

We will denote by $X_t$ the Fleming–Viot process and by $Z_t$ the Dawson–Watanabe process, to be interpreted as $X = (X_t)_{t\ge0}$ and $Z = (Z_t)_{t\ge0}$ when written without argument. Hence $X$ and $Z$ take values in the spaces of continuous functions from $[0,\infty)$ to $\mathcal{P}(\mathcal{X})$ and to $\mathcal{M}(\mathcal{X})$ respectively, whereas the discrete measures $x$ and $z$ will denote the marginal states of $X$ and $Z$. We will write $X_t(A)$ and $Z_t(A)$ for their respective one-dimensional projections onto the Borel set $A \subset \mathcal{X}$. We adopt boldface notation to denote vectors, with the conventions

$\mathbf{m} = (m_1, \ldots, m_K), \qquad |\mathbf{m}| = m_1 + \cdots + m_K,$

where the dimension $K$ will be clear from the context unless specified. Accordingly, the Wright–Fisher model, closely related to projections of the Fleming–Viot process onto partitions, will be denoted $\mathrm{WF}(\boldsymbol{\alpha})$. We denote by $\mathbf{0}$ the vector of zeros and by $\mathbf{e}_i$ the vector whose only non-zero entry is a 1 at the $i$th coordinate. Let also "$<$" define a partial ordering on $\mathbb{Z}_+^K$, so that $\mathbf{m} < \mathbf{n}$ if $m_i \le n_i$ for all $i$ and $m_j < n_j$ for some $j$. Finally, we will use the compact notation $y_{1:m}$ for vectors of observations $(y_1, \ldots, y_m)$.

2 Hidden Markov measures

2.1 Fleming–Viot signals

2.1.1 The static model: Dirichlet processes and mixtures thereof

The Dirichlet process on a state space $\mathcal{X}$, introduced by Ferguson (1973) (see Ghosal (2010) for a recent review), is a discrete random probability measure $x \in \mathcal{P}(\mathcal{X})$. The process admits the series representation

$x = \sum_{i \ge 1} \frac{J_i}{\sum_{k \ge 1} J_k}\, \delta_{X_i},$   (8)

where $X_i \overset{\text{iid}}{\sim} P_0$ and $(J_i)_{i\ge1}$ are independent, and the $J_i$ are the jumps of a gamma process with mean measure determined by $\theta$. We will denote by $\Pi_\alpha$ the law of $x$ in (8), with $\alpha = \theta P_0$ as in (7).

Mixtures of Dirichlet processes were introduced in Antoniak (1974). We say that $x$ is a mixture of Dirichlet processes if

$x \mid u \sim \Pi_{\alpha_u}, \qquad u \sim H,$

where $\alpha_u$ denotes the measure $\alpha$ conditionally on $u$, or equivalently

$x \sim \int_{\mathcal{U}} \Pi_{\alpha_u}\, H(\mathrm{d}u).$   (9)

With a slight abuse of terminology we will also refer to the right-hand side of the last expression as a mixture of Dirichlet processes.

The Dirichlet process and mixtures thereof have two fundamental properties that are of great interest in statistical learning (Antoniak, 1974):

  • Conjugacy: let $x$ be as in (9). Conditionally on observations $y_{1:m} \overset{\text{iid}}{\sim} x$, we have

    $x \mid y_{1:m} \sim \int_{\mathcal{U}} \Pi_{\alpha_u + \sum_{i \le m} \delta_{y_i}}\, H_{y_{1:m}}(\mathrm{d}u),$

    where $H_{y_{1:m}}$ is the conditional distribution of $u$ given $y_{1:m}$. Hence a posterior mixture of Dirichlet processes is still a mixture of Dirichlet processes with updated parameters.

  • Projection: let $x$ be as in (9). For any measurable partition $A_1, \ldots, A_K$ of $\mathcal{X}$, we have

    $(x(A_1), \ldots, x(A_K)) \sim \int_{\mathcal{U}} \pi_{\boldsymbol{\alpha}_u}\, H(\mathrm{d}u),$

    where $\boldsymbol{\alpha}_u = (\alpha_u(A_1), \ldots, \alpha_u(A_K))$ and $\pi_{\boldsymbol{\alpha}}$ denotes the Dirichlet distribution with parameter $\boldsymbol{\alpha}$.

Letting $H$ be concentrated on a single point recovers the respective properties of the Dirichlet process as special case, i.e. $x \sim \Pi_\alpha$ and $y_{1:m} \mid x \overset{\text{iid}}{\sim} x$ imply respectively that $x \mid y_{1:m} \sim \Pi_{\alpha + \sum_{i \le m} \delta_{y_i}}$ and $(x(A_1), \ldots, x(A_K)) \sim \pi_{\boldsymbol{\alpha}}$, where $\boldsymbol{\alpha} = (\alpha(A_1), \ldots, \alpha(A_K))$.
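As a numerical companion to these properties (our sketch; the baseline $P_0 = \mathrm{Uniform}(0,1)$ and the truncation level are illustrative choices), a draw from $\Pi_\alpha$ can be generated by truncated stick-breaking (Sethuraman, 1994) and the projection property checked by Monte Carlo on a single set $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n_atoms = 5.0, 500        # total mass and stick-breaking truncation

def dp_draw():
    """Approximate draw from Pi_alpha: stick-breaking weights and atoms
    drawn iid from the baseline P0 = Uniform(0,1)."""
    v = rng.beta(1.0, theta, n_atoms)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    atoms = rng.uniform(size=n_atoms)
    return atoms, w

# Projection check: x(A) should be Beta(alpha(A), theta - alpha(A)), here
# Beta(1.5, 3.5) for A = [0, 0.3), whose mean is 0.3 and variance 0.035.
masses = []
for _ in range(2000):
    atoms, w = dp_draw()
    masses.append(w[atoms < 0.3].sum())
print(np.mean(masses), np.var(masses))
```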

2.1.2 The Fleming–Viot process

Fleming–Viot (FV) processes are a large family of diffusions taking values in the subspace of $\mathcal{P}(\mathcal{X})$ given by purely atomic probability measures. Hence they describe evolving discrete distributions whose support also varies with time and whose frequencies are each a diffusion on $[0,1]$. Two states of a FV process at successive times are depicted in Figure 7. See Ethier and Kurtz (1993) and Dawson (1993) for exhaustive reviews. Here we restrict attention to a subclass known as the (labelled) infinitely-many-neutral-alleles model with parent-independent mutation, henceforth for simplicity called the FV process, which has the law of a Dirichlet process as stationary measure (Ethier and Kurtz, 1993, Section 9.2).

Figure 7: Two states of a FV process on $\mathcal{X}$ at successive times (solid discrete measures): (a) the initial state, whose distribution has the baseline shown dotted; (b) after some time, the process reaches the stationary state, which has distribution $\Pi_\alpha$ (baseline dashed).

One of the most intuitive ways to understand a FV process is to consider its transition function, found in Ethier and Griffiths (1993). This is given by

$P_t(x, \mathrm{d}x') = \sum_{n=0}^{\infty} d_n(t) \int_{\mathcal{X}^n} \Pi_{\alpha + \sum_{i=1}^{n} \delta_{y_i}}(\mathrm{d}x')\, x^n(\mathrm{d}y_{1:n}),$   (10)

where $x^n$ denotes the $n$-fold product measure and $\Pi_{\alpha + \sum_{i \le n} \delta_{y_i}}$ is a posterior Dirichlet process as defined in the previous section. The expression (10) has a nice interpretation from the Bayesian learning viewpoint. Given the current state of the process $x$, with probability $d_n(t)$ an $n$-sized sample $y_{1:n}$ from $x$ is taken, and the arrival state is sampled from the posterior law $\Pi_{\alpha + \sum_{i \le n} \delta_{y_i}}$. Here $d_n(t)$ is the probability that a $\mathbb{Z}_+ \cup \{\infty\}$-valued death process which starts at infinity at time 0 is in $n$ at time $t$, if it jumps from $m$ to $m-1$ at rate $m(\theta + m - 1)/2$. See Tavaré (1984) for details. Hence a larger $t$ implies sampling a lower amount of information from $x$ with higher probability, resulting in fewer atoms shared by $x$ and $x'$. Hence the starting and arrival states have correlation which decreases in $t$, as controlled by $\theta$. As $t \to 0$, infinitely many samples are drawn from $x$, $x'$ will coincide with $x$, and the trajectories are continuous in total variation norm (Ethier and Kurtz, 1993). As $t \to \infty$, the fact that the death process which governs the probabilities in (10) is eventually absorbed in 0 implies that $d_0(t) \to 1$, so $x'$ is sampled from the prior $\Pi_\alpha$. Therefore this FV is stationary with respect to $\Pi_\alpha$ (in fact, it is also reversible). It follows that, using terms familiar to the Bayesian literature, under this parametrisation the FV can be considered as a dependent Dirichlet process with continuous sample paths. Constructions of Fleming–Viot and closely related processes using ideas from Bayesian nonparametrics have been proposed in Walker et al. (2007); Favaro, Ruggiero and Walker (2009); Ruggiero and Walker (2009a; b). Different classes of diffusive dependent Dirichlet processes or related models are constructed in Mena and Ruggiero (2016); Mena et al. (2011), based on the stick-breaking representation (Sethuraman, 1994).
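This sampling reading of (10) translates directly into a simulation scheme. The sketch below (ours; the cap standing in for the death process' start at infinity, the truncation level, and the Uniform(0,1) baseline are all illustrative choices) draws one FV transition:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 1.0   # total mass of alpha; P0 = Uniform(0,1) is illustrative

def death_from_infinity(t, cap=500):
    """Sample n approximately from d_n(t): run the death process with rates
    k(theta + k - 1)/2 downwards from a large cap standing in for infinity."""
    k, elapsed = cap, 0.0
    while k > 0:
        elapsed += rng.exponential(2.0 / (k * (k + theta - 1)))
        if elapsed > t:
            break
        k -= 1
    return k

def dp_posterior_draw(y, n_atoms=500):
    """Truncated stick-breaking draw from Pi_{alpha + sum_i delta_{y_i}}: atoms
    are iid from the normalised base measure (alpha + sum_i delta)/(theta + n)."""
    mass = theta + len(y)
    v = rng.beta(1.0, mass, n_atoms)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    fresh = rng.uniform(size=n_atoms)                    # new atoms from P0
    old = rng.choice(y, size=n_atoms) if len(y) else fresh
    atoms = np.where(rng.uniform(size=n_atoms) < theta / mass, fresh, old)
    return atoms, w

def fv_step(atoms, weights, t):
    """One transition of the FV process following the reading of (10)."""
    n = death_from_infinity(t)
    y = rng.choice(atoms, size=n, p=weights)             # n-sized sample from x
    return dp_posterior_draw(y)
```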

Projecting a FV process onto a measurable partition $A_1, \ldots, A_K$ of $\mathcal{X}$ yields a $K$-dimensional Wright–Fisher (WF) diffusion, which is reversible and stationary with respect to the Dirichlet distribution $\pi_{\boldsymbol{\alpha}}$, for $\boldsymbol{\alpha} = (\alpha(A_1), \ldots, \alpha(A_K))$. See Dawson (2010); Etheridge (2009). This property is the dynamic counterpart of the projective property of Dirichlet processes discussed in Section 2.1.1. Consistently, the transition function of a WF process is obtained as a specialisation of the FV case, yielding

$P_t(\mathbf{x}, \mathrm{d}\mathbf{x}') = \sum_{n=0}^{\infty} d_n(t) \sum_{|\mathbf{m}| = n} \binom{n}{\mathbf{m}}\, \mathbf{x}^{\mathbf{m}}\, \pi_{\boldsymbol{\alpha} + \mathbf{m}}(\mathrm{d}\mathbf{x}'),$   (11)

with analogous interpretation to (10). See Ethier and Griffiths (1993).

For statistical modelling it is useful to introduce a further parameter that controls the speed of the process. This can be done by defining the time change $X_{\sigma t}$, with $\sigma > 0$. In such a parameterisation, $\sigma$ does not affect the stationary distribution of the process, and can be used to model the dependence structure.

2.2 Dawson–Watanabe signals

2.2.1 The static model: gamma random measures and mixtures thereof

Gamma random measures (Lo, 1982) can be thought of as the counterpart of Dirichlet processes in the context of finite intensity measures. A gamma random measure with shape parameter $\alpha$ as in (7) and rate parameter $\beta > 0$, denoted $\Gamma^\alpha_\beta$, admits the representation

$z = \beta^{-1} \sum_{i \ge 1} J_i\, \delta_{X_i},$   (12)

with $X_i$ and $J_i$ as in (8).

Similarly to the definition of mixtures of Dirichlet processes (Section 2.1.1), we say that $z$ is a mixture of gamma random measures if $z \mid u \sim \Gamma^{\alpha_u}_{\beta_u}$ with $u \sim H$, that is $z \sim \int_{\mathcal{U}} \Gamma^{\alpha_u}_{\beta_u}\, H(\mathrm{d}u)$, and with a slight abuse of terminology we will also refer to the right-hand side of the last expression as a mixture of gamma random measures. Analogous conjugacy and projection properties to those seen for mixtures of Dirichlet processes hold for mixtures of gamma random measures:

  • Conjugacy: let $N$ be a Poisson point process on $\mathcal{X}$ with random intensity measure $z$, i.e., conditionally on $z$, for any measurable partition $A_1, \ldots, A_K$ of $\mathcal{X}$, $N(A_j) \mid z \overset{\text{ind}}{\sim} \mathrm{Po}(z(A_j))$. Let $z \sim \int_{\mathcal{U}} \Gamma^{\alpha_u}_{\beta_u}\, H(\mathrm{d}u)$, and given $z$, let $y_{1:m}$ be a realisation of the $m$ points of $N$, so that

    $y_1, \ldots, y_m \mid z \overset{\text{iid}}{\sim} z/|z|, \qquad m \mid z \sim \mathrm{Po}(|z|),$   (13)

    where $|z| = z(\mathcal{X})$ is the total mass of $z$. Then

    $z \mid y_{1:m} \sim \int_{\mathcal{U}} \Gamma^{\alpha_u + \sum_{i \le m} \delta_{y_i}}_{\beta_u + 1}\, H_{y_{1:m}}(\mathrm{d}u),$   (14)

    where $H_{y_{1:m}}$ is the conditional distribution of $u$ given $y_{1:m}$. Hence mixtures of gamma random measures are conjugate with respect to Poisson point process data.

  • Projection: for any measurable partition $A_1, \ldots, A_K$ of $\mathcal{X}$, we have

    $(z(A_1), \ldots, z(A_K)) \sim \int_{\mathcal{U}} \prod_{j=1}^{K} \mathrm{Ga}(\alpha_u(A_j), \beta_u)\, H(\mathrm{d}u),$

    where $\mathrm{Ga}(c, \beta)$ denotes the gamma distribution with shape $c$ and rate $\beta$.

Letting $H$ be concentrated on a single point recovers the respective properties of gamma random measures as special case, i.e. $z \sim \Gamma^\alpha_\beta$ and $y_{1:m}$ as in (13) imply $z \mid y_{1:m} \sim \Gamma^{\alpha + \sum_{i \le m} \delta_{y_i}}_{\beta + 1}$, and the vector $(z(A_1), \ldots, z(A_K))$ has independent components with gamma distribution $\mathrm{Ga}(\alpha(A_j), \beta)$, $j = 1, \ldots, K$.
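A minimal numerical rendering of the two properties (our sketch, working only through a fixed partition, with all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, beta = 4.0, 2.0
alpha_A = np.array([0.25, 0.25, 0.5]) * theta   # alpha(A_j) over a partition

# Projection: (z(A_1),...,z(A_K)) are independent Ga(alpha(A_j), beta).
z_A = rng.gamma(shape=alpha_A, scale=1.0 / beta)

# One Poisson process realisation: counts N(A_j) | z ~ Po(z(A_j)), independent.
counts = rng.poisson(z_A)

# Conjugacy (14) with H a point mass: the shape gains the observed counts
# (the atoms delta_{y_i} fall in the corresponding sets), the rate gains +1.
post_shape, post_rate = alpha_A + counts, beta + 1.0
print(post_shape, post_rate)
```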

Finally, it is well known that (8) and (12) satisfy the relation in distribution

$z \overset{d}{=} S\, x,$   (15)

where $S \sim \mathrm{Ga}(\theta, \beta)$ is independent of $x \sim \Pi_\alpha$. This extends to the infinite-dimensional case the well-known relationship between beta and gamma random variables. See for example Daley and Vere-Jones (2008), Example 9.1(e). See also Konno and Shiga (1988) for an extension of (15) to the dynamic case concerning FV and DW processes, which requires a random time change.

2.2.2 The Dawson–Watanabe process

Dawson–Watanabe (DW) processes can be considered as dependent models for gamma random measures, and are, roughly speaking, the gamma counterpart of FV processes. More formally, they are branching measure-valued diffusions taking values in the space of finite discrete measures. As in the FV case, they describe evolving discrete measures whose support varies with time and whose masses are each a positive diffusion, but relaxing the constraint of their masses summing to one to that of summing to a finite quantity. See Dawson (1993) and Li (2011) for reviews. Here we are interested in the special case of subcritical branching with immigration, where subcriticality refers to the fact that in the underlying branching population which can be used to construct the process, the mean number of offspring per individual is less than one. Specifically, we will consider DW processes with transition function

$P_t(z, \mathrm{d}z') = \sum_{n=0}^{\infty} d^{|z|}_n(t) \int_{\mathcal{X}^n} \Gamma^{\alpha + \sum_{i=1}^{n} \delta_{y_i}}_{\beta_t}(\mathrm{d}z')\, (z/|z|)^n(\mathrm{d}y_{1:n}),$   (16)

where $\beta_t > \beta$ is a deterministic function of $t$, decreasing to $\beta$ as $t \to \infty$.

See Ethier and Griffiths (1993b). The interpretation of (16) is similar to that of (10): conditional on the current state, i.e. the measure $z$, $n$ iid samples are drawn from the normalised measure $z/|z|$, and the arrival state is sampled from $\Gamma^{\alpha + \sum_{i \le n} \delta_{y_i}}_{\beta_t}$. Here the main structural difference with respect to (10), apart from the different distributions involved, is that since in general $\beta_t - \beta$ is not an integer quantity, the interpretation as sampling the arrival state from a posterior gamma law is not formally correct; cf. (14). The sample size $n$ is chosen with probability $d^{|z|}_n(t)$, which is the probability that a $\mathbb{Z}_+ \cup \{\infty\}$-valued death process which starts at infinity at time 0 is in $n$ at time $t$, if it jumps from $m$ to $m-1$ at a suitable rate depending on $|z|$. See Ethier and Griffiths (1993b) for details. So $z$ and $z'$ will share fewer atoms the farther they are apart in time. The DW process with the above transition is known to be stationary and reversible with respect to the law of a gamma random measure; cf. (12). See Shiga (1990); Ethier and Griffiths (1993b). The Dawson–Watanabe process has been recently considered as a basis to build time-dependent gamma process priors with Markovian evolution in Caron and Teh (2012) and Spanò and Lijoi (2016).

The DW process satisfies a projective property similar to that seen in Section 2.1.2 for the FV process. Let $Z$ have transition function (16). Given a measurable partition $A_1, \ldots, A_K$ of $\mathcal{X}$, the vector $(Z_t(A_1), \ldots, Z_t(A_K))$ has independent components, each driven by a Cox–Ingersoll–Ross (CIR) diffusion (Cox, Ingersoll and Ross, 1985). These are also subcritical continuous-state branching processes with immigration, reversible and ergodic with respect to a $\mathrm{Ga}(\alpha(A_j), \beta)$ distribution, with transition function

$P_t(w, \mathrm{d}w') = \sum_{n=0}^{\infty} \frac{(w h_t)^n}{n!}\, e^{-w h_t}\, \mathrm{Ga}(\alpha(A_j) + n,\ \beta + h_t)(\mathrm{d}w'),$   (17)

for a deterministic function $h_t$ decreasing to 0 as $t \to \infty$.

As for FV and WF processes, a further parameter that controls the speed of the process can be introduced without affecting the stationary distribution. This can be done by defining an appropriate time change that can be used to model the dependence structure.
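Under the Poisson–gamma mixture reading of (17), CIR transitions can be simulated exactly. In the sketch below (ours), the parametrisation with stationary law $\mathrm{Ga}(\alpha, \beta)$ and mean-reversion rate $\kappa$ is an assumption made for illustration, under which $h_t = \beta e^{-\kappa t}/(1 - e^{-\kappa t})$:

```python
import numpy as np

# Exact draw from a CIR transition in Poisson-gamma mixture form: given the
# current state x, N ~ Po(x * h_t) and the arrival state is Ga(alpha + N,
# beta + h_t). ASSUMED parametrisation: stationary law Ga(alpha, beta),
# mean-reversion rate kappa, h_t = beta * exp(-kappa t)/(1 - exp(-kappa t)).

rng = np.random.default_rng(4)

def cir_step(x, t, alpha, beta, kappa=1.0):
    h_t = beta * np.exp(-kappa * t) / (1.0 - np.exp(-kappa * t))
    n = rng.poisson(x * h_t)
    return rng.gamma(shape=alpha + n, scale=1.0 / (beta + h_t))

# As t grows, h_t -> 0 and the draw tends to the stationary Ga(alpha, beta);
# as t -> 0, h_t -> infinity and the draw concentrates near the current x.
path = [1.0]
for _ in range(100):
    path.append(cir_step(path[-1], t=0.1, alpha=2.0, beta=1.0))
```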

3 Conjugacy properties of time-evolving Dirichlet and gamma random measures

3.1 Filtering Fleming–Viot signals

Let the latent signal $X$ be a FV process with transition function (10). We assume that, given the signal state $x$, observations are drawn independently from $x$, i.e. as in (1). Since $x$ is almost surely discrete (Blackwell, 1973), a sample $y_{1:m}$ from $x$ will feature ties among the observations with positive probability. Denote by $(y^*_1, \ldots, y^*_k)$ the distinct values in $y_{1:m}$ and by $\mathbf{m} = (m_1, \ldots, m_k)$ the associated multiplicities, so that $\sum_{j \le k} m_j = m$. When an additional sample with multiplicities $\mathbf{n}$ becomes available, we adopt the convention that $\mathbf{n}$ adds up to the multiplicities of the types already recorded in $y_{1:m}$, so that the total multiplicities count is

$\mathbf{m} + \mathbf{n} = (m_1 + n_1, \ldots, m_k + n_k, n_{k+1}, \ldots, n_{k+h}),$   (18)

where $n_{k+1}, \ldots, n_{k+h}$ are the multiplicities of the $h$ new distinct values in the additional sample.

The following Lemma states, in our notation, the special case of the conjugacy for mixtures of Dirichlet processes which is of interest here; see Section 2.1.1. To this end, let

$M := \bigcup_{k \ge 0} \mathbb{Z}_+^{k}$   (19)

be the space of multiplicities of types, with partial ordering defined as in Section 1.4. Denote also by $p(\mathbf{y} \mid \mathbf{m})$ the joint distribution of an additional sample $\mathbf{y}$, given previous multiplicities $\mathbf{m}$, when the random measure is marginalised out; this follows the Blackwell–MacQueen Pólya urn scheme (Blackwell and MacQueen, 1973), whereby

$y_{m+i+1} \mid y_{1:m+i} \sim \frac{\alpha + \sum_{j \le m+i} \delta_{y_j}}{\theta + m + i}, \qquad i \ge 0.$

Lemma 3.1.

Let $\alpha = \theta P_0$ be as in (7) and let $x$ be the mixture of Dirichlet processes

$x \sim \sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}}\, \Pi_{\alpha + \sum_{j \le k} m_j \delta_{y^*_j}}, \qquad \Lambda \subset M \text{ finite},$

with $\sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}} = 1$. Given an additional $n$-sized sample $\mathbf{y}$ from $x$ with multiplicities $\mathbf{n}$, the update operator (3) yields

$\phi_{\mathbf{y}}\Big(\sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}}\, \Pi_{\alpha + \sum_j m_j \delta_{y^*_j}}\Big) = \sum_{\mathbf{m} \in \Lambda} \hat{w}_{\mathbf{m}}\, \Pi_{\alpha + \sum_j (m_j + n_j) \delta_{y^*_j}},$   (20)

where

$\hat{w}_{\mathbf{m}} \propto w_{\mathbf{m}}\, p(\mathbf{y} \mid \mathbf{m}), \qquad \sum_{\mathbf{m} \in \Lambda} \hat{w}_{\mathbf{m}} = 1.$   (21)

The updated distribution is thus still a mixture of Dirichlet processes, with different multiplicities and possibly new atoms in the parameter measures.

The following Theorem formalises our main result on FV processes, showing that the family of finite mixtures of Dirichlet processes is conjugate with respect to discretely sampled data as in (1). For $\mathbf{m} \in M$ and $\Lambda \subset M$, with $M$ as in (19), let

$L(\mathbf{m}) := \{\mathbf{n} \in M : \mathbf{n} \le \mathbf{m}\}, \qquad L(\Lambda) := \bigcup_{\mathbf{m} \in \Lambda} L(\mathbf{m})$   (22)

be the sets of nonnegative vectors lower than or equal to $\mathbf{m}$, or to those in $\Lambda$, respectively, with the partial ordering defined as in Section 1.4. For example, in Figure 3, these sets are given by all yellow and red nodes in each case. Let also

$\mathrm{HG}(\mathbf{n};\, \mathbf{m}, |\mathbf{n}|) := \binom{|\mathbf{m}|}{|\mathbf{n}|}^{-1} \prod_{j} \binom{m_j}{n_j}$   (23)

be the multivariate hypergeometric probability function, with parameters $(\mathbf{m}, |\mathbf{n}|)$, evaluated at $\mathbf{n}$.

Theorem 3.2.

Let $\psi_t$ be the prediction operator (4) associated to a FV process with transition function (10). Then the prediction operator yields as $t$-time-ahead propagation the finite mixture of Dirichlet processes

$\psi_t\big(\Pi_{\alpha + \sum_{j \le k} m_j \delta_{y^*_j}}\big) = \sum_{\mathbf{n} \in L(\mathbf{m})} p_{\mathbf{m},\mathbf{n}}(t)\, \Pi_{\alpha + \sum_{j \le k} n_j \delta_{y^*_j}},$   (24)

with $L(\mathbf{m})$ as in (22) and where

$p_{\mathbf{m},\mathbf{n}}(t) = d_{|\mathbf{m}|,|\mathbf{n}|}(t)\, \mathrm{HG}(\mathbf{n};\, \mathbf{m}, |\mathbf{n}|),$   (25)

with $d_{m,n}(t)$ the transition probabilities of the death process of Section 2.1.2, which jumps from $m$ to $m-1$ at rate $m(\theta + m - 1)/2$,

and $\mathrm{HG}(\mathbf{n};\, \mathbf{m}, |\mathbf{n}|)$ as in (23).

The transition operator of the FV process thus maps a Dirichlet process at time $t_0$ into a finite mixture of Dirichlet processes at time $t_0 + t$. The mixing weights are the transition probabilities of a death process on the $k$-dimensional lattice, with $k$ being, as in (24), the number of distinct values in previous data. The result is obtained by means of the argument described in Figure 1, which is based on the property that the operations of propagating and projecting the signal commute. By projecting the current distribution of the signal onto an arbitrary measurable partition, yielding a mixture of Dirichlet distributions, we can exploit the results for finite-dimensional WF signals to obtain the associated propagation (Papaspiliopoulos and Ruggiero, 2014). The propagation of the original signal is then recovered by means of the characterisation of mixtures of Dirichlet processes via their projections. See Section 4.2 for a proof. In particular, the result shows that under these assumptions the series expansion for the transition function (10) reduces to a finite sum.
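Concretely, the weights (25) can be assembled from the death-process level probabilities and the multivariate hypergeometric function (23). The sketch below (ours) takes the level probabilities as an input, since they can be estimated by Monte Carlo as in the Section 1.3 illustration or computed from their series expansion:

```python
from itertools import product
from math import comb, prod

def hypergeom_pmf(n_vec, m_vec):
    """Multivariate hypergeometric probability (23): the chance that |n| draws
    without replacement from |m| items, m_j of type j, yield n_j of each type."""
    return prod(comb(mj, nj) for mj, nj in zip(m_vec, n_vec)) \
        / comb(sum(m_vec), sum(n_vec))

def propagation_weights(m_vec, d):
    """Weights (25): p_{m,n}(t) = d_{|m|,|n|}(t) * HG(n; m, |n|) over all n <= m.
    Here d[l] is the probability that the death process started at |m| is at
    level l at time t; d must have length |m| + 1."""
    weights = {}
    for n_vec in product(*(range(mj + 1) for mj in m_vec)):
        weights[n_vec] = d[sum(n_vec)] * hypergeom_pmf(n_vec, m_vec)
    return weights

# Example: m = (2, 1); with any probability vector d over levels 0..3 the
# weights sum to 1, since the hypergeometric masses sum to 1 on each level.
w = propagation_weights((2, 1), [0.25, 0.25, 0.25, 0.25])
assert abs(sum(w.values()) - 1.0) < 1e-12
```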

Iterating the update and propagation operations provided by Lemma 3.1 and Theorem 3.2 allows sequential Bayesian inference on a hidden signal of FV type to be performed by means of a finite computation. Here the finiteness refers to the fact that the infinite dimensionality due to the transition function of the signal is avoided analytically, without resorting to any stochastic truncation method for (10), e.g. Walker (2007); Papaspiliopoulos and Roberts (2008), and the computation can be conducted in closed form.

The following Proposition formalises the recursive algorithm that sequentially evaluates the marginal posterior laws of a partially observed FV process by alternating the update and propagation operations, and identifies the family of distributions which is closed with respect to these operations. Define the family of finite mixtures of Dirichlet processes

$\mathcal{F} := \Big\{ \sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}}\, \Pi_{\alpha + \sum_{j} m_j \delta_{y^*_j}} :\ \Lambda \subset M \text{ finite},\ w_{\mathbf{m}} \ge 0,\ \sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}} = 1 \Big\},$

with $M$ as in (19) and for a fixed $\alpha$ as in (7). Define also $\mathbf{n}(\mathbf{y})$ to be the multiplicities of the distinct values in a sample $\mathbf{y}$, arranged so that $\mathbf{m} + \mathbf{n}(\mathbf{y})$ is (18) if $\mathbf{m}$ are the multiplicities of the previous data, and

$\Lambda + \mathbf{n}(\mathbf{y}) := \{\mathbf{m} + \mathbf{n}(\mathbf{y}) : \mathbf{m} \in \Lambda\}.$   (26)
Proposition 3.3.

Let $X$ be a FV process with transition function (10) and invariant law $\Pi_\alpha$ defined as in Section 2.1.1, and suppose data are collected as in (1). Then $\mathcal{F}$ is closed under the application of the update and prediction operators (3) and (4). Specifically,

$\phi_{\mathbf{y}}\Big(\sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}}\, \Pi_{\alpha + \sum_j m_j \delta_{y^*_j}}\Big) = \sum_{\mathbf{n} \in \Lambda + \mathbf{n}(\mathbf{y})} \hat{w}_{\mathbf{n}}\, \Pi_{\alpha + \sum_j n_j \delta_{y^*_j}},$   (27)

with $\Lambda + \mathbf{n}(\mathbf{y})$ as in (26),

$\hat{w}_{\mathbf{m} + \mathbf{n}(\mathbf{y})} \propto w_{\mathbf{m}}\, p(\mathbf{y} \mid \mathbf{m}), \qquad \mathbf{m} \in \Lambda,$

and

$\psi_t\Big(\sum_{\mathbf{m} \in \Lambda} w_{\mathbf{m}}\, \Pi_{\alpha + \sum_j m_j \delta_{y^*_j}}\Big) = \sum_{\mathbf{n} \in L(\Lambda)} \tilde{w}_{\mathbf{n}}(t)\, \Pi_{\alpha + \sum_j n_j \delta_{y^*_j}},$   (28)

with

$\tilde{w}_{\mathbf{n}}(t) = \sum_{\mathbf{m} \in \Lambda,\ \mathbf{m} \ge \mathbf{n}} w_{\mathbf{m}}\, p_{\mathbf{m},\mathbf{n}}(t),$   (29)

and $p_{\mathbf{m},\mathbf{n}}(t)$ as in (25).

Note that the update operation (27) preserves the number of components in the mixture, while the prediction operation (24) increases this number. The intuition behind this point is analogous to the illustration in Section 1.3, where the prior (node $\mathbf{0}$) is updated to the posterior (node $\mathbf{m}$) and propagated into a mixture (coloured nodes), with the obvious difference that here the recorded number of distinct values is unbounded.

Algorithm 1 describes in pseudo-code the implementation of the filter for FV processes.

Data: $y_{t_i}$ at times $t_0, t_1, t_2, \ldots$, as in (1)
Set prior parameters $\theta$, $P_0$
Initialise
        $\Lambda = \{\mathbf{0}\}$, $w_{\mathbf{0}} = 1$, $k = 0$, $\mathbf{m} = \mathbf{0}$
For $i = 0, 1, 2, \ldots$
        Compute data summaries
               read data $y_{t_i}$
               update the list of distinct values $y^*_1, \ldots, y^*_k$
               compute the multiplicities $\mathbf{n}$ of $y_{t_i}$, cf. (18)
        Update operation
               for $\mathbf{m} \in \Lambda$
                      compute $\hat{w}_{\mathbf{m}} \propto w_{\mathbf{m}}\, p(y_{t_i} \mid \mathbf{m})$, cf. (21)
                      set $\mathbf{m} \leftarrow \mathbf{m} + \mathbf{n}$
               normalise the updated weights
               set $\Lambda \leftarrow \Lambda + \mathbf{n}$, cf. (26)
        Propagation operation
               for $\mathbf{n} \in L(\Lambda)$
                      compute $\tilde{w}_{\mathbf{n}}(t_{i+1} - t_i)$ as in (29)
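As a companion to the update step of Algorithm 1, the following sketch (ours) reweights and shifts the mixture components via the Pólya-urn predictive of Lemma 3.1; multiplicity vectors are zero-padded for newly observed distinct values, and factors common to all components (such as the baseline density at new values) are dropped, since they cancel in the normalisation:

```python
def polya_predictive(m_vec, n_vec, theta):
    """Sequence probability, up to a factor common to all components, of extra
    multiplicities n_vec under the component Pi_{alpha + sum_j m_j delta_j}.
    Coordinates with m_vec[j] == 0 are newly observed distinct values."""
    p, tot = 1.0, theta + sum(m_vec)
    for mj, nj in zip(m_vec, n_vec):
        for r in range(nj):
            # first draw of a brand-new value gets base-measure mass theta
            p *= (theta if (mj == 0 and r == 0) else mj + r) / tot
            tot += 1
    return p

def update_weights(components, n_vec, theta):
    """One update operation: w_m <- w_m * p(y | m), renormalised, and each
    multiplicity vector shifted by the new counts; cf. (18), (20), (21)."""
    new = {}
    for m_vec, w in components.items():
        shifted = tuple(mj + nj for mj, nj in zip(m_vec, n_vec))
        new[shifted] = w * polya_predictive(m_vec, n_vec, theta)
    total = sum(new.values())
    return {key: val / total for key, val in new.items()}

# Example: two components over one recorded type, zero-padded for a new type.
comps = {(2, 0): 0.6, (1, 0): 0.4}
print(update_weights(comps, n_vec=(1, 2), theta=1.0))
```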