Alpha-diversity processes and normalized inverse-Gaussian diffusions

Alpha-diversity processes and normalized inverse-Gaussian diffusions

[ [    [ [    [ [ University of Torino, University of Kent and University of Torino M. Ruggiero
S. Favaro
Department of Economics
 and Statistics
University of Torino
Corso Unione Sovietica 218/bis
10134, Torino
Italy
\printeade1
E-mail: \printead*e3
S. G. Walker
Institute of Mathematics, Statistics
 and Actuarial Science
University of Kent
CT2 7NZ, Canterbury
United Kingdom
\printeade2
\smonth9 \syear2010\smonth1 \syear2012
\smonth9 \syear2010\smonth1 \syear2012
\smonth9 \syear2010\smonth1 \syear2012
Abstract

The infinitely-many-neutral-alleles model has recently been extended to a class of diffusion processes associated with Gibbs partitions of two-parameter Poisson–Dirichlet type. This paper introduces a family of infinite-dimensional diffusions associated with a different subclass of Gibbs partitions, induced by normalized inverse-Gaussian random probability measures. Such diffusions describe the evolution of the frequencies of infinitely-many types together with the dynamics of the time-varying mutation rate, which is driven by an -diversity diffusion. Constructed as a dynamic version, relative to this framework, of the corresponding notion for Gibbs partitions, the latter is explicitly derived from an underlying population model and shown to coincide, in a special case, with the diffusion approximation of a critical Galton–Watson branching process. The class of infinite-dimensional processes is characterized in terms of its infinitesimal generator on an appropriate domain, and shown to be the limit in distribution of a certain sequence of Feller diffusions with finitely-many types. Moreover, a discrete representation is provided by means of appropriately transformed Moran-type particle processes, where the particles are samples from a normalized inverse-Gaussian random probability measure. The relationship between the limit diffusion and the two-parameter model is also discussed.

[
\kwd
\doi

10.1214/12-AAP846 \volume23 \issue1 2013 \firstpage386 \lastpage425

\runtitle

Normalized inverse-Gaussian diffusions

{aug}

A]\fnmsMatteo \snmRuggiero\corref\thanksreft1label=e1]matteo.ruggiero@unito.it, B]\fnmsStephen G. \snmWalkerlabel=e2]S.G.Walker@kent.ac.uk and A]\fnmsStefano \snmFavaro\thanksreft1label=e3]stefano.favaro@unito.it

\thankstext

t1Supported by the European Research Council (ERC) through StG “N-BNP” 306406. Also affiliated with Collegio Carlo Alberto, Moncalieri, Italy.

class=AMS] \kwd60J60 \kwd60G57 \kwd92D25. Gibbs partitions \kwdPoisson–Dirichlet \kwdgeneralized gamma \kwdinfinitely-many-neutral-alleles model \kwdtime-varying mutation rate.

1 Introduction

Considerable attention has been devoted recently to a class of diffusion processes which extends the infinitely-many-neutral-alleles model to the case of two parameters. This family takes values in the space

(1)

namely, the closure in of the infinite-dimensional ordered simplex, and is characterized, for constants and , by the second order differential operator

(2)

acting on a certain dense sub-algebra of the space of continuous functions on (throughout the paper denotes Kronecker delta). The diffusion with operator (2) describes the evolution of the allelic frequencies at a particular locus in a large population subject to random genetic drift and mutation, where mutation is jointly driven by the parameters . Ethier and Kurtz (1981) characterized the corresponding process when , whereas the two-parameter family was introduced by Petrov (2009) and further investigated by Ruggiero and Walker (2009) and Feng and Sun (2010). The latter is known to be stationary, reversible and ergodic with respect to the Poisson–Dirichlet distribution with parameters . This was introduced by Pitman (1995) [see also Pitman (1996) and Pitman and Yor (1997)] and extends the Poisson–Dirichlet distribution of Kingman (1975) as follows. Consider a random sequence obtained by means of the so-called stick-breaking scheme

(3)

where and . The vector is said to have the GEM distribution with parameters , while the vector of descending order statistics is said to have the Poisson–Dirichlet distribution with parameters . The latter is also the law of the ranked frequencies of an infinite partition induced by a two-parameter Poisson–Dirichlet random probability measure, which generalizes the Dirichlet process introduced by Ferguson (1973). Two-parameter Poisson–Dirichlet models have found applications in several fields. See, for example, the monographs by Bertoin (2006) for fragmentation and coalescent theory, Pitman (2006) for excursion theory and combinatorics, Teh and Jordan (2010) for machine learning, Lijoi and Prünster (2010) for Bayesian inference and Feng (2010) for population genetics. See also Bertoin (2008), Handa (2009) and Favaro et al. (2009).

The Poisson–Dirichlet distribution and its two parameter extension in turn belong to a larger class of random discrete distributions induced by infinite partitions of Gibbs type. These were introduced by Gnedin and Pitman (2005), and applications include fragmentation and coalescent theory [Bertoin (2006), McCullagh, Pitman and Winkel (2008), Goldschmidt, Martin and Spanò (2008)], excursion theory [Pitman (2003)], statistical physics [Berestycki and Pitman (2007)] and Bayesian nonparametric inference [Lijoi, Mena and Prünster (2005, 2007a, 2007b), Lijoi, Prünster and Walker (2008a)]. See Pitman (2006) for a comprehensive account. See also Griffiths and Spanò (2007), Lijoi, Prünster and Walker (2008b) and Ho, James and Lau (2007).

This paper introduces a class of infinite-dimensional diffusions associated with a different subclass of Gibbs-type partitions, induced by normalized inverse-Gaussian random probability measures. Such discrete distributions, recently investigated by Lijoi, Mena and Prünster (2005), are special cases of generalized gamma processes [Pitman (2003), Lijoi, Mena and Prünster (2007b)], and their intersection with two-parameter Poisson–Dirichlet models is given by the sole case , which corresponds to a normalized stable process with parameter [Kingman (1975)]. The class of diffusions studied in this paper is characterized in terms of the second order differential operator

acting on a dense sub-algebra of , the space of continuous functions on vanishing at infinity, for parameters , with , , and . By comparison with (2), it can be seen that the last two terms of (1) describe the time evolution of the frequencies of infinitely-many types. Common features between (2) and (1) are the variance–covariance terms and the structure of the drift or mutation terms . The distinctive feature with respect to (2) is given by the fact that the positive coefficient varies in time, and is driven by what is termed here -diversity diffusion, whose operator is given by the first two terms of (1). Equivalently, follows the stochastic differential equation

(5)

where is a standard Brownian motion. This can be seen as a particular instance of a continuous-time analog of the notion of -diversity, introduced by Pitman (2003) for Poisson–Kingman models, which include Gibbs-type partitions. An exchangeable random partition of is said to have -diversity if and only if there exists a random variable , with almost surely, such that, as ,

(6)

where is the number of classes of the partition of . The connection between (5) and (6) will become clear in Section 4, where the -diversity diffusion will be explicitly derived.

It is to be noted that (2) is not a special case of (1). Indeed, the only way of making constant is to impose null drift and volatility in (5), which implies . Hence, consistently with the above recalled relation between normalized inverse-Gaussian and Poisson–Dirichlet random measures, (2) and (1) share only the case . Nonetheless, an interesting connection between these classes of diffusions can be stated. In particular, it will be shown that performing the same conditioning operation in a pre-limit particle construction of normalized inverse-Gaussian diffusions yields a particular instance of the two-parameter model.

The paper is organized as follows. Section 2 recalls all relevant definitions, among which are Gibbs-type partitions, the associated generalized Pólya-urn scheme and random probability measures of generalized gamma and normalized inverse-Gaussian types. Section 3 derives some new results on generalized gamma processes which are crucial for the construction. These are concerned with the convergence of the number of species represented only once in the observed sample and with the second order approximation of the weights of the generalized Pólya-urn scheme associated with normalized inverse-Gaussian processes. In Section 4, by postulating simple population dynamics underlying the time change of the species frequencies, we derive the -diversity diffusion for the normalized inverse-Gaussian case, by means of a time-varying analog of (6) with the limit intended in distribution, and highlight its main properties. In Section 5 normalized inverse-Gaussian diffusions are characterized in terms of the operator (1), whose closure is shown to generate a Feller semigroup on , and the associated family of processes is shown to be the limit in distribution of certain Feller diffusions with finitely-many types. Section 6 provides a discrete representation of normalized inverse-Gaussian diffusions, which are obtained as limits in distribution of certain appropriately transformed Moran-type particle processes which model individuals explicitly, jointly with the varying population heterogeneity. Finally, Section 7 shows that conditioning on the -diversity process to be constant, that is, , in a pre-limit version of the particle construction yields, in the limit, the two-parameter model (2) with .

2 Preliminaries

The Poisson–Dirichlet distribution and its two parameter extension belong to the class of random discrete distributions induced by infinite partitions of Gibbs type, introduced by Gnedin and Pitman (2005). An exchangeable random partition of the set of natural numbers is said to have Gibbs form if for any and any such that , for , and , the law of the partition can be written as the product

(7)

Here ,

(8)

is the Pochhammer symbol and the coefficients satisfy the recursive equation

(9)

The law of an exchangeable partition is uniquely determined by the function , called the exchangeable partition probability function, which satisfies certain consistency conditions, which imply invariance under permutations of and coherent marginalization over the th item. Hence, the law of a Gibbs partition is uniquely determined by the family . Furthermore, a random discrete probability measure governing a sequence of exchangeable observations is said to be of a Gibbs type if it induces a partition which can be expressed as in (7). These have associated predictive distributions which generalize the Blackwell and MacQueen (1973) Pólya-urn scheme to

(10)

where is a nonatomic probability measure, are the distinct values observed in with absolute frequencies , and the coefficients and are given by

(11)

with as above. It will be of later use to note that integrating both sides of (2) yields

(12)

also obtained from (9) and (11). Examples of Gibbs-type random probability measures are the Dirichlet process [Ferguson (1973)], obtained, for example, from (2) by setting and in

(13)

the two-parameter Poisson–Dirichlet process [Pitman (1995, 1996)], obtained from (13) with and , the normalized stable process [Kingman (1975)], obtained from (13) with and , the normalized inverse-Gaussian process [Lijoi, Mena and Prünster (2005)] and the normalized generalized gamma process [Pitman (2003), Lijoi, Mena and Prünster (2007b)]. See also Gnedin (2010) for a Gibbs-type model with finitely-many types.

The normalized generalized gamma process is a random probability measure with representation

(14)

whose weights are obtained by means of the normalization

(15)

where are the points of a generalized gamma process, introduced by Brix (1999). This is obtained from a Poisson random process on with mean intensity

with and , so that if is the number of ’s which fall in , then is Poisson distributed with mean . Lijoi, Mena and Prünster (2007b) showed that a generalized gamma random measure defined via (14) and (15), denoted by , where with and , induces a random partition of Gibbs type with coefficients and in (2) given by

(16)

where denotes the upper incomplete gamma function

(17)

Special cases of a generalized gamma process with parameters are the Dirichlet process, obtained by letting and , the normalized stable process, obtained by setting , and the normalized inverse-Gaussian process, obtained by setting .

We conclude the section with a brief discussion of the interpretation of in the context of species sampling with Gibbs-type partitions. Suppose different species have been observed in the first samples from (2). The probability that a further sample is an already observed species is , but this mass is not allocated proportionally to the current frequencies. The ratio of probabilities assigned to any pair of species is

When , the probability of sampling species is proportional to the absolute frequency . However, since for , is increasing in , a value of reallocates some probability mass from type to type , so that, for example, for and we have for , respectively. Thus, has a reinforcement effect on those species that have higher frequency. See Lijoi, Mena and Prünster (2007b) for a more detailed treatment of this aspect.

3 Some results on generalized gamma random measures

In this section we investigate some properties of generalized gamma random measures which will be used in the subsequent constructions. In particular, these regard the convergence of the number of species represented only once in the observed sample, and the second order approximation of the weights of the generalized Pólya-urn scheme associated with normalized inverse-Gaussian processes.

Let be an -sized sample drawn from a generalized gamma process with parameters , let denote the number of distinct species observed in the sample, and let denote the vector of absolute frequencies associated with each observed species. The probability distribution of the random variable , for any , and frequencies such that , is provided by Lijoi, Mena and Prünster (2007b) and coincides with

(18)

where and are as in (8) and (17), respectively. Denote now by the number of species represented times in the sample. Then from equation 1.52 in Pitman (2006) it follows that the distribution of is given by

(19)

for any , and vector , where

The following proposition identifies the speed of convergence of the number of species represented once in the sample. Denote by the generalized factorial coefficient

(20)

where and . See Charalambides [(2005), Chapter 2] for a complete account.

Proposition 3.1

Under the normalized generalized gamma process with parameters , one has

(21)

Moreover,

(22)

where is a strictly positive and almost surely finite random variable with density function

with being the density of a positive stable random variable with parameter .

{pf}

Denote . From (3), for any one has

In particular, by using the definition of generalized factorial coefficient in terms of sum over the set of partitions [see Charalambides (2005), equation 2.62], we have

Therefore, we obtain

(23)

In order to obtain the distribution of the random variable , we can make use of the probability generating function of , denoted . From (3) we have

Therefore, the distribution of is given by

where the last identity is due to the fact that for any . Proposition 3 in Lijoi, Mena and Prünster (2007b) shows that

(24)

almost surely, where is an almost surely positive and finite random variable with density function

with being the density function of a positive stable random variable with parameter . In other terms, according to Definition 3.10 in Pitman (2006), an exchangeable partition of having EPPF (3) has -diversity . A simple application of Lemma 3.11 in Pitman (2006) leads to (22).

A second aspect of generalized gamma random measures we need to address for later use is the approximate behavior of the coefficients in the generalized Pólya urn (2). It is well known that the first order behavior of (16) is that of a normalized stable process, that is,

(25)

also implied by the next result. However, it turns out that for the definition of the diffusion processes which are the object of the next two sections, it is crucial to know the second order approximation. The following proposition, whose proof is deferred to the Appendix, identifies such behavior for the normalized inverse-Gaussian case .

Proposition 3.2

Let and be as in (16). When ,

and

(26)

where and .

4 Alpha-diversity processes

Making use of the results of the previous section, here we construct a one-dimensional diffusion process which can be seen as a dynamic version of the notion of -diversity, recalled in (6), relative to the case of normalized inverse-Gaussian random probability measures. Such diffusion, which will be crucial for the construction of Section 5, is obtained as weak limit of an appropriately rescaled random walk on the integers, whose dynamics are driven by an underlying population process. This is briefly outlined here and will be formalized in Section 6. Consider particles, denoted with for each , where is a Polish space, and denote by the number of distinct values observed in . Let the vector be updated at discrete times by replacing a uniformly chosen coordinate. Conditionally on , the incoming particle will be a copy of one still in the vector, after the removal, with probability , and will be a new value with probability , where and are as in (16) and is the value of after the removal. Denote by the chain which keeps track of the number of distinct types in . Then, letting be the number of clusters of size one in , which, by means of (22) and (24) is approximately for large , the transition probabilities for ,

are asymptotically equivalent to

(27)

for . That is, with probability a cluster of size one is selected and removed, with probability a new species appears and with probability a survivor has an offspring. Note that and are set to be barriers, to render the fact that equals 0 and  when equals 1 and , respectively.

The following theorem finds the conditions under which the rescaled chain converges to a diffusion process on . Here we provide a sketch of the proof with the aim of favoring the intuition. The formalization of the result is contained in the proof of Theorem 6.1, while that of the fact that the limiting diffusion is well defined, that is, the corresponding operator generates a Feller semigroup on an appropriate subspace of , is provided in Corollary 4.1 below.

Throughout the paper denotes the space of continuous functions from to , while denotes convergence in distribution.

Theorem 4.1

Let be a Markov chain with transition probabilities as in (27) determined by a generalized gamma process with and , and define to be such that . Let also be a diffusion process driven by the stochastic differential equation

(28)

where is a standard Brownian motion. If , then

(29)
{pf}

Let . From Proposition 3.2 we can write (27) as follows (for ease of presentation we use and in place of and since it is asymptotically equivalent):

The conditional expected increment of the process is

(30)

Similarly, the conditional second moment of the increment is

(31)

Since , and recalling that almost surely, we have

and

It is easy to check that all conditional th moments of converge to zero for , whence it follows by standard theory [cf., e.g., Karlin and Taylor (1981)] that, as , the process converges in distribution to a diffusion process on with drift and diffusion coefficient .

As anticipated, the second order approximation of is crucial for establishing the drift of the limiting diffusion, as the first order terms cancel. It is interesting to note that when , which yields the normalized stable case, the limiting diffusion reduces to the diffusion approximation of a critical Galton–Watson branching process, also known as the zero-drift Feller diffusion. See, for example, Ethier and Kurtz (1986), Theorem 9.1.3. This also holds approximately for high values of , in which case the drift becomes negligible.

(a)
(b)
(c)
Figure 1: Three sample paths of the random walk , with dynamics as in Theorem 4.1, starting from with , for parameter values: (a) , (b) , (c) . The figures show how influences the dynamic clustering structure in the population.

In order to have some heuristics on the behavior of the -diversity process, Figure 1 shows steps of the random walk with dynamics as in Theorem 4.1, starting from with . The three paths correspond to being equal to 0, 100 and 1000. It is apparent how influences the dynamic clustering structure in the population.

It is well known that when , the point 0 is an absorbing boundary for . The next result provides the boundary classification, using Feller’s terminology, for the case .

Proposition 4.1

Let be as in Theorem 4.1 with . Then the points 0 and are, respectively, an entrance and a natural boundary.

{pf}

The scale function for the process, defined as

(32)

where

and and denote drift and diffusion, equals