Alpha-diversity processes and normalized inverse-Gaussian diffusions
The infinitely-many-neutral-alleles model has recently been extended to a class of diffusion processes associated with Gibbs partitions of two-parameter Poisson–Dirichlet type. This paper introduces a family of infinite-dimensional diffusions associated with a different subclass of Gibbs partitions, induced by normalized inverse-Gaussian random probability measures. Such diffusions describe the evolution of the frequencies of infinitely-many types together with the dynamics of the time-varying mutation rate, which is driven by an -diversity diffusion. Constructed as a dynamic version, relative to this framework, of the corresponding notion for Gibbs partitions, the latter is explicitly derived from an underlying population model and shown to coincide, in a special case, with the diffusion approximation of a critical Galton–Watson branching process. The class of infinite-dimensional processes is characterized in terms of its infinitesimal generator on an appropriate domain, and shown to be the limit in distribution of a certain sequence of Feller diffusions with finitely-many types. Moreover, a discrete representation is provided by means of appropriately transformed Moran-type particle processes, where the particles are samples from a normalized inverse-Gaussian random probability measure. The relationship between the limit diffusion and the two-parameter model is also discussed.
10.1214/12-AAP846 \volume23 \issue1 2013 \firstpage386 \lastpage425
Normalized inverse-Gaussian diffusions
A]\fnmsMatteo \snmRuggiero\corref\thanksreft1label=e1]firstname.lastname@example.org, B]\fnmsStephen G. \snmWalkerlabel=e2]S.G.Walker@kent.ac.uk and A]\fnmsStefano \snmFavaro\thanksreft1label=e3]email@example.com
t1Supported by the European Research Council (ERC) through StG “N-BNP” 306406. Also affiliated with Collegio Carlo Alberto, Moncalieri, Italy.
class=AMS] \kwd60J60 \kwd60G57 \kwd92D25. Gibbs partitions \kwdPoisson–Dirichlet \kwdgeneralized gamma \kwdinfinitely-many-neutral-alleles model \kwdtime-varying mutation rate.
Considerable attention has been devoted recently to a class of diffusion processes which extends the infinitely-many-neutral-alleles model to the case of two parameters. This family takes values in the space
namely, the closure in of the infinite-dimensional ordered simplex, and is characterized, for constants and , by the second order differential operator
acting on a certain dense sub-algebra of the space of continuous functions on (throughout the paper denotes Kronecker delta). The diffusion with operator (2) describes the evolution of the allelic frequencies at a particular locus in a large population subject to random genetic drift and mutation, where mutation is jointly driven by the parameters . Ethier and Kurtz (1981) characterized the corresponding process when , whereas the two-parameter family was introduced by Petrov (2009) and further investigated by Ruggiero and Walker (2009) and Feng and Sun (2010). The latter is known to be stationary, reversible and ergodic with respect to the Poisson–Dirichlet distribution with parameters . This was introduced by Pitman (1995) [see also Pitman (1996) and Pitman and Yor (1997)] and extends the Poisson–Dirichlet distribution of Kingman (1975) as follows. Consider a random sequence obtained by means of the so-called stick-breaking scheme
where and . The vector is said to have the GEM distribution with parameters , while the vector of descending order statistics is said to have the Poisson–Dirichlet distribution with parameters . The latter is also the law of the ranked frequencies of an infinite partition induced by a two-parameter Poisson–Dirichlet random probability measure, which generalizes the Dirichlet process introduced by Ferguson (1973). Two-parameter Poisson–Dirichlet models have found applications in several fields. See, for example, the monographs by Bertoin (2006) for fragmentation and coalescent theory, Pitman (2006) for excursion theory and combinatorics, Teh and Jordan (2010) for machine learning, Lijoi and Prünster (2010) for Bayesian inference and Feng (2010) for population genetics. See also Bertoin (2008), Handa (2009) and Favaro et al. (2009).
The Poisson–Dirichlet distribution and its two parameter extension in turn belong to a larger class of random discrete distributions induced by infinite partitions of Gibbs type. These were introduced by Gnedin and Pitman (2005), and applications include fragmentation and coalescent theory [Bertoin (2006), McCullagh, Pitman and Winkel (2008), Goldschmidt, Martin and Spanò (2008)], excursion theory [Pitman (2003)], statistical physics [Berestycki and Pitman (2007)] and Bayesian nonparametric inference [Lijoi, Mena and Prünster (2005, 2007a, 2007b), Lijoi, Prünster and Walker (2008a)]. See Pitman (2006) for a comprehensive account. See also Griffiths and Spanò (2007), Lijoi, Prünster and Walker (2008b) and Ho, James and Lau (2007).
This paper introduces a class of infinite-dimensional diffusions associated with a different subclass of Gibbs-type partitions, induced by normalized inverse-Gaussian random probability measures. Such discrete distributions, recently investigated by Lijoi, Mena and Prünster (2005), are special cases of generalized gamma processes [Pitman (2003), Lijoi, Mena and Prünster (2007b)], and their intersection with two-parameter Poisson–Dirichlet models is given by the sole case , which corresponds to a normalized stable process with parameter [Kingman (1975)]. The class of diffusions studied in this paper is characterized in terms of the second order differential operator
acting on a dense sub-algebra of , the space of continuous functions on vanishing at infinity, for parameters , with , , and . By comparison with (2), it can be seen that the last two terms of (1) describe the time evolution of the frequencies of infinitely-many types. Common features between (2) and (1) are the variance–covariance terms and the structure of the drift or mutation terms . The distinctive feature with respect to (2) is given by the fact that the positive coefficient varies in time, and is driven by what is termed here -diversity diffusion, whose operator is given by the first two terms of (1). Equivalently, follows the stochastic differential equation
where is a standard Brownian motion. This can be seen as a particular instance of a continuous-time analog of the notion of -diversity, introduced by Pitman (2003) for Poisson–Kingman models, which include Gibbs-type partitions. An exchangeable random partition of is said to have -diversity if and only if there exists a random variable , with almost surely, such that, as ,
It is to be noted that (2) is not a special case of (1). Indeed, the only way of making constant is to impose null drift and volatility in (5), which implies . Hence, consistently with the above recalled relation between normalized inverse-Gaussian and Poisson–Dirichlet random measures, (2) and (1) share only the case . Nonetheless, an interesting connection between these classes of diffusions can be stated. In particular, it will be shown that performing the same conditioning operation in a pre-limit particle construction of normalized inverse-Gaussian diffusions yields a particular instance of the two-parameter model.
The paper is organized as follows. Section 2 recalls all relevant definitions, among which are Gibbs-type partitions, the associated generalized Pólya-urn scheme and random probability measures of generalized gamma and normalized inverse-Gaussian types. Section 3 derives some new results on generalized gamma processes which are crucial for the construction. These are concerned with the convergence of the number of species represented only once in the observed sample and with the second order approximation of the weights of the generalized Pólya-urn scheme associated with normalized inverse-Gaussian processes. In Section 4, by postulating simple population dynamics underlying the time change of the species frequencies, we derive the -diversity diffusion for the normalized inverse-Gaussian case, by means of a time-varying analog of (6) with the limit intended in distribution, and highlight its main properties. In Section 5 normalized inverse-Gaussian diffusions are characterized in terms of the operator (1), whose closure is shown to generate a Feller semigroup on , and the associated family of processes is shown to be the limit in distribution of certain Feller diffusions with finitely-many types. Section 6 provides a discrete representation of normalized inverse-Gaussian diffusions, which are obtained as limits in distribution of certain appropriately transformed Moran-type particle processes which model individuals explicitly, jointly with the varying population heterogeneity. Finally, Section 7 shows that conditioning on the -diversity process to be constant, that is, , in a pre-limit version of the particle construction yields, in the limit, the two-parameter model (2) with .
The Poisson–Dirichlet distribution and its two parameter extension belong to the class of random discrete distributions induced by infinite partitions of Gibbs type, introduced by Gnedin and Pitman (2005). An exchangeable random partition of the set of natural numbers is said to have Gibbs form if for any and any such that , for , and , the law of the partition can be written as the product
is the Pochhammer symbol and the coefficients satisfy the recursive equation
The law of an exchangeable partition is uniquely determined by the function , called the exchangeable partition probability function, which satisfies certain consistency conditions, which imply invariance under permutations of and coherent marginalization over the th item. Hence, the law of a Gibbs partition is uniquely determined by the family . Furthermore, a random discrete probability measure governing a sequence of exchangeable observations is said to be of a Gibbs type if it induces a partition which can be expressed as in (7). These have associated predictive distributions which generalize the Blackwell and MacQueen (1973) Pólya-urn scheme to
where is a nonatomic probability measure, are the distinct values observed in with absolute frequencies , and the coefficients and are given by
with as above. It will be of later use to note that integrating both sides of (2) yields
the two-parameter Poisson–Dirichlet process [Pitman (1995, 1996)], obtained from (13) with and , the normalized stable process [Kingman (1975)], obtained from (13) with and , the normalized inverse-Gaussian process [Lijoi, Mena and Prünster (2005)] and the normalized generalized gamma process [Pitman (2003), Lijoi, Mena and Prünster (2007b)]. See also Gnedin (2010) for a Gibbs-type model with finitely-many types.
The normalized generalized gamma process is a random probability measure with representation
whose weights are obtained by means of the normalization
where are the points of a generalized gamma process, introduced by Brix (1999). This is obtained from a Poisson random process on with mean intensity
with and , so that if is the number of ’s which fall in , then is Poisson distributed with mean . Lijoi, Mena and Prünster (2007b) showed that a generalized gamma random measure defined via (14) and (15), denoted by , where with and , induces a random partition of Gibbs type with coefficients and in (2) given by
where denotes the upper incomplete gamma function
Special cases of a generalized gamma process with parameters are the Dirichlet process, obtained by letting and , the normalized stable process, obtained by setting , and the normalized inverse-Gaussian process, obtained by setting .
We conclude the section with a brief discussion of the interpretation of in the context of species sampling with Gibbs-type partitions. Suppose different species have been observed in the first samples from (2). The probability that a further sample is an already observed species is , but this mass is not allocated proportionally to the current frequencies. The ratio of probabilities assigned to any pair of species is
When , the probability of sampling species is proportional to the absolute frequency . However, since for , is increasing in , a value of reallocates some probability mass from type to type , so that, for example, for and we have for , respectively. Thus, has a reinforcement effect on those species that have higher frequency. See Lijoi, Mena and Prünster (2007b) for a more detailed treatment of this aspect.
3 Some results on generalized gamma random measures
In this section we investigate some properties of generalized gamma random measures which will be used in the subsequent constructions. In particular, these regard the convergence of the number of species represented only once in the observed sample, and the second order approximation of the weights of the generalized Pólya-urn scheme associated with normalized inverse-Gaussian processes.
Let be an -sized sample drawn from a generalized gamma process with parameters , let denote the number of distinct species observed in the sample, and let denote the vector of absolute frequencies associated with each observed species. The probability distribution of the random variable , for any , and frequencies such that , is provided by Lijoi, Mena and Prünster (2007b) and coincides with
where and are as in (8) and (17), respectively. Denote now by the number of species represented times in the sample. Then from equation 1.52 in Pitman (2006) it follows that the distribution of is given by
for any , and vector , where
The following proposition identifies the speed of convergence of the number of species represented once in the sample. Denote by the generalized factorial coefficient
where and . See Charalambides [(2005), Chapter 2] for a complete account.
Under the normalized generalized gamma process with parameters , one has
where is a strictly positive and almost surely finite random variable with density function
with being the density of a positive stable random variable with parameter .
Denote . From (3), for any one has
In particular, by using the definition of generalized factorial coefficient in terms of sum over the set of partitions [see Charalambides (2005), equation 2.62], we have
Therefore, we obtain
In order to obtain the distribution of the random variable , we can make use of the probability generating function of , denoted . From (3) we have
Therefore, the distribution of is given by
where the last identity is due to the fact that for any . Proposition 3 in Lijoi, Mena and Prünster (2007b) shows that
almost surely, where is an almost surely positive and finite random variable with density function
with being the density function of a positive stable random variable with parameter . In other terms, according to Definition 3.10 in Pitman (2006), an exchangeable partition of having EPPF (3) has -diversity . A simple application of Lemma 3.11 in Pitman (2006) leads to (22).
A second aspect of generalized gamma random measures we need to address for later use is the approximate behavior of the coefficients in the generalized Pólya urn (2). It is well known that the first order behavior of (16) is that of a normalized stable process, that is,
also implied by the next result. However, it turns out that for the definition of the diffusion processes which are the object of the next two sections, it is crucial to know the second order approximation. The following proposition, whose proof is deferred to the Appendix, identifies such behavior for the normalized inverse-Gaussian case .
Let and be as in (16). When ,
where and .
4 Alpha-diversity processes
Making use of the results of the previous section, here we construct a one-dimensional diffusion process which can be seen as a dynamic version of the notion of -diversity, recalled in (6), relative to the case of normalized inverse-Gaussian random probability measures. Such diffusion, which will be crucial for the construction of Section 5, is obtained as weak limit of an appropriately rescaled random walk on the integers, whose dynamics are driven by an underlying population process. This is briefly outlined here and will be formalized in Section 6. Consider particles, denoted with for each , where is a Polish space, and denote by the number of distinct values observed in . Let the vector be updated at discrete times by replacing a uniformly chosen coordinate. Conditionally on , the incoming particle will be a copy of one still in the vector, after the removal, with probability , and will be a new value with probability , where and are as in (16) and is the value of after the removal. Denote by the chain which keeps track of the number of distinct types in . Then, letting be the number of clusters of size one in , which, by means of (22) and (24) is approximately for large , the transition probabilities for ,
are asymptotically equivalent to
for . That is, with probability a cluster of size one is selected and removed, with probability a new species appears and with probability a survivor has an offspring. Note that and are set to be barriers, to render the fact that equals 0 and when equals 1 and , respectively.
The following theorem finds the conditions under which the rescaled chain converges to a diffusion process on . Here we provide a sketch of the proof with the aim of favoring the intuition. The formalization of the result is contained in the proof of Theorem 6.1, while that of the fact that the limiting diffusion is well defined, that is, the corresponding operator generates a Feller semigroup on an appropriate subspace of , is provided in Corollary 4.1 below.
Throughout the paper denotes the space of continuous functions from to , while denotes convergence in distribution.
Let be a Markov chain with transition probabilities as in (27) determined by a generalized gamma process with and , and define to be such that . Let also be a diffusion process driven by the stochastic differential equation
where is a standard Brownian motion. If , then
The conditional expected increment of the process is
Similarly, the conditional second moment of the increment is
Since , and recalling that almost surely, we have
It is easy to check that all conditional th moments of converge to zero for , whence it follows by standard theory [cf., e.g., Karlin and Taylor (1981)] that, as , the process converges in distribution to a diffusion process on with drift and diffusion coefficient .
As anticipated, the second order approximation of is crucial for establishing the drift of the limiting diffusion, as the first order terms cancel. It is interesting to note that when , which yields the normalized stable case, the limiting diffusion reduces to the diffusion approximation of a critical Galton–Watson branching process, also known as the zero-drift Feller diffusion. See, for example, Ethier and Kurtz (1986), Theorem 9.1.3. This also holds approximately for high values of , in which case the drift becomes negligible.
In order to have some heuristics on the behavior of the -diversity process, Figure 1 shows steps of the random walk with dynamics as in Theorem 4.1, starting from with . The three paths correspond to being equal to 0, 100 and 1000. It is apparent how influences the dynamic clustering structure in the population.
It is well known that when , the point 0 is an absorbing boundary for . The next result provides the boundary classification, using Feller’s terminology, for the case .
Let be as in Theorem 4.1 with . Then the points 0 and are, respectively, an entrance and a natural boundary.
The scale function for the process, defined as
and and denote drift and diffusion, equals