Inference for partially observed multitype branching processes and ecological applications

Inference for partially observed multitype branching processes and ecological applications

\snmCatherine Larédo \thanksreft1 label=e1]catherine.laredo@jouy.inra.fr [    \snmOlivier David label=e2] olivier.david@jouy.inra.fr [    \snmAurélie Garnier‘ label=e3] au.garnier@gmail.com [ INRA, Jouy-en-Josas and PMA Universités Paris 6 & 7 Catherine Larédo,
UR341, Mathématiques et Informatique appliquées,
INRA, F-78350 Jouy-en-Josas, France
& CNRS, UMR7599, Probabilités et Modèles aléatoires,
Universités Paris 6 & 7, 75005 Paris France.
\printeade1
Olivier David,
UR341, Mathématiques et
Informatique appliquées,
INRA, F-78350 Jouy-en-Josas,France.
\printeade2
Aurélie Garnier
Université Paris Sud,
Laboratoire Ecologie, Sytématique & Evolution,
UMR8079, F-91405 Orsay, France;CNRS, 91405 0rsay, France; AgroParisTech, 75231 Paris, France.
\printeade3
Abstract

Multitype branching processes with immigration in one type are used to model the dynamics of stage-structured plant populations. Parametric inference is first carried out when count data of all types are observed. Statistical identifiability is proved together with derivation of consistent and asymptotically Gaussian estimators for all the parameters ruling the population dynamics model. However, for many ecological data, some stages (i.e. types) cannot be observed in practice. We study which mechanisms can still be estimated given the model and the data available in this context. Parametric inference is investigated in the case of Poisson distributions. We prove that identifiability holds for only a subset of the parameter set depending on the number of generations observed, together with consistent and asymptotic properties of estimators. Finally, simulations are performed to study the behaviour of the estimators when the model is no longer Poisson. Quite good results are obtained for a large class of models with distributions having mean and variance within the same order of magnitude, leading to some stability results with respect to the Poisson assumption.

[
\kwd
\startlocaldefs\endlocaldefs\runtitle

Partially observed multitype processes {aug} , ,

and

\thankstext

t1First supporter of the project

class=AMS] \kwd[Primary ]62M05, 62M09, 62P12 \kwd[; secondary ]92D40,60J85

Parametric inference, Incomplete data, Multitype branching processes, ecology.

1 Introduction

Understanding population dynamics requires models that admit the complexity of natural populations and the data ecologists can get from them. Thus analyzing ecological data raises questions ranging from modeling purposes to statistical inference. Among various methods, Leslie matrices or demographic matrix models are widely used for studying the dynamics of age or stage-structured populations (e.g. Caswell, 2001). These models are deterministic with noise added to introduce some variability. In many cases however and especially for small populations, the demographic stochasticity has to be taken into account; these models are too simple (Melbourne and Hastings, 2008) and can no longer be used, even adding stochasticity into the dynamics with random effects and covariates (Royle, 2004; Barry et al., 2003) or Bayesian approaches (Raftery et al., 1995; Gross et al., 2002; Clark and Bjornstad, 2004). For these reasons, we use here stochastic models to study small populations dynamics.

The starting point of this work is a three-year field survey of feral populations (i.e. populations escaped from crops) of an annual crop species (oilseed rape) that was carrried out in the center of France (Selommes, Loir-et-Cher; Garnier, 2006). Unlike cultivated oilseed rape, very few facts are known about the dynamics of feral oilseed rape populations. In this study, the dynamics is modeled by a multitype branching process (five types including vegetative and reproductive plant stages along with seeds in the soil seedbank) with immigration in one type (seeds). Data consisted in populations counts in each type, except the seeds that could not be observed. Three main difficulties occur when studying this demographic dataset. (1) A large number of populations () have been observed over a short period of time (); (2) only count data have been collected; (3) some types could not be observed by ecologists (here seeds). These characteristics are clearly not specific to this survey and are frequently met in data coming from Population Genetics and Ecology (see e.g. de Valpine, 2004 and the references therein). These data could be studied as longitudinal data, but for concerns about the dynamics, better insights can be obtained by means of mechanistic models describing it.

Branching processes have largely been studied (see Athreya and Ney, 1972 for a general presentation; Mode, 1971 for Multitype Branching Processes and Haccou et al., 2005; Kimmel and Axelrod, 2002; Mode and Sleeman, 2000 for applications in biology). Statistical inference has also been largely investigated (Hall and Heyde, 1980; Guttorp, 1991 for general branching processes; Wei and Winnicki, 1990; Winnicki, 1991 for branching processes with immigration; Bhat and Adke, 1981; Maaouia and Touati, 2005; Gonzalez et al., 2008 for multitype branching processes). However, the precise multitype branching process with immigration used here is a combination of the previous ones and moreover statistical inference for multitype branching processes is usually based on the following different observations (Maaouia and Touati, 2005; Gonzalez et al., 2008): the number of descendants of type coming from all the type individuals. We just observed the successive counts of ”individuals” of each type. This is more realistic assumption since this situation frequently occurs with datasets from field studies, and inference is studied here in this framework.

We are interested in the estimation of the parameters involved in the population dynamics from the incomplete observations of count data collected simultaneously in several populations. This is an “Incomplete Data model”, or “State Space model” (as defined for instance in (Cappé et al., 2005). It is also an inverse problem and a central theme in Ecology arising from its study is that parameters might not be identifiable knowing only the population dynamics (Wood, 1997). In practice, the inference based on such data is performed using various E.M. algorithms eventually coupled with Monte Carlo methods (Dempster et al., 1977; Kuhn and Lavielle, 2004; McLachlan and Krishnan, 2007; Sung and Geyer, 2007; Olsson et al., 2008), and Bayesian or Hierarchical Bayesian methods (Clark and Bjornstad, 2004; Buckland et al., 2004; Thomas et al., 2005). All these methods circumvent but cannot address the identifiability problem. However, identifiability is a prerequisite of statistical inference, and understanding the dynamics mechanisms strongly relies on how parameters are linked in the identifiability question. We propose here an integrated framework in order to analyze as accurately as possible the whole data set of the field survey. Introduction of covariates and a priori knowledge, errors coming from non exhaustive population samplings within some populations, use of various algorithms rely on this work and are studied in two companion papers (David et al., 2008; Garnier et al., 2008).

The paper is organized as follows. Section 2 contains the description of the population dynamics and preliminary results (Proposition 2.1). The statistical inference for complete observations is studied in Section 3. We first prove identifiability for all the parameters and derive consistent and asymptotically Gaussian estimators (Proposition 3.1). Since seeds are not observed in practice, the problem of unobserved types is addressed in Section 4. This is a non linear non Gaussian state space model. The associated three-dimensional stochastic process is no longer Markovian. The model with Poisson distributions provides a useful example with explicit computations. We obtain a closed form of the dependence of present observations on the whole past for the three-dimensional process (Theorem 4.1). A question concerns the statistical model identifiability : it is studied according to the number of observed generations. We characterize the parameter subset where identifiability holds (Theorem 4.2) and study the parametric inference (Proposition 4.2). Section 5 presents simulation results to study how the estimation performs with respect to deviations to the Poisson model. Detailed proofs are given in the Appendix.

2 Model and preliminary results

2.1 Dynamics of annual plants

We consider annual plants with the following life cycle. Seeds are released at the end of summer; they can either enter in a seed bank if buried or germinate in autumn.The emerged rosettes vernalize during winter, then bolt in spring and finally produce mature plants that shed seeds in summer and then die. Five developmental stages are considered: rosettes before winter , rosettes after winter (vernalized rosettes) , mature plants , seeds located in the soil seed bank (“old seeds”) , and seeds located on the soil surface (“new seeds”) . New seeds and old seeds are separated because they have different demographic parameters. Within each cycle, new seeds can enter these populations at the end of summer (immigration). There exist two sources of seed immigration, seeds from adjacent mature crops and seeds from spillage during seed transport (Crawley and Brown, 1995; Claessen et al., 2005a; Garnier et al., 2006; Pivard et al., 2008).


Figure 1: Schematic Dynamics of feral oilseed rape populations.

This model is quite general and for dynamical purposes only, it could be simplified considering just seeds and mature plants. However, our concern is different since we aim at estimating as many parameters as possible given both the model and the data available. Keeping a five-type model allows using all the data collected in the field survey and thus leads to the best description of the plant population dynamics for inference.

2.2 Notations and assumptions

Consider first one population. From now on, the term year corresponds to one life cycle. It starts with the birth of the new seeds and ends just before the birth of the new seeds of the next generation. All the variables are integer random variables indexed by . For year , denote by the number of ”old seeds”, the number of ”new seeds”, the number of rosettes before winter, the number of rosettes after winter and the number of mature plants. Six parameters describe these transitions:

(2.1)
(2.2)
(2.3)
(2.4)

Mature plants in produce “new seeds” according to the offspring distribution .

A number of ”new seeds” immigrate into the population at the beginning of year ; it is assumed to follow the distribution . Seeds in come from two sources, the ”old seeds” and the ”new seeds” . Denote by and these two quantities. Similarly, rosettes before winter come from “old seeds” in and “new seeds” in . Denote by and the rosettes coming from and . These variables satisfy :

(2.5)
(2.6)

Note that the probabilities of dying for stages are respectively , , , and that the probability of no offspring for a mature plant is .

Let us now detail the framework and assumptions used in the sequel. The field survey consisted of a large number of feral oilseed rape populations (around 300 in Garnier, 2006) observed over a short period of time ( = 2 or 3). These populations were isolated, so we assume here independence for these populations and the plant density was low in the surveyed populations so that density-dependence in plant survival and reproduction could be neglected (Pivard et al., 2008; Garnier, 2006). Moreover, we assume

Assumption 1.

Offspring distribution
All mature plants reproduce independently according to the same offspring distribution with expectation and variance .

Assumption 2.

There is no competition in survival and germination between seeds in the seed bank or “old seeds” and seeds on the soil or “new seeds” .

Assumption 3.

Immigration distribution
(i) Immigration is independent of seed bank seeds , offspring seeds and of previous years.
(ii) The random variables are independent and identically distributed according to with expectation and variance .

The above assumptions are twofold: “independence” and “identically distributed”. They do not have the same status : while releasing the “independence ” assumption is quite difficult, the “identically distributed ” assumption is done here for sake of clarity. The independence assumption in (A1) is justified because of the low plant density. There is no biological background for considering competition between the evolution of ”old seeds” and ”new seeds” leading to (A2). In the field survey of feral oilseed rape populations, seed immigration mainly occured from spillage during seed transport and from adjacent mature crops, leading to the independence assumptions in (A3) (Pivard et al., 2008; Garnier, 2006). Adding covariates or a priori information can easily be introduced within this framework, which amounts to remove the “identically distributed” assumption. This indeed has been done for the statistical analysis on the whole data set: using that many populations had been observed, covariates, a priori knowledge, and a dependence with respect to of the parameters defined in (2.2)-(2.4) have been added (Garnier et al., 2008). The framework detailed here is therefore quite general.

2.3 Preliminary results

From now on, time is associated with a complete cycle of a plant. Let us still consider one population. The model is a discrete time stochastic process with state space . Set

(2.7)

For , , denote the initial distribution and the conditional distribution of given :

(2.8)

For the Binomial distribution , we write ; Multinomial distributions on , are simplified omitting the last component, which leads for and ,

(2.9)

Let denote the convolution product of two distributions. Using these notations and (2.1), define the distributions

(2.10)
(2.11)
(2.12)
(2.13)

Using now definitions (2.7), (2.8) and notations (2.9)- (2.13), the following holds.

Proposition 2.1.

Under (A1), (A2), (A3), is a time homogeneous Markov chain on with initial distribution and transition probabilities satisfying

(2.14)
(2.15)

The process is also a multitype branching process with immigration in one type.

The last statement of Proposition 2.1 is immediate since, considering each stage of the plant as a type, each plant reproduces independently from the others according to the same offspring distribution with values in . However, the two types and have no offspring in the next generation, leading to a non positively regular multitype process as defined in Mode (1971) or Athreya and Ney (1972). The process is positively regular process but, for the reasons stated in 2.1, we prefer keeping here the five-dimensional process .

The proof of Proposition 2.1 is given in Appendix A.1.

3 Likelihood and inference for complete observations

We first study the case when all types are observed.

3.1 Notations and statistical framework

We assume in the sequel that the initial distribution of , the offspring distribution and the immigration belong to the parametric families :
- distribution of :
- offspring : with mean and variance ;
- immigration : with mean and variance .

Let us denote by (resp. ) an arbitrary value (resp. the true value) of the parameter and by the parameter set. We assume :

Assumption 4.

compact set of and .

For the th population, is the Markov chain describing the population dynamics and are the observations at time . In order to simplify the expressions for the likelihood, we consider here that , are observed up to time . This has no consequence since seeds are not observed in practice. Hence we denote,

(3.1)
(3.2)

The processes are repetitions of . Joining all the populations, define :

(3.3)

Let (resp. ) be the initial distribution (resp. the transition probabilities) associated with parameter and the probability distribution of on the canonical space and the expectation w.r.t. .

3.2 Likelihood

Computing the likelihood of a Markov process with transition probabilities is classical : for population , it has for expression, using (2.14),

(3.4)

Joining the observations from the independent populations, we obtain using Proposition 2.1, (2.10) - (2.13)

(3.5)

Reordering the terms of (3.4) and (3.5) according to the parameters yields,

(3.6)

The first term deals with the initial distribution of

(3.7)

Gather in the second term the transition from seeds to rosettes .

(3.8)

The next two terms contain the transitions from to and to , they write

(3.9)
(3.10)

where the two terms and only depend on the observations. Set in the next term the transition from to ,

(3.11)

The last term concerns the seeds :

(3.12)

Joining the two terms containing yields

(3.13)

The terms and depend on disjoint sets for the parameters. Hence, maximizing the loglikelihood can be performed maximizing separately these five terms.

Remark 3.1.

Usually, statistical inference for Stochastic Processes is investigated in the asymptotic framework small (mostly ) and large (leading to asymptotics results ). Here, we have that is small and large (e.g. magnitude 300). This situation often occurs in Ecology (de Valpine, 2004).

3.3 Study of maximum likelihood and other estimators

This is a sample of i.i.d. random variables, each variable being a part of a branching process path. We have to use simultaneously the repetitions and the Markov structure to estimate the parameters. So, deriving the properties of the statistical model is not standard. The various terms of the loglikelihood are associated with different parametric inference problems.

The first term deals with the estimation of based on a sample of i.i.d. random variables with distribution on , which is standard. The terms and are related to the estimation of parameters . Denote by the maximum likelihood estimators (MLE) obtained maximizing and . They depend on the successive observations (resp. ) for ; and are explicit (see Appendix A.2).

Parameters are only present in defined in (3.13). Maximum likelihood estimators for can be defined maximizing . To prove identifiability and consistency, we have rather consider here conditional least squares (CLS) estimators. Conditionally on , the marginal distribution of (resp. ) is the sum of the two independent distributions, and (resp. and ). Therefore, we can define the CLS estimators and minimizing the Conditional Least Squares:

(3.14)
(3.15)

The remaining term is since we are concerned by the estimation of and . This is the only part of the likelihood associated with the branching mechanism. This likelihood containing the convolution product is untractable, and methods based on conditional least squares or weighted conditional least squares are used for branching processes (Hall and Heyde, 1980; Guttorp, 1991; Wei and Winnicki, 1990) leading to moment estimations of and . Noting that , we can just consider for the estimation of and , the CLS process,

(3.16)

Let be the CLS estimators minimizing (3.16). All the above estimators are explicitly defined in Appendix A.2 and we can state :

Proposition 3.1.

Assume (A1), (A2), (A3) and (A4). Then, under , all the parameters are identifiable and, as ,
(i) are consistent and asymptotically Gaussian at rate ;
(ii) , , , are asymptotically independent, with explicit covariance matrix given in Appendix A.2.

Let us stress that, before studying in detail this inference problem, it was difficult to assert the classical properties stated in Proposition 3.1. Adding immigration could lead to non identifiability or estimating problems for and . Moreover, maximum likelihood estimators, for multitype branching processes, are based on the observations of , i.e. offspring of type from type parents (see Guttorp, 1991; Gonzalez et al., 2008; Maaouia and Touati, 2005). We did not require this information for the inference and just used the counts of individuals of each type in successive generations. Hence, getting identifiability and consistency for the parameters is the only difficulty here. Other properties are classical but requires using the exact structure of the data provided the regularity of the statistical model.

Remark 3.2.

Estimating additional moments of and can be performed similarly using other functionals than Conditional Least Squares (see Winnicki, 1991 for the variance estimation of and ).

The proof is given in Appendix A.2.

4 Incomplete model study in the Poisson case

The set-up is now different: only are observed while and are unobserved. Clearly, algorithms simulating the missing data given the model and the parameters at each step can be used to get estimation. This approach is complementary to our concern that aims at understanding which mechanisms can be estimated. For this, we have to study the process for . It is a discrete time stochastic process, which is no longer Markov : the distribution of given the past now depends on the whole past and not only on . This appears explicitely later on. This is similar to problems encountered when studying Hidden Markov Models (Genon-Catalot et al., 2003; Genon-Catalot and Larédo, 2006; Cappé et al., 2005). For a first approach, we restrict our attention to a very informative example, the case of Poisson distributions, which leads to explicit computations.

4.1 Probabilistic properties in the Poisson case

Let us specify all the distributions appearing in the populations dynamics.

Assumption 5.

The offspring distribution is Poisson , the immigration distribution is Poisson .

Assumption 6.

The variables and are independent and distributed according to Poisson distributions: and .

For Poisson distributions, we denote . Recall a property of Multinomial and Poisson distributions.

Lemma 4.1.

Assume that N is a random variable distributed according to a Poisson distribution and that is a l-dimensional random variable such that the conditional distribution of given is a Multinomial distribution with . Then, the random variables are independent and verify .

First consider one population and omit the index in what follows. Set

(4.1)

Clearly, is the information available up to time . To state the main result of this section, define the three sequences of measurable random variables:

(4.2)
(4.3)
(4.4)

Then, the following holds :

Theorem 4.1.

Under Assumptions (A1)-(A6), the initial distribution of , and the conditional distribution satisfy, using (4.1) and definitions (2.13),(4.4), for

(4.5)
(4.6)

The explicit dependence of on the whole past appears more simply with the following expression for ,

(4.7)
(4.8)
(4.9)
(4.10)

The result of Theorem 4.1 is a consequence of the proposition stated below.

Proposition 4.1.

Under Assumptions (A1)-(A6), the random variables , are conditionally independent given , and their conditional distributions satisfy, using the random variables and defined in (4.2), (4.3),

(4.11)

Let us prove Theorem 4.1 assuming Proposition 4.1. We just have to check the expression of the conditional distribution of . By (2.5), we have . Using Proposition 2.1 and (2.12), the distribution of conditionnally on is equal to . Applying Proposition 4.1, and are conditionally independent given and distributed according to two independent Poisson distributions. Hence, another application of Lemma 4.1 yields that the conditional distribution of given is , which is (4.4).

The proof of Proposition 4.1 is given in Appendix A.3.

4.2 Likelihood of the incomplete observations

The inference is now based on the observations of recorded up to time for independent populations. Denote by the process describing its dynamics in population and the observations at time . We set,

(4.12)

Observations up to time are denoted

(4.13)

Let us first compute the likelihood for one population, population . Successive conditionnings yield

(4.14)

Contrary to the previous section, each term of this product depends on and on the observations up to time . Theorem 4.1 gives the expression of these conditional probabilities.

Since the random variables now depend on and on the past, we define , and for population ,

(4.15)
(4.16)

Joining the observations in the populations, the likelihood writes, using notations (4.12), (4.13), (4.16),

(4.17)

The log-likelihood splits into three terms,

(4.18)

where, using Assumptions (A5), (A6) and Theorem 4.1,

(4.19)
(4.20)

Hence, estimating parameters is exactly the same as in the previous section; their inference is omitted in the sequel.

4.3 Parametric inference

It remains to study the estimation of the parameter . As before, let be true value of the parameter.

We first have to investigate which parameters are identifiable when only these incomplete observations are available. By identifiability, we mean here identifiability of a statistical model :

Recall that denotes the time index of a plant lifecycle and that we consider populations recorded up to time ( corresponding here to the observation of one complete cycle). Using now the definitions of the terms given in (4.7), the following holds :

Theorem 4.2.

Assume (A1), (A2), (A3), (A4), (A5), (A6). Then,

  1. if , only is identifiable;

  2. if , is identifiable;

  3. if , , is identifiable;

  4. if , and , then is identifiable;
    if and , then .

Note that identifiability of additional parameters cannot be gained increasing beyond 3. Larger values of result in improving the asymptotic variance for the estimation of .

Remark 4.1.

Stating the above theorem for the first values of is unusual. However, in the field survey of feral oilseed rape populations, observations unfortunately had been collected up to , leading to the unability of estimating , the annual survival rate in the seed bank, which is a parameter of much concern in Ecology.


Assume now that , and let us study the inference of the identifiable parameters. Let us denote by (resp. ) an arbitrary (resp. the true value) of the paramete, by the parameter set and asssume

Assumption 7.

is a compact set of ; and

Using (4.19), define the maximum likelihood estimator as a solution of

(4.21)

Under Assumption (A7), the function is a.s. twice differentiable on , and we can define, for , the matrix,

Proposition 4.2.

Assume (A1)-(A3), (A5), (A6) and (A7). Then is strongly consistent. If moreover the matrix is invertible, then

Remark 4.2.

The matrix that appears in the asymptotic variance of can be estimated, using the explicit expressions for the derivatives of , by the empirical estimate for ,

The proofs of Theorem 4.2 and Proposition 4.2 are given in Appendix A.4.A.5.

5 Simulation study

In the case of incomplete observations, the estimators strongly rely on the assumption that both the offspring and the immigration distributions are Poisson distributions. We investigate in this section how the estimation behaves when these distributions are no longer Poisson.

5.1 Methods

We considered in these simulations that the offspring distribution