Statistical estimation of a growth-fragmentation model observed on a genealogical tree
We raise the issue of estimating the division rate for a growing and dividing population modelled by a piecewise deterministic Markov branching tree. Such models have broad applications, ranging from TCP/IP window size protocol to bacterial growth. Here, the individuals split into two offsprings at a division rate that depends on their size , whereas their size grow exponentially in time, at a rate that exhibits variability. The mean empirical measure of the model satisfies a growth-fragmentation type equation, and we bridge the deterministic and probabilistic viewpoints. We then construct a nonparametric estimator of the division rate based on the observation of the population over different sampling schemes of size on the genealogical tree. Our estimator nearly achieves the rate in squared-loss error asymptotically, generalizing and improving on the rate obtained in [DHRR, DPZ] through indirect observation schemes. Our method is consistently tested numerically and implemented on Escherichia coli data, which demonstrates its major interest for practical applications.
Keywords: Growth-fragmentation, cell division equation, Nonparametric estimation, Markov chain on a tree.
Mathematical Subject Classification: 35A05, 35B40, 45C05, 45K05, 82D60, 92D25, 62G05, 62G20.
1.1 Size-structured models and their inference
Growth-fragmentation models and structured population equations describe the temporal evolution of a population characterised by state variables such as age, size, growth, maturity, protein content and so on - see [Metz, Pe] and the references therein. This field continues to grow in interest as its applications appear to be wider and wider, ranging from the internet TCP/IP window size protocol [Baccelli] to neuronal activity [PPS3], protein polymerization [Engler], cell division cycle [Banks1], phase transitions [NP4], parasite proliferation [Bansaye] etc.
In order to quantitatively fit experimental observations and thus validate the relevance of the models, developing new and well-adapted statistical methods appears to be one of the major challenge for the coming years. A paradigmatic example, which can serve both as a toy model and a proof of concept for the methodology we develop here, is given by the growth-fragmentation or size-structured cell division equation [PR]. When applied to the evolution of a bacterial population it reads
and it quantifies the concentration of individuals (cells) having size (the state variable) at time . A common stochastic mechanism for every single cell is attached to Equation (1):
The size of a cell at time evolves exponentially according to the deterministic evolution , where is the growth rate of each cell, that quantifies their ability to ingest a common nutrient.
Each cell splits into two offsprings according to a division rate that depends on its current size .
At division, a cell of size gives birth to two offsprings of size each, what is called binary fission.
Model (1) is thus entirely determined by the parameters . Typically, the growth rate is assumed to be known or guessed [DMZ], and thus inference about (1) mainly concerns the estimation of the division rate that has to be taken from a nonparametric perspective.
By use of the general relative entropy principle, Michel, Mischler and Perthame showed that the approximation is valid [MMP], with , and where is the dominant eigenpair related to the corresponding eigenvalue problem, see [M1, Pe, DG, LP, CCM, BCG]. The “stationary” density of typical cells after some time has elapsed enables to recover for a compact by means of the regularisation of an inverse problem of ill-posedness degree 1. From a deterministic perspective, this is carried out in [PZ, DPZ, DT]. From a statistical inference perspective, if an -sample of the distribution is observed and if has smoothness in a Sobolev sense, it is proved in [DHRR] that can be recovered in squared-error loss over compact sets with a rate of convergence . Both deterministic and stochastic methodology of [DPZ] and [DHRR] are motivated by experimental designs and data such as in [Kubitschek, DMZ]. However, they do not take into account the following two important aspects:
Bacterial growth exhibits variations in the individual growth rate as demonstrated for instance in [Sturm]. One would like to incorporate variability in the growth rate within the system at the level of a single cell. This requires to modify Model (1).
Recent evolution of experimental technology enables to track the whole genealogy of cell populations (along prescribed lines of descendants for instance), affording the observation of other state variables such as size at division, lifetime of a single individual and so on [Robert]. Making the best possible use of such measures is of great potential impact, and needs a complementary approach.
The availability of observation schemes at the level of cell individuals suggests an enhancement of the statistical inference of , enabling to improve on the rates of convergence obtained by indirect measurements such as in [DHRR, DPZ]. This is the purpose of the present paper. We focus on bacterial growth, for which we apply our method on experimental observations. This serves as a proof of concept for the relevance of our modelling and statistical methodology, which could adapt to other application fields and growth-fragmentation types.
1.2 Results of the paper
denote the binary genealogical tree (with ). We identify each node with a cell that has a size at birth and a lifetime . In the paper, we consider the problem of estimating over compact sets of . Our inference procedure is based on the observation of
where denotes a connected subset of size containing the root . Asymptotics are taken as . Two important observation schemes are considered: the sparse tree case, when we follow the system along a given branch with individuals, and the full tree case, where we follow the evolution of the whole binary tree up to the -th generation, with . In this setting, we are able to generalise Model (1) and allow the growth rate to vary with each cell . We assume that a given cell has a random growth rate (later constrained to live on a compact set). Moreover, this value is inherited from the growth rate of its parent according to a distribution . Since a cell splits into two offsprings of the same size, letting denote the parent of , we have the fundamental relationship
that enables to recover the growth rate of each individual in since is connected by assumption, possibly leaving out the last generation of observed individuals, but this has asymptotically no effect on a large sample size approach.
Variability in growth rate
In the case where the growth rate can vary for each cell, the density of cells of size at time does not follow Eq. (1) anymore and an extended framework needs to be considered. To that end, we structure the system with an additional variable which represents the growth rate and depends on each individual cell . We construct in Section 2 a branching Markov chain that incorporates variability for the growth rate in the mechanism described in Section 1.1. Equivalently to the genealogical tree, the system may be described in continuous time by a piecewise deterministic Markov process
which models the process of sizes and growth rates of the living particles in the system at time , with value in , where is the state space of size times growth rate. Stochastic systems of this kind that correspond to branching Markov chains are fairly well known, both from a theoretical angle and in applications; a selected list of contributions is [BDMV, Cloez, MT] and the references therein.
By fragmentation techniques inspired by Bertoin [bertoin], see also Haas [Haas], we relate the process to a growth-fragmentation equation as follows. Define
as the expectation of the empirical measure of the process over smooth test functions defined on . We prove in Theorem 1 that, under appropriate regularity conditions, the measure that we identify with the temporal evolution of the density of cells having size and growth rate at time is governed (in a weak sense111 For every , we actually have a Radon measure on : If is a function defined on , we define whenever the integral is meaningful. Thus (4) has the following sense: for every sufficiently smooth test function with compact support in , we have ) by
If we assume a constant growth rate , we then take (where denotes the Dirac mass) and we retrieve the standard growth-fragmentation equation (1). The proof of Theorem 1 is obtained via a so-called many-to-one formula, established in Proposition 3 in Section 5.1. Indeed, thanks to the branching property of the system, it is possible to relate the behaviour of additive functionals like the mean empirical measure to the behaviour of a so-called tagged cell (like a tagged fragment in fragmentation process), that consists in following the behaviour of a single line of descendants along a branch where each node is picked at random, according to a uniform distribution. This approach, inspired by fragmentation techniques, is quite specific to our model and enables to obtain a relatively direct proof of Theorem 4.
Nonparametric estimation of the growth rate
In Section 3 we take over the problem of estimating for some compact . We assume we have data of the form (2), and that the mean evolution of the system is governed by (4). The growth rate kernel is unknown and treated as a nuisance parameter. A fundamental object is the transition kernel
of the size and growth rate distribution at the birth of a descendant , given the size of birth and growth rate of its parent . We define in Section 3.3 a class of division rates and growth rate kernels such that if belongs to this class, then the transition is geometrically ergodic and has a unique invariant measure . From the invariant measure equation
we obtain in Proposition 2 the explicit representation
where denotes the first marginal of the invariant distribution . A strategy for constructing and estimator consists in replacing the right-hand size of (5) by its empirical counterpart, the numerator being estimated via a kernel estimator of the first maginal of . Under local Hölder smoothness assumption on of order , we prove in Theorem 2 that for a suitable choice of bandwidth in the estimation of the invariant density, our estimator achieves the rate in squared-error loss over appropriate compact sets , up to an inessential logarithmic term when the full tree observation scheme is considered. We see in particular that we improve on the rate obtained in [DHRR]. Our result quantifies the improvement obtained when estimating from data , as opposed to overall measurements of the system after some time has elapsed as in [DHRR]. We provide a quantitative argument based on the analysis of a PDE that explains the reduction of ill-posedness achieved by our method over [DHRR] in Section 4.2.
In order to obtain the upper bound of Theorem 2, a major technical difficulty is that we need to establish uniform rates of convergence of the empirical counterparts to their limits in the numerator and denominator of (5) when the data are spread along a binary tree. This can be done via covariance inequalities that exploit the fact that the transition is geometrically ergodic (Proposition 4) using standard Markov techniques, see [MT, B]. The associated chain is however not reversible, and this yields an extraneous difficulty: the decay of the correlations between and for two nodes are expressed in terms of the sup-norm of , whenever is dominated by a certain Lyapunov function for the transition . However, the typical functions we use are kernels that depend on and that are not uniformly bounded in sup-norm as . This partly explains the relative length of the technical Sections 5.5 and 5.6.
1.3 Organisation of the paper
In Section 2, we construct the model of sizes and growth rates of the cells as a Markov chain along the genealogical tree. The discrete model can be embedded into a continuous time piecewise deterministic Markov process of sizes and growth rates of the cells present at any time within the system. In Theorem 1 we explicit the relation between the mean empirical measure of and the growth-fragmentation type equation 4. In Section 3, we explicitly construct an estimator of by means of the representation given by (5) in Section 3.2. Two observation schemes are considered and discussed in Section 3.1, whether we consider data along a single branch (the sparse tree case) or along the whole genealogy (the full tree case). The specific assumptions and the class of admissible division rates and growth rate kernels are discussed in Section 3.3, and an upper bound for in squared-error loss is given in our main Theorem 2. Section 4 shows and discusses the numerical implementation of our method on simulated data. In particular, ignoring the variability in the reconstruction dramatically deterioriates the accuracy of estimation of . We also explain from a deterministic point perspective the rate improvement of our method compared with [DHRR] by means of a PDE analysis argument in Section 4.2. The parameters are inspired from real data experiments on Escherichia coli cell cultures. Section 5 is devoted to the proofs.
2 A Markov model on a tree
2.1 The genealogical construction
Recall that (with ) denotes the infinite binary genealogical tree. Each node is identified with a cell of the population and has a mark
where is the size at birth, the growth rate, the birthtime and the lifetime of . The evolution of the size of during its lifetime is governed by
Each cell splits into two offsprings of the same size according to a division rate for . Equivalently
At division, a cell splits into two offsprings of the same size. If denotes the parent of , we thus have
Finally, the growth rate of is inherited from its parent according to a Markov kernel
where and is a probability measure on for each . Eq. (6), (7), (8) and (9) completely determine the dynamics of the model , as a Markov chain on a tree, given an additional initial condition on the root. The chain is embedded into a piecewise deterministic continuous Markov process thanks to (6) by setting
and otherwise. Define
as the process of sizes and growth rates of the living particles in the system at time . We have an identity between point measures
where denotes the Dirac mass. In the sequel, the following basic assumption is in force.
Assumption 1 (Basic assumption on and ).
The division rate is continuous. We have and . The Markov kernel is defined on a compact set .
Work under Assumption 1. The law of
is well-defined on an appropriate probability space with almost-surely no accumulation of jumps.
If is a probability measure on the state space , we shall denote indifferently by the law of any of the three processes above where the root has distribution . The construction is classical (see for instance [bertoin] and the references therein) and is outlined in Appendix 6.1.
2.2 Behaviour of the mean empirical measure
Denote by the set of real-valued test functions with compact support in the interior of .
Theorem 1 (Behaviour of the mean empirical measure).
Work under Assumption 1. Let be a probability distribution on . Define the distribution by
Then solves (in a weak sense)
with initial condition .
Theorem 1 somehow legitimates our methodology: by enabling each cell to have its own growth rate and by building-up new statistical estimators in this context, we still have a translation in terms of the approach in [DPZ]. In particular, we will be able to compare our estimation results with [DHRR]. Our proof is based on fragmentation techniques, inspired by Bertoin [bertoin] and Haas [Haas]. Alternative approaches to the same kind of questions include the probabilistic studies of Chauvin et al. [CRW], Bansaye et al. [BDMV] or Harris and Roberts [HR] and references therein.
3 Statistical estimation of the division rate
3.1 Two observation schemes
Let denote a subset of size of connected nodes: if belongs to , so does its parent . We look for a nonparametric estimator of the division rate
Statistical inference is based on the observation scheme
and asymptotic study is undertaken as the population size of the sample . We are interested in two specific observation schemes.
The full tree case.
We observe every pair over the first generations of the tree:
with the notation if , and is chosen such that that has order . ∎
The sparse tree case.
We follow the first offsprings of a single cell, along a fixed line of descendants. This means that for some with , we observe every size and growth rate of each node , , and so on up to a final node . ∎
For every , we tacitly assume that there exists a (random) time almost surely, such that for , the observation scheme is well-defined. This is a consequence of the behaviour of near infinity that we impose later on in (17) below.
3.2 Estimation of the division rate
Identification of the division rate
We denote by an element of the state space . Introduce the transition kernel
of the size and growth rate distribution at the birth of a descendant , given the size at birth and growth rate of its parent . From (7), we infer that is equal to
Using formula (8), by a simple change of variables
Incorporating (9), we obtain an explicit formula for
Assume further that admits an invariant probability measure , i.e. a solution to
denotes the left action of positive measures on for the transition .
We exhibit below a class of division rates and growth rate kernels that guarantees the existence of such an invariant probability measure.
Construction of a nonparametric estimator
Inverting (13) and applying an appropriate change of variables, we obtain
provided the denominator is positive. This formula has no easy interpretation: it is obtained by some clever manipulation of the equation . A tentative interpretation in the simplest case with no variability (so that for some fixed and for every is proposed in Section 4.2. Representation (14) also suggests an estimation procedure, replacing the marginal density and the expectation in the denominator by their empirical counterparts. To that end, pick a kernel function
and set for and . Our estimator is defined as
where is a threshold that ensures that the estimator is well defined in all cases and . Thus is specified by the choice of the kernel , the bandwidth and the threshold .
The function is bounded with compact support, and for some integer , we have
3.3 Error estimates
We assess the quality of in squared-loss error over compact intervals . We need to specify local smoothness properties of over , together with general properties that ensure that the empirical measurements in (15) converge with an appropriate speed of convergence. This amounts to impose an appropriate behaviour of near the origin and infinity.
For such that and a vector of positive constants , introduce the class of continuous functions such that
Similar conditions on the behaviour of can also be found in [DG], in a deterministic setting.
Assumption 1 is satisfied as soon as (and is defined on a compact of course).
Let be two positive finite measures on such that is a positive measure and . We define as the class of Markov transitions on such that
Control (18) ensures the geometric ergodicity of the process of variability in the growth rate.
Let us be given in the sequel a vector of positive constants and such that . We introduce the Lyapunov function
The function controls the rate of the geometric ergodicity of the chain with transition and will appear in the proof of Proposition 4 below. Define
Assumption 3 (The sparse tree case).
Let . We have .
In the case of the full tree observation scheme, we will need more stringent (and technical) conditions on . Let denote the spectral radius of the operator acting on the Banach space of functions such that
where is defined in (19) above.
Assumption 4 (The full tree case).
We have and moreover
It is possible to obtain bounds on so that (20) holds, by using explicit (yet intricate) bounds on following Fort et al. or [FMP], Douc et al. [DMR], see also Baxendale [B].
Rate of convergence
We are ready to state our main result. For , with , and an integer, introduce the Hölder space of functions possessing a derivative of order that satisfies
The minimal constant such that (21) holds defines a semi-norm . We equip the space with the norm
and the Hölder balls
For every , there exist and such that for every and every compact interval such that , we have
where the supremum is taken over
and denotes expectation with respect to any initial distribution for on such that .
Several remarks are in order: 1) Since is arbitrary, we obtain the classical rate (up to a log term) which is optimal in a minimax sense for density estimation. It is presumably optimal in our context, using for instance classical techniques for nonparametric estimation lower bounds on functions of transition densities of Markov chains, see for instance [GHR]. 2) The extra logarithmic term is due to technical reasons: we need it in order to control the decay of correlations of the observations over the full tree structure. 3) The knowledge of the smoothness that is needed for the construction of is not realistic in practice. An adaptive estimator could be obtained by using a data-driven bandwidth in the estimation of the invariant density in (15). The Goldenschluger-Lepski bandwidth selection method [GL2], see also [DHRR] would presumably yield adaptation, but checking the assumptions still requires a proof in our setting. We implement data-driven bandwidth in the numerical Section 4 below.
4 Numerical implementation
4.1 Protocol and results
Generating simulated data
Given a division rate , a growth rate kernel , an initial distribution for the node (as in Theorem 2) and a dataset size , we simulate the full tree and the sparse tree schemes recursively:
Given we select at random its lifetime (by a rejection sampling algorithm) with probability density
following the computations of Section 3.2.
We derive the size at birth for the two offsprings (with and with obvious notation) by Formula (8).
We simulate at random the growth rates (by the rejection sampling algorithm) according to the distribution
For the sparse tree case, we select only one offspring (either of ), whereas we keep both for the full tree case.
In order to stay in line with previous simulations of [DHRR] we pick . We fix as the uniform distribution over , with . As for the growth rate kernel, we implement
where is a uniform distribution over for some , and dilated by a scaling factor so that . We also condition the values of to stay in (by rejection sampling).
We implement using Formula (15). We pick a standard Gaussian kernel , for which in Assumption (2); henceforth we expect a rate of convergence of order at best. We evaluate on a regular grid with and . Thus is large enough so that becomes negligible for and is smaller than to avoid numerical discrepancies. For tractability purposes, we wish to avoid the use of any relationship between the nodes . Indeed, whereas it is quite easy to label and in the sparse tree case, it is a bit more difficult to track the parent of each individual in the full tree case if we do not want to double the memory. As a consequence, we simply reformulate (15) into
We take for the bandwidth according to Theorem 2 to serve as a proof of concept. Data-driven choices could of course be made, such as the Goldenschluger and Lepski’s method [GL2, DHRR], and improve the already fairly good results shown in Figure 2. Finally, we also test whether taking into account variability in the growth rate improves significantly or not the estimate of replacing by its mean value everywhere in Formula (22), thus ignoring growth variability in that case.
We display our numerical results as specified above in Figures 1, 2 and 3. Figure 1 displays the reconstruction of on the full tree scheme for a simulated sample of size . At a visual level, we see that the estimation deteriorates dramatically when the variability is ignored in the region where is small, while our estimator (22) still shows good performances.
In Figure 2, we plot on a log-log scale the empirical mean error of our estimation procedure for both full tree and sparse tree schemes. The numerical results agree with the theory. The empirical error is computed as follows: we compute
where denotes the discrete norm over the numerical sampling described above, conditioned on the fact that the denominator in (22) is larger than . We end up with a mean-empirical error . The number of Monte-Carlo samples is chosen as . In Figure 3, we explore further the degradation of the estimation process on the region where is small, plotting confidence intervals of the empirical distribution of the estimates, based on Monte-Carlo samples. Finally, Table 1 displays the relative error for the reconstruction of according to (23). The standard deviation is computed as .
We also carried out control experiments for other choices of variability kernel for the growth rate. These include , so that the variability of an individual is not inherited from its parent, a Gaussian density for with the same prescription for the mean and the variance as in the uniform case, conditioned to live on . We also tested the absence of variability, with , with . None of these control experiments show any significant difference from the case displayed in Figures 1, 2 and 3.
Analysis on E. coli data
Finally, we analyse a dataset obtained through microscopic time-lapse imaging of single bacterial cells growing in rich medium, by Wang, Robert et al. [Robert] and by Stewart et al. [Stewart]. Thanks to a microfluidic set-up, the experimental conditions are well controlled and stable, so that the cells are in a steady state of growth (so-called balanced growth). The observation scheme corresponds to the sparse tree case for the data from Wang, Robert et al.: at each generation, only one offspring is followed. On the contrary, data corresponds to the full tree case for the data by Stewart et al., where the cells grow in a culture. The growth and division of the cells is followed by microscopy, and image analysis allows to determine the time evolution of the size of each cell, from birth to division. We picked up the quantities of interest for our implementation: for each cell, its size at birth, growth rate and lifetime. We consider that cells divide equally into two daughter cells, neglecting the small differences of size at birth between daughter cells. Each cell grows exponentially fast, but growth rates exhibit variability.
Our data is formed by the concatenation of several lineages, each of them composed with a line of offsprings coming from a first single cell picked at random in a culture. Some of the first and last generations were not considered in order to avoid any experimental disturbance linked either to non stationary conditions or to aging of the cells.
We proceed as in the above protocol. Figure 4 shows the reconstructed and for a sample of cells for the sparse tree data, for the full tree data. Though much more precise and reliable, thanks both to the experimental device and the reconstruction method, our results are qualitatively in accordance with previous indirect reconstructions carried out in [DMZ] on old datasets published in [Kubitschek] back in 1969.
The reconstruction of the division rate is prominent here since it appears to be the last component needed for a full calibration of the model. Thus, our method provides biologists with a complete understanding of the size dependence of the biological system. Phenotypic variability between genetically identical cells has recently received growing attention with the recognition that it can be genetically controlled and subject to selection pressures [Kaern]. Our mathematical framework allows the incorporation of this variability at the level of individual growth rates. It should allow the study of the impact of variability on the population fitness and should be of particular importance to describe the growth of populations of cells exhibiting high variability of growth rates. Several examples of high variability have been described, both in genetically engineered or natural bacterial populations [Sturm, Tan_Marguet_You_2009].
4.2 Link with the deterministic viewpoint
Considering the reconstruction formula (15), let us give here some insight from a deterministic analysis perspective. For the sake of clarity, let us focus on the simpler case when there is no variability, so that for all we have a fixed constant. Formula (15) comes from (14), which in the case simplifies further into
which corresponds to the stationary state linked to the equation
where denotes the size at time along a branch picked at random, see Section 5.1. Existence and uniqueness of an invariant measure has an analogy to the existence of a steady state solution for the PDE (25), and the convergence of the empirical measure to the invariant rejoins the stability of the steady state [PPS3]. The equality may be interpreted as follows: is the steady solution of Eq. (25), and represents the probability density of a cell population dividing at a rate and growing at a rate , but when only one offspring remains alive at each division so that the total quantity of cells remains constant. The fraction of dividing cells is represented by the term in the equation, with distribution given by