Widespread bursts of diversification in microbial phylogenies

Alice Doucet Beaupré1,2*, James P. O’Dwyer2,3,

1 Program in Ecology, Evolution, and Conservation Biology, University of Illinois, Urbana, Illinois, USA

2 Department of Plant Biology, University of Illinois, Urbana, Illinois, USA

3 Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana, Illinois, USA

These authors contributed equally to this work.

* doucetb2@illinois.edu


Until recently, much of the microbial world was hidden from view. A global research effort has changed this, unveiling and quantifying microbial diversity across enormous range of critically-important contexts, from the human microbiome, to plant-soil interactions, to marine life. Yet what has remained largely hidden is the interplay of ecological and evolutionary processes that led to the diversity we observe in the present day. We introduce a theoretical framework to quantify the effect of ecological innovations in microbial evolutionary history, using a new, coarse-grained approach that is robust to the incompleteness and ambiguities in microbial community data. Applying this methodology, we identify a balance of gradual, ongoing diversification and rapid bursts across a vast range of microbial habitats. Moreover, we find universal quantitative similarities in the tempo of diversification, independent of habitat type.


Large scale microbiome sampling and sequencing [1, 2, 3, 4] have documented global microbial diversity with unprecedented scope and resolution. The tools currently applied to these data allow us to quantify the amount and type of diversity found in microbial communities [5, 6, 7], and yet we know remarkably little about the the underlying community dynamics and tempo of diversification that generated the biodiversity we observe. This gap in our knowledge calls out for robust new ecological and evolutionary theories that will allow us to connect mechanisms to observed patterns [8].

To address this challenge, we introduce a new methodology to bridge the gap between biological process and observed microbial biodiversity. Our approach leverages the inference of dynamical processes from evolutionary trees, previously applied to understand large-scale evolutionary structure[9, 10, 11, 12, 13, 14, 15], and also the dynamics of viral populations on shorter timescales[16, 17, 18, 19]. We also incorporate the recent identification of bursts of diversification in microbial phylogenies [20]. The result is a model which includes traditional, slow processes for gradual speciation (one lineage goes to two lineages, which we call the ‘birth’ of a lineage) and extinctions (one lineage disappears, which we call ‘death’), together with a third set of mechanisms, incorporating the process of ecological innovation potentially followed by radiative diversification.

We apply our framework to data spanning 13,500 individual samples, 56 habitat types, and 29 biomes [1], finding a previously unidentified balance of fast and slow evolutionary processes in these data, and a tendency towards universal behaviour in the quantitative description of diversification. We cannot directly quantify the traits and their changes through time that may have led to a given combination of rapid and gradual processes, but our results are strongly suggestive of a centre-ground in the long-standing debate over phyletic gradualism versus punctuated equilibrium [21].

Materials and methods

0.1 Coarse-grained phylodynamics

Our knowledge of the diversification of a group of organisms is often characterised by the branching of their evolutionary lineages, reconstructed using genetic sequence data sampled in the present day. We can think of an evolutionary tree, also known as a phylogeny, as the input data for inferring theoretical models of diversification, a methodology known as phylodynamics [16, 12, 13, 9, 14, 15, 18, 19, 10, 11]. This approach is usually thought of in terms of speciation and extinction rates. But in recent work [20] we identified patterns of bursts in the branching in microbial phylogenies. Therefore, in addition to the traditional speciation-extinction process, we also parametrize fast (but brief) bursts of diversification. In these burst processes, we propose that a lineage will undergo a much faster rate of diversification, , but for a very short time, . The origin of a burst could e.g. be a key functional change that opens the opportunity for a rapid radiation [22, 23], or a disturbance which opens up a new habitat to be colonized. In the following we will refer interchangeably to this third process as bursts or innovations.

There are two problems with inferring the parameters of this generalized innovation process when is large. First, there will be parts of any reconstructed phylogeny where we may not have enough information in our sequence data to distinguish the true ordering of very fast branching events, as shown in Figure 1. Even if we did have longer sequences, there is always a speed limit on what kinds of process we can accurately infer from these data, making these rates hard to infer. Second, in any realistic evolutionary history we would expect many different rates , corresponding to the idiosyncrasies of individual events. In this new approach, we (partially) bypass both difficulties by applying a method of coarse-graining, where we decompose a sample phylogeny of age into slices of width (Fig. 2 Panel A). In a coarse-grained phylogeny we are no longer trying to resolve down to each binary split in the tree—what we have access to are the ‘chunks’ of diversification between time slices (Fig. 2 Panel B), which define equivalence classes of binary trees.

Figure 1: Uncertainty generates polytomies. Periods of fast diversification leave little to no signal in sequences with a limited number of base pairs, meaning that we cannot always distinguish between different possible orderings of diversification events using short sequences. These ambiguities often rightfully show up in bootstrap consensus trees or during calibration. Considering the tree as an ensemble of transitions happening over a coarse-grained time interval alleviates this issue and allows the inference of effective parameters associated with faster processes.

Surprisingly, there is a way to bypass this speed limit, by leveraging the distribution of sizes of these (apparent) bursts of branching. Even though we can’t resolve phylogenies down to the shortest timescales, this distribution still carries information about the parameters of the innovation process. The catch is that we cannot distinguish between different values of and independently, as the distribution of burst sizes is a geometric distribution which only depends on the product of diversification rate and diversification time, . By looking at the evolutionary history through a blurred lens, we therefore collapse a multi-parameter family of models into a single parameter, reminiscent of the loss of information under coarse-graining in physics—so that at a sufficiently coarse resolution, many different fine-scale models map onto the same effective theory.

The final steps in our pipeline (Fig. 2 Panels C and D) involve computing likelihoods for a given model of innovation using the chunks defined by coarse-graining. We considered three distinct models in this preliminary work: the traditional speciation-extinction process (abbreviated BD, for birth and death); our basic speciation, extinction and individual burst process (BDI), which assumes a single value of the product ; and finally what we call speciation, extinction and heterogeneous bursts (BDH), where we compound the innovation processes using a distribution over the product . BDI and BDH offer an agnostic and parsimonious alternative to models with time-varying [9, 10, 11], trait-dependent [12, 13], and diversity-dependent [14, 15] models.

The likelihoods are then calculated via numerical solutions of a master equation, described below, for the probability that an initial lineage (at the beginning of any chunk) will have branched into lineages at the end of the chunk conditional on each lineage having at least one extant descendant. Each chunk represents a properly weighted sum over many histories compatible with uncertain tree structures and therefore helps us bypass the need for an accurate estimate of branch lengths.

Microbial phylogenies usually span enormous amounts of evolutionary time with relatively short sequences and may suffer from a lack of phylogenetic signal sufficient to reconstruct early ancestral states, which in turn imposes an horizon deep in the tree beyond which topological inaccuracies become inevitable [24]. Furthermore, slices closer to the present contribute much more weight in the total likelihood than those from an otherwise noisy and uncertain past because they contain more chunks. For example, in Fig. 3 below, consider the intersection of the cumulative distribution and the y-axis equals the number of chunks in that slice. We see that the slice closest to the present (top-left panel) contributes at least an orders of magnitude more chunks than the earliest ones (panels in bottom two rows) where low phylogenetic signal potentially degrades the quality of the input phylogeny.

What happens if we apply this methodology to a perfect, fully-resolved tree? Reassuringly, we show in the Supplementary Information that in this limit and using a very large number of slices our inference using the BD model exactly recapitulates previous approaches [25, 26, 27, 28, 9]. On the other hand, in cases where we do have limited resolution due to short sequences, our method extends the current applicability of BD models by allowing the inference of rate parameters over incompletely-resolved phylogenies. We can then distinguish between our three nested models using likelihood ratio testing, and we also perform an exact goodness-of-fit test by comparing a given empirical tree to an ensemble of typical trees generated by the model and its parameter estimates constrained to the same empirical size and depth (see Supplementary Information and Fig. S4).

Figure 2: Coarse-grained phylodynamics inference pipeline. (A) We cut a phylogeny of depth at a predetermined number of slices , based on sequence length. Slice boundaries isolate pieces of the tree that begin with a single lineage further in the past and end at lineages at the next slice boundary closer to the present. The isolated dark-grey piece above begins in the past at with one lineage and ends closer to the present at with lineages. (B) Once sliced, the tree decomposes into many isolated pieces, or chunks. Each chunk is identified by : its origin at , its end at the next slice boundary , and the number of observed lineages at time . (C) To each chunk corresponds a conditional transition probability which traces over all unobserved, extinct lineages (red), and over all histories consistent with observed lineages (gray), all of them with extant descendants (blue). (D) We form the full log-likelihood of observing a given tree by adding together all individual chunk log-likelihoods. Parameter estimates (denoted generically by ) are obtained by maximizing the log-likelihood.

0.2 Data availability

All amplicon sequence data and metadata have been made public through the data portal (qiita.microbio.me/emp) and all accession numbers to the European Nucleotide Archive (www.ebi.ac.uk/ena) can be found in Table S1. All processed individual phylogenetic trees used in this study will be made public prior to publication.

0.3 Code availability

Custom code for the tree coarse-graining, maximum likelihood inference, and Markov Chain Monte Carlo (MCMC) goodness-of-fit test is stored in a private GitHub repository and will be made public prior to publication.

0.4 Dataset and data preparation

As an initial application, we considered the Earth Microbiome Project (EMP) 10K (as of 2015), drawing from 5.59 million representative 16S rRNA sequences (open reference 97% similarity cut-off, V4 region[29]) gathered from 56 different studies and breaking into 13,500 separate samples [1]. Table S1 lists these studies, together with a short description as it appears in the EMP metadata, and a clickable URL to their Qiita entries and accession number or study website. These samples span 29 biomes and 56 habitats, including host-associated habitats from 93 different types of host. The EMP dataset is available on the project website (www.earthmicrobiome.org) as single large phylogenetic tree reconstructed using FastTree [30, 31] and representative sequences for each taxonomic unit from all habitats through the Qiita data portal (qiita.ucsd.edu).

The first critical question to address is to what extent we should aggregate samples when applying our phylodynamic inference. Should all human-associated samples be aggregated, or all soil samples? We know that the ancestors of present-day organisms in each sample were unlikely to be co-located, experiencing the same environmental context for their entire evolutionary history. On the other hand, we don’t want simply to pool all samples together; microbial life from sufficiently different habitats is likely to have different underlying evolutionary and ecological processes. For this study we took the approach of inferring parameters sample by sample, but it would be straightforward to aggregate any particular group of sampled data. For example, questions pertaining to phylogeography may necessitate the use of a spatially hierarchical aggregation scheme. We began with the large EMP phylogeny containing all representative sequences and we transformed it into a chronogram using the method of mean path lengths (MPL) [32, 33] weighted by relative operational taxonomic unit abundances. We chose this method because likelihood and semi-likelihood approaches for ultrametrizing phylogenies do not scale well for trees with more than 1,000 to 10,000 branches or leaves. MPL on the other hand requires only two traversals of the tree and therefore scale linearly with the size of the tree—ultrametrizing the full EMP phylogeny with MPL takes about 5 minutes on a consumer laptop. Alternatively supertree methods[34] could also be used for transforming in a reasonable amount of time very large phylogenies into a chronogram.

This second step is necessary because, like all phylodynamic approaches, our models are based on changes through time, whereas our reconstructed trees are based on sequence divergence. Transforming the tree into a chronogram puts the two on the same footing. We did not need to calibrate the root of the full EMP tree in actual Myr or Byr because only the dimensionless rates and exponents matter for the quantitative comparison and characterization of gradual (slow) vs bursty (fast) evolution. From this large phylogeny we then pruned a smaller subtree for each individual sample. These individual sample phylogenies form the input data for our analyses.

0.5 Model definitions

All models considered in this study (BD, BDI, and BDH) are constructed from four building blocks which create and destroy lineages which we will represent by the letter , namely the stochastic birth process


with per-lineage birth/speciation rate , the death process


with per-lineages death/extinction , the innovation process


where is the per-lineage innovation initiation rate and transition probabilities for and 0 otherwise. The parameter characterizes the geometric burst-size distribution. Finally we compound the innovation process with a beta distribution over . The beta distribution is parametrized by its precision and mean to obtain the heterogeneous innovation process


where is the per-lineage heterogeneous innovation initiation rate and transition probabilities for and 0 otherwise. The beta distribution is sometimes parametrized by two shape parameters, and , in our case controlling the behavior of the parameter , now a random variable, around 0 and around 1, respectively. In terms of those shape parameters the precision and mean .

For each process there is an associated stochastic generator which encodes the instantaneous transition rates between different number of lineages presented in equations (1) through (4). Generators for the above processes are respectively given by


To combine models one adds generators together. For example the generator of the BDH model is written with associated parameter set .

0.6 Coarse-grained phylogeny likelihood

The coarse-graining/slicing operator decomposes a sample phylogeny of age into slices of width . The resulting coarse-grained phylogeny is characterized by a multiset of chunks , with the index map given some arbitrary tree traversal. The log-likelihood of observing under a given models with parameter set is written


and maximum likelihood parameter estimates


0.7 Chunk likelihoods

The expression for individual chunk likelihoods is given by


The chunk generating function


where is the probability generating function


solution to the master equation


with initial condition . The generator is one of , , or . The subtraction of in the numerator and normalization by in the denominator account for conditioning on non-extinction of at least one observed extant lineage. Details on the derivation, generalization to incomplete sampling of lineages, and numerical computation of the above quantities can be found in the Supplementary Information.

0.8 Model comparison and goodness of fit

The BD model is nested into BDI, and BDI into BDH, therefore we can perform a likelihood ratio test by comparing the -statistic


against a distribution with number of degrees of freedom . For we have , and for we have .

We perform an exact goodness of fit test by sampling the -statistic over the space of possible input coarse-grained phylogenies constrained by the depth and number of leaves of the empirical sample phylogeny. This statistic is given by


where the first sum runs over slices and the second one over all chunk sizes found in slice . In the summand represent the number of chunks of size , and the total number chunks, in slice . The -statistic corresponds to the information divergence between the empirical and the theoretical chunk size distribution. The exact goodness of fit statistic is given by the fraction of coarse-grained phylogeny with greater than the empirical value where ’s are sampled using the MCMC Metropolis-Hasting algorithm poised at the maximum likelihood parameter estimates. We describe the algorithm and coarse-grained proposal distribution in the Supplementary Information. This step is necessary because the number of degrees of freedom of trees with fixed depth, fixed number of leaves, and fixed number of slices is unknown.


We rejected the BD hypothesis in favour of innovation (the BDI or BDH model) according to a likelihood ratio test for nested models (BD  BDI  BDH) at significance level , corresponding to a Bonferroni corrected family-wise error rate level . In cases where BD was rejected, we use the maximum likelihood estimates for the parameters of the BDI or BDH process to analyze the balance of slow and fast processes (birth rates vs innovation rates) and the phylogenetic signature of fast processes themselves (the distribution of burst sizes). Figure S1 shows that BD was rejected in favour of BDI in 98% of samples. Subsequently, our basic innovation model was rejected in favour of heterogeneous innovation in 80% of samples. To understand why BDH is clearly selected in many cases it is instructive to look at a typical example. Figure 3 shows how BDH captures the fatter tail of the empirical chunk size distribution across all slices of a coarse-grained phylogeny, while BD (and to an extent BDI) fail. Nonetheless, Fig. S4 shows that for 57% of samples BDI and BDH are sufficient to recapitulate the phylogeny, while for 43% of samples both BDI and BDH fail the goodness-of-fit test, which suggests that we need a more complex model of innovation to account for them.

Figure 3: Cumulative chunk distributions across slices are approximately distributed as a power law. The empirical tree for this particular case comes from the gut microbiota of captive colobine primates[35] (stomach mucosa of a captive Northern Douc, sample ID 45300SDZ3.F1.Pnem.stom.609719, study MetSan, see Table S1) The upper left plot stands for the distribution within the slice closest to the present, i.e. including the leaves. Time increases toward to past from left to right and from top to bottom. The red line represents the maximum likelihood distribution fitted using the heterogeneous innovation model, the blue line the basic innovation model, and the gray line the standard speciation-extinction model. Note that all time slices use the same parameter estimates for a given models—i.e. the model is fitted using the whole tree, not tuned slice-by-slice.

Beyond looking at single studies, our results can be expressed in terms of three important messages. First, the mixture of bursty (in the form of innovation events) and gradual diversification is clearly preferred over pure gradualism in the vast majority of our samples. To explain the dynamics implied by reconstructed evolutionary trees, we need both ongoing, slow diversification, and bursts of faster diversification that last for a relatively short time. Second, even though fast processes by definition produce more lineages per unit time while they are in play, the initiation of bursts of any size is also more common than slow gradualism in samples that show evidence of heterogeneous innovation. Finally, in the right hand panel of Fig. 4 and in Fig. S2-S3, we document the distribution of parameters controlling the shape of the burst size distribution for the heterogeneous innovation process. The distribution is beta-geometric and at large burst size this distribution behaves as where . The effective exponent of this power law is clustered with median and quartile coefficient of dispersion , a surprisingly narrow range of values, independent of the other estimated parameter values, the habitat, or the study. The apparent universality has echoes in recent work [17, 20], and in the long history of studying scaling in evolutionary history, for example in the number of species per genus in a given taxonomic group [36]. Our current analysis goes beyond documenting these patterns, by connecting this universality to a mechanistic interpretation.

Figure 4: Distribution of parameter estimates for the heterogeneous innovation model. (A) Each dot represents the effective rate of initiation of fast bursts vs the net rate of gradual diversification of an individual sample. We obtain comparable dimensionless effective rates by multiplication with the age of the root in each sample. (B) The histogram shows the distribution of the BDH exponent characterizing the power law tail of heterogeneous innovation burst sizes. Smaller inferred values of imply that the burst size distribution has a heavier tail, and we identify a clustering of values across habitats. (C) Each dot represent the exponent of individual samples broken across different types of biome. Biomes are sorted according the their median. Boxes and whiskers indicate quartile limits.


We have introduced a new methodology to interpret what diversity in environmental sequence data can tell us about the ecological and evolutionary processes that shaped it. The key theoretical step in our new method is to recognize that faster diversification processes which appear intermittently and last only for a short time, still leave a signature in imperfectly reconstructed phylogenies. This signature persists even when the quality and length of our sequence data and consequent resolution of the phylogeny is relatively low compared to the timescale of the processes. Combining this realization with existing methods for inferring slower gradual processes from phylogenies, we were able to quantify the balance of fast and slow processes, and the parameter values that best describe the structure and distribution of burst sizes. Our conclusions in applying this to a large dataset encompassing heterogeneous habitats are stark: we almost always need these heterogeneous faster processes to complement gradual diversification in order to explain these data, and the parameters that best explain observed structure of evolutionary trees are surprisingly universal across studies and environmental context.


These results raise many questions, and open a number of doors for future investigation. Perhaps the primary open question is: what biological changes cause the bursts we observe in empirical trees? Are these genuinely due to innovations, where an adaptation opens the door to many further adaptations [22, 23, 37, 38]? They could also be the result of exploration of new habitats, disturbance opening up niche space to be invaded [39, 40], or something else entirely. Our current analysis cannot answer these questions clearly, but the evidence does clearly show that an explanation is necessary. Second, we have shown that a class of distinct fast processes all map on to the same observable phenomena at coarse temporal resolutions through a combination of their parameters. This is a quantitative example of a long-discussed idea in ecology that only a handful of parameters survive to describe phenomena at larger or longer scale. The assumption is inherent in neutral models, but also in other, simplified models of macroecological patterns [41]. Our approach can form the starting point of quantitatively understanding which parameters and processes ‘upscale’, and which do not. Finally, why do we see such clearly convergent patterns across divergent habitat types? The ecological and evolutionary constraints leading to the patterns we’ve seen deserve a fuller explanation.


A.D.B. was supported by a CompGen Fellowship from the University of Illinois and by the Cooperative State Research, Education, and Extension Service, US Department of Agriculture, under project number ILLU 875-952. J.O.D. acknowledges the Simons Foundation Grant #376199, McDonnell Foundation Grant #220020439, NSF #DEB1557192 and Templeton World Charity Foundation Grant #TWCF0079/AB47. Sample processing, sequencing and core amplicon data analysis were performed by the Earth Microbiome Project (www.earthmicrobiome.org), and all amplicon sequence data and metadata have been made public through the data portal (qiita.microbio.me/emp).

Author contributions

A.D.B. and J.O.D. designed the study and wrote the manuscript. A.D.B. developed analytical tools and analysed phylogenetic data.

Corresponding author

Correspondences to Alice Doucet Beaupré (doucetb2@illinois.edu)

Supplementary Information: Widespread bursts of diversification in microbial phylogenies

1 Formalism

1.1 Generating functions (GF) and holomorphic/Fock-space formalism

The equivalence between holomorphic/meromorphic generating functions and Fock-space methods applied to classical objects has a long history in the field of nonequilibrium statistical mechanics of many-body systems, beginning with the seminal papers [42, 43]. Yet it is only relatively recently that those methods have been recognized as potent tools applicable to mathematical biology and ecology[44, 45]. Let the probability generating function (PGF) of the state at time

where represents the probability of having abundance , or more to the point, of having lineages at time . We adopt the convention that roman letters inside a ket are used to denote a monomial/unit mass/abundance state while Greek letters inside a ket denote mixtures like above. Single abundance state probabilities can be extracted by successive derivation


In the Fock-space formalism we write the explicit scalar product


and the completeness relation


The normalization of the probability distribution implies that . Evaluating a GF at a point requires the introduction of a left coherent-state, denoted with an underline,


En passant this innocent looking equation is at the core of the equivalence between holomorphic functions and state vectors in Fock spaces. It allows us to easily translate between the natural, combinatorially intuitive language of generating functions, which we will use profusely in the following to construct our coarse-grained conditioned tree observables (or chunks), and the practical linear algebra methods forming the numerical backbone of this study. Under this equivalence the normalization condition becomes . We use this particular coherent state to find the expectation value of operators, namely


Classical stochastic observables are usually diagonal, i.e . The holomorphic representation for creation and annihilation operators corresponds to multiplication and derivation by , i.e. and , and satisfy bosonic commutation relation


The master equation for a continuous-time stochastic process is written in the language as the partial differential equation (PDE)


together with initial condition . is the time evolution generator and encodes all the information about the instantaneous dynamics of the process. Brackets here denote the dependence of the generator on and , not its multiplication by the bosonic commutator. Eq. (23) admits the formal solution


This is the starting point for the numerical exponentiation scheme used below and in the main text to construct the likelihood of a coarse-grained (CG) tree.

The holomorphic formalism admits three important similarity transformations[46]: two shifts,




and a scaling transformation,


1.2 Incomplete lineage sampling

If we approximate the sampling process by a Bernouilli trial with success probability, or sampling fraction, , then the GF for the joint probability of successfully sampling individuals out of a population of or more individuals is given by


This expression captures the fact that every states with abundance greater than contribute to the probability of sampling exactly lineages.

2 Processes

2.1 Birth process

The birth process with per capita birth rate consists in the transition

The generator for the birth process is given by


The instantaneous transition rates between states are given by


The generator, in matrix form,


2.2 Death process

The death process with per capita death rate consists in the transition

The generator of the death process is given by


and the instantaneous transition rates


2.3 Innovation process

The innovation process consists in initiating at per capita rate a Yule (pure birth) process and identifying all finite-time transition probabilities with infinitesimal transition rates. To obtain the innovation process generator we first need to solve the birth process exactly. We begin by writing its formal solution using Eqs. 29 and 23 as


Using the change of variable we rewrite the evolution equation


Using Eqs. 25 and 27 we solve


where . For a given burst time-scale , higher fitness processes lead to values of closer to 1. For a single initial lineage, and we recover the geometric PGF


The generator of the innovation process

where , is therefore given by


In the second equality, we made sure to only include off-diagonal transitions in the positive term by absorbing the false transition in the negative diagonal term. Doing so also highlights the fact that the effective innovation rate is actually equal to rather than . It is tempting to interpret, in line with the Red Queen hypothesis, the rate of initiations of “invisible” innovations as capturing perhaps the rate at which lineages must constantly innovate in short bursts of higher fitness just to keep pace with co-occurring lineages and changing environments. The instantaneous transition rates are given by


and the matrix of the generator


2.4 Heterogeneous innovation process

To account to the possibility of multiple innovation processes with different values of we compound the parameter of the geometric part with its conjugate prior the beta distribution. The generator


In the last two lines we used the reparametrization of the beta distribution in terms of the precision , , and the mean , . Similarly as before the instantaneous rates


For , instantaneous rates follow the power law


The power law phase is controlled by the parameter of the beta distribution. The parameter in turn controls the shape of the density of the geometric innovation parameters around . As approaches 0 and approaches 1, the geometric distribution acquires a progressively longer exponential decay that tends towards a uniform improper distribution over all . When , the density diverge algebraically as and the compounding of wide geometric distributions give rise to power laws with tail exponent between 1 and 2. When , then the tail exponent becomes exactly 2, and when , the tail exponent becomes greater than 2. Finally the matrix of the generator


3 Maximum Likelihood approach

3.1 CG tree observables and likelihood

We extract dynamical information contained in a chronogram (ultrametric phylogenetic tree) by looking at CG transitions, i.e. single lineages going to many lineages at a time closer to the present. Those are called chunks in the main text and to each chunk is associated a tuple and likelihood chunk where is the sampling fraction. A likelihood chunk represents the probability that a unique lineage at time in the past had exactly descendants at time that each had at least 1 extant lineage in the present, while all other lineages above were unobserved by virtue of having gone extinct or of being missed during sampling of intensity . We can write using the previous formalisms


where is the solution of Eq. 23 with initial condition and is its -th derivative w.r.t. . In terms of probabilities contained in and the expression 46 for chunks is given by


This is a probability distribution over . To show this we sum over


In the first equality we added and subtracted the extinction term . In the second equality we used the semigroup/Chapman-Kolmogorov property. In the third equality we used the shift property Eq. 25. In the fourth equality we cancelled terms inside the argument of and finally used the normalization condition . All terms are positive and sum to one and therefore is a probability distribution over .

Combining the method of characteristics and numerical complex derivation, the first (holomorphic) formulation of Eq. 46 allows for the exact numerical computation of chunk likelihoods. Yet we found the second (Fock space) formulation using truncated matrix exponentials and linear algebra to be easier to implement and more stable at large values of . All the information of a given model reside inside the generator . The Greek letter is understood as the set of all parameters of a given model. We will denote the action of CGing/slicing with slice width a chronogram by


The index set runs over chunk indices following some traversal order over the CG chronogram. We can write the likelihood of a given CG chronogram


The central inference objective of this framework is to seek ML estimates (MLE) by maximizing the log-likelihood


3.2 Equivalence with the Morlon et al. likelihood

It is worth mentioning here that Eqs. 46 and 50 recover the likelihood found in [9] in the limit of infinitesimal chunk duration . This limit is also equivalent to a particular slicing scheme where each branch gets a chunk of size with duration equal the length of the branch, and each node a chunk of size realized with an infinitesimal duration. Let us first show the equivalence between the infinitesimal and branch-node slicing. Consider two consecutive branch segments of length and . Together they contribute


to the full tree likelihood. Recall the Chapman-Kolmogorov semi-group property of generating functions which ensures . Therefore


Substituting backward in Eq. 52 it follows that


This means that the product of consecutive chunks of size compounds into one chunk of size with length equal to the sum of individual chunk lengths. This in turn implies the equivalence between the infinitesimal slicing of a tree and a scheme where each branch gets its own chunk with the same initial and final times and , while its immediate downstream node inherits a chunk of size with initial and final times and . Chunks “touching the present” contribute a weight


Finally node contributions for the BD model in this scheme


Therefore the contribution of an internal branch starting at in the past and ending with a node at has weight


and the tree likelihood in the limit for a tree of depth and leaves


This is equivalent to Eq. (1) in [9] for the case of constant birth and death rates modulo a factor . Indeed for the BD model


where the absorbing state/exctinction probability


and the ratio of constant per capita death and birth rates. We already found the Yule limit previously when we constructed the innovation process. In the Yule limit , therefore and . Generalizing to time-varying rates is straightforward and the equivalence remains valid.

3.3 Chunk generating function

Using manipulations similar to those of the previous sections, we find the chunk PGF


This expression has a very intuitive combinatorial interpretation that could have been guessed from the beginning and used as the starting point for the chunk decomposition. It symbolically stipulates that at , for whichever state you are in, substitute each lineage represented by atoms with a Bernouilli trial of success probability equal to the probability of surviving across time to the present. In other words, lineages are ”preemptively” split at time into empty atoms (the monomial 1) weighted by the probability of not making it to the present, , and unit atoms (the monomial ) weighted by the probability of making it to the present, . We easily recover the extant chunk PGF when conditioning on survival by omitting the extinction term at and renormalizing. We find


and finally it is straightforward, if tedious, to verify that