Mutation frequencies in a birthdeath branching process
Abstract
First, we revisit the stochastic LuriaDelbrück model: a classic twotype branching process which describes cell proliferation and mutation. We prove limit theorems and exact results for the mutation times, clone sizes, and number of mutants. Second, we extend the framework to consider mutations at multiple sites along the genome. The number of mutants in the twotype model characterises the mean site frequency spectrum in the multiplesite model. Our predictions are consistent with previously published cancer genomic data.
Mutation frequencies in a branching process
, and
3Corresponding author. \printeade1
class=MSC] \kwd[Primary ]60J80 \kwd[; secondary ]60J28 \kwd92D10 \kwd92D20
branching processes \kwdcancer \kwdpopulation genetics
1 Introduction
Luria and Delbrück’s famous work of 1943 combined mathematical modelling with experiment [28]. They considered an exponentially growing population of bacterial cells which is sensitive to attack by a lethal virus. The bacteria may mutate to become resistant to the virus. Lea and Coulson [27] obtained a probability distribution for the number of mutants, commonly known as the LuriaDelbrück distribution. The distribution has seen empirical evidence and become a standard tool for the estimation of mutation rates in bacteria [33]. While early formulations of the model were semideterministic, stochastic cell growth was subsequently incorporated (see [35] for a review). Notably, Kendall allowed for cells to grow as birthdeath branching processes [21].
Kendall’s twotype branching process, often referred to as the stochastic LuriaDelbrück model, has been foundational in the mathematical understanding of cancer evolution. The model and various extensions have been used to study drug resistance [16, 24, 13, 4], driver mutations [11, 10], and metastasis [29, 14, 30, 7], for example. As introduced by Kendall, wildtype (type A) and mutant (type B) cells are assumed to divide, die, and mutate independently of each other, according to
Whether the model represents the emergence of drug resistance in cancer or bacteria, the total number of mutants is of key interest. In recent years, [1, 16, 25, 22, 23, 2, 15] derived exact and approximate distributions for the number of mutants at fixed times and population sizes.
Our first objective is to offer a mathematically rigorous account of the twotype model, looking at the number of mutants, mutation times, and clone sizes (a clone is a subpopulation of mutant cells initiated by a mutation). Both previously known and new results are presented. We explore small mutation limits and longterm almost sure convergence. Specialising to neglect cell death, we give some exact distributions.
Our second objective is to introduce a neutral model of cancer evolution, which keeps track of mutations at sites on the genome. A site refers to a base pair. In our multiplesite model, each cell is labelled by a sequence , where means that the cell is mutated at site . The number of mutants with respect to a particular site follows the twotype model. Thus many of the twotype results are applicable in the multiplesite setting.
A standard summary statistic of genomic data is the site frequency spectrum. It is defined as the number of sites who see mutations in cells, for . We prove that the mean site frequency spectrum can be approximated with a generalisation of the LuriaDelbrück distribution. This result is consistent with cancer genomic data presented in [34, 5].
Of course, many works have attempted to predict mutation frequencies in cancer. Prominent examples are [34, 5, 31, 8], who gave approximations for the mean site frequency spectrum in a population of cancer cells. Every one of these works and countless others have used the infinite sites assumption, which says that each mutation occurs at a unique site. However, recent statistical analysis of cancer genomic data has refuted the validity of this simplification [26]. We do not use the infinite sites assumption, and make a theoretical argument against it.
The rest of the paper is organised as follows. In Section 2, we introduce the twotype model. In Section 3, we present longterm almost sure convergence results. In Section 4, we define and study the large population small mutation limit. In Section 5, we look at the large time small mutation limit. In Section 6, we present results on the number of mutants at a finite population size. In Section 7, we introduce the multiplesite model and present results on the site frequency spectrum. In Section 8, we discuss the multiplesite model in relation to recent works and data. In Section 9, we present proofs of our main results. See Appendix A for a generalisation of the results of Sections 4 and 5.
2 Twotype model
The wildtype cells grow as a linear birthdeath process , with birth and death rates and respectively. That is to say is a continuous time Markov process on with transition rates
The initial number of wildtype cells is fixed. The mutation rate is . Mutation events occur as a Cox process with intensity . The mutation times are
for . Each mutation event initiates a clone which grows as a linear birthdeath process with birth and death rates and . Clones grow independently of the wildtype growth and mutation times, and are represented by the i.i.d. processes for , with . The total mutant population size at time is
Write and for the fitnesses of the wildtype and mutant cells. We shall only be concerned with the case of supercritical wildtype growth,
Note that the process counting the number of cells, , is a Markov process on , with transition rates
We are interested in the process at a fixed time , and at the random times
and
for . Trivially, .
A classic application of the model is the emergence of drug resistance in cancer. Here, type A and B cells represent drug sensitive and resistant cells respectively. While the age of tumour is typically unknown, its size can be measured. Thus the times are relevant.
Another interpretation of the model is metastasis. Here, type A cells make up the primary tumour, and the clones represent secondary tumours. In this case the times are relevant.
3 Large time and population limits
Keeping the mutation rate fixed, the longterm behaviour of the model is mostly already well understood. Durrett and Moseley [11] study the case . Janson [17] studies a broad class of urn models, which encompasses Kendall’s model in the case . We do not present results as detailed as Janson’s. Our aim for this section is not to offer a comprehensive study, but rather bring together basic results which give valuable insight.
First we make note of a classic result:
(1) 
almost surely (see [3] or [9]). Here,
where the Bernoulli and the Exponential() are independent.
Remark 3.1.
The event that the wildtype population eventually becomes extinct agrees with the event almost surely.
We see a trichotomy, depending on the relative fitness of wildtype and mutant cells. Part 1 of Theorem 3.2 is a special case of [17, Theorem 3.1], and part 3 is [11, Theorem 2].
Theorem 3.2 (Large time limit).
The following limits hold almost surely.

For ,

For ,

For ,
The limit random variable comes from (1). The limit random variable is valued with mean
The full distribution of is given in [2, Section 4.3], which we do not state here for the sake of brevity.
For , conditioned on wildtype nonextinction, any individual clone ultimately makes up zero proportion of the mutant population. That is to say, conditioned on ,
almost surely. We say that the mutant population is driven by the wildtype growth. This is seen in the limit random variables’ dependence on .
For , early arriving clones make an important contribution to the mutant population. Conditioned on ,
almost surely. Note that if , then [11]. The are i.i.d. with distribution , where Bernoulli and Exponential() are independent. We say that the mutant population is driven by the clone growth.
To see the asymptotic behaviour of the number of mutations, simply consider in Theorem 3.2:
almost surely.
As corollaries to Theorem 3.2 we obtain large population limits. Note that conditioned on , almost surely.
Corollary 3.3 (Large wildtype population limit).
Conditioned on
, the following limits hold almost surely.

For ,

For ,

For ,
Corollary 3.4 (Large total population limit).
Conditioned on , the following limits hold almost surely.

For ,

For ,

For ,
Note that . In case 1, the wildtype and mutant cells come to coexist in a constant ratio. In cases 2 and 3, the mutant cells eventually dominate the overall population, with
(2) 
almost surely.
4 Large population small mutation limit
A tumour may comprise around cells upon detection, with mutation rates per base pair per cell division estimated as in colorectal cancer [19], for example. Hence, a biologically relevant limit can be found by taking the population size to infinity and the mutation rate to zero, while keeping their product fixed.
Suppose that is a sequence of mutation rates satisfying
(3) 
for some . For each , consider the twotype model with mutation rate . For the wildtype population, mutant population, clone sizes, number of mutations, and mutation times, write , , , , and respectively. Write
and
First we see a connection between the times and in the large limit.
Proposition 4.1.
Conditioning on ,
in probability, as .
All of our large population small mutation limit results will hold both in terms of the wildtype population and total population size. That is to say, using or as the time variable will yield the same distributions in the large limit. To save writing each result twice, we introduce the sequence , which may refer to or .
Underlying all subsequent results of this section is that the times of mutation centered about converge.
Theorem 4.2 (Mutations times).
Conditioning on ,
in finite dimensional distributions, as . is a Poisson process on with intensity .
A direct consequence of Theorem 4.2 is that for each , conditioning on ,
in distribution, as . In particular, has Gumbel distribution:
Next we look at the number of mutants.
Proposition 4.3 (Number of mutants).
Conditioning on ,
in finite dimensional distributions, as . The and are independent.
In particular converges in distribution to
(4) 
where , and are i.i.d. Exponential() random variables independent of the and .
Here corresponds to the age of a randomly selected clone, and the size of the clone. From [3, page 109],
(5) 
and so
The function is Gauss’s hypergeometric function , and , which is a clone’s ultimate extinction probability if . The third equality of (4) can be seen by making a change of variable , and then using a standard integral representation for (for example [20, C.8]).
Clearly is a compound Poisson random variable (4), and has generating function
(7) 
This recovers recent results of Kessler and Levine [23] who provided a heuristic derivation of this expression, and Keller and Antal [20] who derived it for a deterministic exponentially growing wildtype population. Its large limit appeared in Durrett and Moseley [11] for (see [20] for a discussion). If and , (7) reduces to the LuriaDelbrück distribution [27]:
Remark 4.4.
Of potential interest is the number of clones of a given size, perhaps above some lower limit for reliable detection. Let be a subset of . Consider
giving the number of clones whose size is in at time .
Proposition 4.5 (Number of clones of a given size).
Conditioning on ,
in distribution, as .
Consider
giving the size of the largest clone at time .
Proposition 4.6 (Size of largest clone).
Conditioning on ,
in distribution, as . Here, .
For an example we take the simplest choice of mutant cell growth: and . The number of clones above size is
The size of the largest clone is
Remark 4.7.
In this section we have considered a limit in which the product of the population size and mutation rate, , remains finite. It should be noted that alternative limits are also possible here. For example, Kessler and Levine [22] investigate large . In a different twist, Hamon and Ycart [15, Theorem 1.1] take the initial population size to infinity, the time of measurement to infinity, and the mutation rate to zero.
5 Large time small mutation limit
Here we investigate results similar to Section 4, but with a view to approximating the process at a fixed time rather than population size. Let be a sequence of nonrandom times converging to infinity, and a sequence of mutation rates satisfying
for some . For each consider the twotype model with mutation rate . We use the superscript notation established in Section 4.
Proposition 5.1 (Mutation times).
As ,
in finite dimensional distributions. is a Cox process on with intensity , where is distributed as (1).
Proposition 5.2 (Number of mutants).
As ,
in finite dimensional distributions. The are independent of .
Observe that
(8) 
where conditioned on is Poisson distributed with mean . The generating function of is
where is the clone size generating function, given by (4).
Remark 5.3.
The number of clones of a given size and the size of the largest clone can be seen in the large time small mutation limit. Simply replace with in Propositions 4.5 and 4.6.
Finally, we comment that the large time small mutation limit justifies a common approximation of the model, in which the wildtype population grows as . Here corresponds to defined in [11], for example.
6 Finite size results
For simplicity, we consider in this section. However, it should not be too difficult to extend to arbitrary initial cell numbers.
We are able to give the distribution of in the special case of no wildtype cell death.
Proposition 6.1.
For ,
(10) 
where are Poisson processes with intensity , , Exponential(), and Uniform[0,1], which are all independent.
To interpret (10), let’s consider a randomly selected type cell, labelled , of the cells present just before time . The cell has been alive for time , and initiated mutant clones, with mutation times for . The clone sizes are .
The mean number of mutant cells at time is
The generating function of is
where is given by (5). The computation is lengthy but straightforward; one can apply the integral expression [20, C.8] for the hypergeometric function, and the identity [20, C.10]. As in Remarks 4.4 and 5.3, for ,
(11) 
exists. The limit can be obtained using the method of [20, Section 6] (which is based on [12]), but is too cumbersome to include here. Powerlaw tails have often appeared in twotype branching processes, but were generally considered to be an artefact of approximation [11, 35].
Remark 6.2.
Next, specialising further to neglect wildtype and mutant death, we connect the distributions of the and .
Lemma 6.3.
For , and integers ,
A similar result was given by Janson [18, Lemma 9.1] for a different class of urn models. Although Lemma 6.3 can be combined with Proposition 6.1 to determine the distribution of , it does not seem likely that a tractable explicit expression can be obtained in general. However, for neutral mutations, Angerer was able to solve a recursion for the probabilities [1, Corollary 2.2].
Proposition 6.4 (Angerer).
For and ,
7 Multiple site model and site frequency spectrum
In the case of neutral mutations, we extend the twotype model to consider mutations at multiple sites on the genome.
The overall population grows as a birthdeath branching process. Cells divide and die at rates and , where . Consider sites, labelled . Each cell is labelled by some , where corresponds to a mutation at site . Initially there are an arbitrary number of cells all with label .
The mutations are modelled to occur in such a way that the number of mutants with respect to a particular site follows the twotype model. At each division event the parent cell dies, and two daughter cells are produced. The daughter cells inherit the parent’s mutations and may receive further mutations. Suppose that site is not already mutated in the parent cell. With probability site does not receive a mutation in either daughter cell. With probability exactly one of the daughter cells receives a mutation at site .
To state this more precisely, let us consider a parent cell with label dividing. The two daughter cells have labels and , where for each
Remark 7.1.
For our purposes, we do not need to specify the joint distribution of .
Remark 7.2.
As in Kendall’s model, we neglect back mutations, and neglect the event that a cell division sees both daughter cells receiving the same mutation (see [25] for a biological justification).
For each , let be the number of cells at time with .
Now we establish the connection between the multiplesite model and twotype model. Put , , . Then for each ,
The site frequency spectrum is defined to be the number of sites who see mutations in a given number of cells, i.e. the sequence
By linearity of expectation, the mean site frequency spectrum is determined by
(12) 
Antal and Krapivsky [2] found the distribution of , by solving the Kolmogorov equations. For brevity, we do not state their result. To see the mean site frequency spectrum at a fixed population size, define
as in the twotype model. Then for and ,
by Proposition 6.4.
Let’s return to the general setting of and . We briefly comment on the long term behaviour of the site frequency spectrum. The number of sites who are mutated in a given number of cells converges to zero: for any
almost surely. Let . The number of sites who are mutated in at least proportion of the population converges to :
almost surely, due to (2).
Next we look at the large population/time small mutation limits. Take a sequence of mutation probabilities . For each consider the multiplesite model with mutation probability , birth rate , and death rate (with ). Write for the population size and for number of site mutants at time . Write
The purpose of choosing the sequence of death rates in this way is to allow for a straightforward adaptation of the twotype results.
Proposition 7.3 (Large population small mutation limit).
Proposition 7.4 (Large time small mutation limit).
Take a sequence of times converging to infinity, with
for some . Then
where is distributed according to (8) with and .
Remark 7.5.
Since the size, rather than age, of a tumour can be observed, we are most interested in the large population small mutation limit. To give the reader an idea of its appearance, in Figure 1 the mean site frequency spectrum as given by Proposition 7.3 is plotted. The theoretical result is compared to simulations, with birth, death and scaled mutation rates taken from biological literature. In particular, we consider and (per day), which were estimated in colorectal cancers by [6]. According to [19], may be of the order of ; we consider two different values for in this region. We take a relatively small population size of and number of sites , so that computation time is reasonable. It is expected that taking larger and fixed will give an even closer fit between theory and simulations.
8 Discussion
From a single cancer cell a tumour may grow to comprise billions of cells. Mutations can occur at cell divisions, ultimately leading to great genetic diversity within a single tumour. With the advent of nextgeneration DNA sequencing, vast quantities of cancer genomes have been sequenced. Data has been made publicly available through the Cancer Genome Atlas and International Cancer Genome Consortium, for example. Considerable efforts have been made in recent years to explain observed mutation patterns with mathematical models, and from the observed mutation patterns to infer the evolutionary history of tumours.
Striking examples are Williams et al. [34] and Bozic et al. [5], who consider deterministic and branching process models respectively. They both derive that the expected frequency of mutations occurring in proportion of cells has density proportional to (away from ). In [34], 323 out of 904 cancers considered are deemed to fit the powerlaw. In [5], 14 out of 42 cancers are deemed to fit the powerlaw.
The models of [34, 5, 31, 8] all used the infinite sites assumption, which states that each site can mutate at most once over the lifetime of a tumour. Statistical analysis of cancer genomic data refutes this assumption [26]. Furthermore, we make a theoretical argument against the infinite sites assumption in the branching process setting. According to Proposition 4.2, the number of times a particular site has mutated before the population size reaches is approximately Poisson(). Therefore the infinite sites simplification may be appropriate when is much smaller than . However [34] estimated effective mutation rates, , of single base pairs to be in the region of . If a detected tumour comprises cells (e.g. [5]), then is not sufficiently small.
In Proposition 7.3, we have shown that the mean site frequency spectrum can be approximated by a well known generalisation of the LuriaDelbrück distribution. The distribution’s tail agrees with theoretical predictions and data in [34, 5]. But our predictions disagree at the lower end of the frequency spectrum. Due to unreliable data, [34, 5] did not make a modeldata comparison for mutations occurring in less than of cells.
In upcoming work we extend the multiplesite model to nonneutral mutations.
9 Proofs
Proofs for Section 3
Proof of Theorem 3.2.
For part 2, one needs to observe that
is a martingale with respect to the obvious filtration, and is bounded in .
For part 1, the reader may refer to [17] for a full proof in a more general and notationheavy setting. For the reader’s convenience, we offer the essence of Janson’s proof here. Crucially,
is a martingale. Janson obtains bounds for the probabilities
via Doob’s martingale inequality, and then applies the BorelCantelli lemma. ∎
Proofs for Sections 4 and 5
For each , the joint distribution of
(13) 
has been specified, with respect to the mutation rate . Note that the distributions of and do not depend on . We will construct the sequence (13) ranging over on a single probability space in a way that allows weak convergence to be shown via almost sure convergence.
On define the independent processes , for , and . As one would expect we take and . Take to be a Poisson counting process with intensity .
Define the mutation counting process by
The mutation times are given by
The total mutant population is
So the only dependence on comes from the mutation times. As before, define
and
The large population small mutation limit results all involve conditioning on or . Lemmas 9.1 and 9.2 will demonstrate that the results can be equivalently formulated by instead conditioning on nonextinction of the wildtype population.
Lemma 9.1.
Suppose that and are sequences of events, such that

,

, and

.
Then
if it exists.
Proof.
Write
and take . ∎