Fitness Estimation in Models of Genetic Evolution of Asexual Populations

Fitness Estimation in Models of Genetic Evolution of Asexual Populations

Sergey S. Sarkisov II, Ilya Timofeyev, Robert Azencott

In this paper we develop a systematic methodology for estimating the fitness of genotypes from observational data describing the long-term evolution of bacterial populations. In particular, we develop a numerical approach for estimating genotypes’ fitnesses in locked-box models describing bacterial evolution in experimental setups similar to the celebrated Lenski experiment. Our approach is based is on a nonlinear least squares optimization procedure which takes into account both the first and the second mutation events. The methodology developed in this paper is confirmed with numerical simulations with realistic parameter values.

AMS Classification: 60J20, 92D15

1 Introduction

Studies of evolutionary dynamics, including bacterial populations, received a considerable amount of attention in recent decades starting with the celebrated Lenski experiment [15, 26]. Since then other groups contributed to experimental studies of Escherichia coli (see e.g. [3, 10, 2, 16, 6, 11, 4]). The experimental setup usually consists of multiple populations processed in parallel. These populations undergo daily growth followed by daily selections of fixed-size samples. Initially, all populations are identical and consist only of the ancestor genotype. Since evolutionary experiments are carried out for a long time (years), different random mutations occur in different populations. Therefore, the genetic composition of different populations can differ drastically over time. The main goal of such experiments is to understand the main features of evolutionary dynamics, including the rate of evolutionary change, likelihood of emergence of a new genoype, fitness landscape, etc.

The Lenski Escherichia coli long-term evolution experiment provides a considerable insight and evidence for the mechanisms of growth and adaptation of asexual organisms (e.g. [15, 26, 8]). Analysis of the experimental data shed some light upon the relative growth rates of stronger mutants and their ancestors. However, the Lenski experiment primarily focuses on the overall population adaptation [1, 27], while our goal is to develop techniques for estimating individual selective advantages for all genotypes.

The adaptive evolution of bacterial populations is driven by the emergence of positive mutations and their spread due to natural selection. Various mathematical models of this evolution process have been developed (see e.g. [12, 25]), but estimation of parameters in these models remains one of the key issues. The evolutionary dynamics of bacterial populations depends on two main parameters - selective advantages and mutation rates [10, 17, 14]. In this work we assume that the mutation rates are known and address estimation of selective advantages.

Several previous works addressed similar questions. For instance, one study involved both experimental and theoretical analyses of evolving viral populations [21, 20]. It has been able to directly predict adaptation models, including the distribution of beneficial mutation effects . However, this work was limited to the viruses, because of their small genome size, high mutation rate, and large mutation effects. By contrast, in bacterial populations, direct estimates of mutational parameters is much more complex [23] because these parameters are usually estimated indirectly by inference from observable markers [23, 10, 2]. A recent study focused on bacterial populations where a single mutant type emerges and overtakes the population [28]. Mathematical derivations and model simulations were combined to derive accurate estimates for the mutation rates and selective advantages of the main emerging beneficial mutation. In the approach developed in [28] only the last and most significant divergences were applied to obtain parameter estimates. In another study on parameter estimation [10], only the first divergences were used to perform generalized regressions. In contrast with these works, we develop more realistic formulas involving occurrence of multiple mutant types.

A typical approach in E. coli evolutionary experiments is to observe many populations simultaneously evolving in parallel (see e.g. [28]). Each population consists entirely of the ancestor type cells initially. In order to be able to observe mutation events, half of the cells are colored via a particular marker in each population (e.g. Ara marker in E. coli experiments in [28]), whereas the other half are colored via a different marker (e.g. Ara). In E. coli evolutionary experiments Ara and Ara markers correspond to growing white and red cell colonies, respectively [10]. Initially, red and white colonies coexist at the rate 50/50 in all populations. No cell dynamics is altered after applying this coloration. These asexual microorganisms reproduce via binary fission, and the genetic markers are passed down from parents to offspring [2]. Initially these population consist of cells, with typical values of . The populations are allowed to grow for 24 hours, and then a sub-population of approximate size is extracted by dilution and transferred to a new culture of fresh growth medium. This transfer step is repeated daily for all populations. The white and red cell frequencies are estimated periodically by plating cells on indicator media while visual counting errors are neglected. Therefore, after the initialization step (initialize all populations with the identical ancestor genotype colored 50/50 with the Ara and Ara markers), the experimental process of growth, mutations, and dilution is repeated for many days.

A mutation occurs if one of the colors significantly exceeds 50% in a particular population. With daily observations, the whole trajectory for one of the colors can be recorded. Since there are multiple populations undergoing evolutionary dynamics, combining all population data results in an ensemble of trajectories. In this paper we develop an optimization procedure which utilizes this ensemble data to estimate the selective advantages of all genotypes in the population. In particular, in section 2 we describe the stochastic Markov chain model which describes the evolutionary dynamics of bacterial populations. In section 3 we derive expressions for the most-likely evolutionary trajectory conditioned on the occurrence of one- and two-mutation events. In section 4 we describe the setup for our numerical experiments to generate synthetic data for an ensemble of evolutionary trajectories. In section 5 we define an optimization procedure for estimating the genotypes’ selective advantages and in section 6 we present numerical results.

2 Stochastic Model

One of the most popular mathematical models describing the biological experiments outlined in the previous paragraph are Markov-chain “locked-box” models [28, 19, 10, 22, 25]. In these models the mutation is separated from growth, and the biological experiment is modeled by the repeating sequence growth mutations dilution To construct the mathematical model, we introduce the set of genotype frequencies, , where is the discrete time measured in days and is the genotype index. Random variables describe the daily evolution of the population of genotypes ( is the ancestor genotype and the other are mutants) and, thus, genotypes’ frequencies obey the relationship . Next, we describe the three phases in more detail.

2.1 Growth

In the beginning of any day, , the population consists of a total of cells. The population can include various sub-populations of sizes , so that . The daily growth phase is modeled by a deterministic exponential growth in which cell divisions occur at fixed growth rates per corresponding genotypes . Hence, after one day, the number of cells of each sub-population grows by the corresponding growth factor , so that the grown subpopulations are of sizes and the net population at the end of day is of the size . Without loss of generality, assume that genotypes are ordered with respect to their fitnesses as . The selective advantage for each genotype is defined as and can be easily computed as


In some models (see e.g. [7, 28]) it is assumed that the amount of nutrients is limited during the day and the population can only grow to a fixed size, . However, in our model we assume that the population growth is not bounded by any external factors.

2.2 Mutation

After growth, the population is typically several orders of magnitude larger than the ancestor population, and even for small mutation rates of the order of , a non-zero number of emergent mutants can be expected. Here we consider only advantageous mutations, since deleterious mutations are typically quickly eliminated from the population due to a lower selective advantage. We assume that mutations are independent random events, occurring with equal mutation rates for . Here represents the probability of mutation; note that it does not specify the genotype after the mutation. The formalism presented here can be easily generalized for different mutation rates which represents the probability of mutation of genotype genotype . The only restriction for mutation rates is that they are small and are of the same order, i.e. for .

We introduce the transition matrix describing the probability that mutants from the genotype- subpopulation become of genotype . The mutations are modeled by the Poisson distribution with means


with denoting the number of mutants of genotype evolving from the genotype- subpopulation, and denoting the size of the genotype- subpopulation.

Since sum of Poisson random variables is a Poisson random variable, we can compute the distribution for the total number of emerging genotype- mutants. Thus, the random variable


is Poisson with mean


Therefore, the probability that genotype- subpopulation produces mutants is


In our numerical simulations we sample a Poisson random variable with the distribution given by the equation above to generate the number of genotype- mutants. Random variable describes the total number of mutants produced by the genotype- subpopulation. Next, these mutants need to “distributed” among all other genotypes with appropriate probabilities.

We consider only advantageous mutations so that cells of genotype only mutate into stronger cells of genotype , with . Therefore, since genotypes are ordered by their fitness, the mutation matrix, is upper triangular, i.e. for , since mutations are advantageous and genotype cannot mutate into itself. The adaptive behavior discussed above is of main interest in many practical applications, including the genetic evolution experiments of E.coli. However, the approach described here can be easily extended to any mutation matrix, including the case with deleterious mutation, so that is a full matrix.

As a particular example, we use the mutation matrix which describes one of the most common scenarios when every mutation occurrence in genotype- population is equally likely to become any of the stronger mutants. In this case, the transition matrix is based on a uniform distribution among mutants and becomes


and entries of the matrix can be described as


After the mutations occur sub-populations of each genotype become different and the conditional distribution for sub-populations after the mutation can be described using the multinomial distribution (corresponding to the nonzero entries) with


Consider a particular genotype- subpopulation and define the number of emigrants and immigrants as

respectively. Then, the net change in the size of the genotype- sub-population is

Given particular values of one can easily compute the changes in the size of genotype- subpopulation due to mutations. In addition, one can show that mutation step does not alter the net population size, i.e. .

2.3 Dilution

During the dilution step, a random sample of size is extracted (after growth and mutations) from the population of size . We model the dilution step using the multinomial distribution based on the frequencies of the individual genotype subpopulations, with

Thus, letting denote the number of genotype- cells selected after dilution, the distribution of , follows



The process of growth, mutations, and dilution is repeated daily for the duration of a particular experimental run. The growth is deterministic, whereas the mutation and dilution steps are random. These three steps outline the adaptation model for a single isolated population. The process modeled by these three steps is a discrete-time Markov chain for the random variables which describe the time-evolution of frequencies for the genotypes in the population. All frequencies sum up to one, i.e. , and a sample path of this process can be easily generated given the initial distribution of genotypes .

2.4 Coloring

A major element in this study is introducing a visually distinguishable feature of color that is inherited by all mutations. Without altering major properties of the cells, the entire population is partitioned into two subpopulations. It is colored with two biomarkers so that half of it becomes white and the other half, red. This use of neutral markers has been commonly used in the relevant studies (see e.g. [18, 13, 10, 5]) since this coloring does not alter the biology of the cells. The colors are inherited by the children and subsequent mutants emerging from the initial sub-population of a particular color.

The white and red coloring described above doubles the number of frequency variables in the model. There are now white and correspondingly red possible genotype subpopulations. The model presented above for the single-color case can be easily extended to the two-color case by adding more elements to the frequency array . Essentially, this is equivalent to introducing twice as many genotypes in the model. Thus, for the model with coloring, we introduce and for representing the genotype- white sub-frequency and red sub-frequency, respectively. Altogether, we define and describing the frequency of -genotype in the white and red sub-populations, respectively. The whole population is then described by the array of genotype frequencies .

Please note that the growth and mutations steps are independent for the white- and red-sub-populations because the white and red cells cannot mutate into each other. Therefore, the growth and mutation steps for the white- and red-sub-populations can be analyzed and computed in parallel. However, the dilution step depends on the total size of the population and takes into account the frequency of occurrence of all white and red genotype- sub-populations in the whole population. Therefore, the formula (8) has to modified appropriately to describe the random selection of cells from sub-populations, where describes white genotype- sub-populations and describes red genotype- sub-populations.

3 Estimation Method

An output of realistic experiments on genetic evolution is an ensemble of trajectories of different genotypes. In case of the E. coli, there is also a distinction between the white and red cells. This ensemble arises because the experiment is repeated many times and the initial conditions for each trajectory can be controlled very well and correspond to 50-50 proportion of white and red cells of genotype 1 (lowest fitness). For each trajectory, several mutation events can occur which corresponds to radical shifts in the relative proportion of white and red cells. However, for each particular trajectory, the precise time of mutations (i.e. which day) and the precise type of mutations is unknown. The experimental observer can only record changes in the relative proportion of white and red cells; this information is recored daily with quite high accuracy. Therefore, our goal is to develop an optimization framework which will allow to accurately estimate (i) time of the mutation, (ii) fitness of the new emergent white- (or red-) cell mutant.

To achieve the goal outlined above, we concentrate here on trajectories which have only one or two mutations. This is motivated by biological experiments since each trajectory is recorded for a relatively short period of time (approximately 40 days), and the majority of trajectories only have one or two mutations. Trajectories with three or more mutations correspond to rare events and, in addition, such trajectories cannot be estimated accurately due to the relatively high sensitivity of later mutations (third, fourth, etc.) to small perturbations of parameters for earlier first and second mutation events.

To develop our estimation framework, we first derive explicit analytical expressions for the most-likely evolutionary trajectory conditioned on the observed composition of the bacterial population on the previous day, i.e. we assume that frequencies of bacterial genotype are known on the previous day, and we derive the most-likely population frequencies on the next day.

Some mutation events can be unobserved due to the fact that they are lost during the dilution step. Therefore, we develop explicit expressions for the most likely trajectory of the stochastic model outlined in section 2 given that there is only one observed mutation during the time-span of an evolutionary trajectory (however, the time of the mutation and the fitness of the emergent mutation are unknown) and we also develop expressions for the most-likely trajectory of the “locked-box” stochastic model given that there are only two observed mutations. We then develop an optimization procedure to compute numerically the best match for the time(s) of the mutation(s) and selective advantage of the emergent genotype(s) to best match the observed white-to-red ratio in the overall E.coli population during the whole time-span of the evolutionary trajectory. Finally, since an ensemble of such trajectories is available, we are able to estimate the number of possible genotypes for each sub-population and their fitnesses.

3.1 Most-likely trajectory given that there is only one mutation

In this section we develop analytical expressions for the most-likely trajectory of the stochastic model outlined in section 2 given that there is only one observed mutation. In particular, we develop iterative formulas that yield expected frequencies based on frequency data on the previous day. The one-mutation trajectory has three unknown parameters - (i) day of the mutation, , (ii) fitness of the emergent mutants, , and (iii) number of mutants that have emerged on day . The optimization algorithm described later will use fitting techniques to optimize these unknown parameters of the most-likely trajectory for a particular observed frequency curve.

From here on, assume that there is only one observed mutation event which occurs at an unknown time . At the beginning of the day and on each preceding day, there is only the one ancestor of genotype in the whole population and no mutants, so that white and red frequencies . Note that the observed frequencies of the white- and red-cells, and (t), respectively, are not necessarily equal to for , because of the stochasticity of the process. We will assume that there is a mutation in the white sub-population. Formulas for the mutation in the red sub-population are identical up to the index change.

At the beginning of day :
There is an unknown number of mutants that have emerged in the white population after the daily selection (dilution) done at the end of day . Denoting the (unknown) mutation rate by , has a Poisson distribution with mean Assume that these emerged mutants are white and of unknown genotype having an unknown growth factor .

At the beginning of day :


After the mutation event, the process only exhibits growth and we can proceed iteratively for , where is the final day of the observations.

At beginning of day :
The initial population contains only white cells of genotypes and only red cells of genotype , so that

Growth on day :
Subpopulation sizes become:

and all other sub-population sizes are zero. Introducing the variable as

the total population size at the end of the growth phase of day becomes .

Mutation on day :
On any given day various mutations can occur in both white- and red-subpopulations. However, these additional mutants are to be eliminated form the population during the dilution step. Therefore, denote the random numbers of mutants generated at the end of day as:

where and denote frequencies of genotypes after mutations in white and red sub-populations, respectively. Please note that the number of mutants (, , and ) defined above are Poisson random variables as defined in (2). For example, the conditional mean for the number of mutants 1-whites -whites is , which defines as Poisson random variables over rational numbers, i.e.

and conditional mean and variance of are and . Similar expressions hold for other two genotype frequencies after mutation.

Dilution on day :
Conditioning on the following non-zero frequencies at the beginning of day :


the dilution step must realize the following event = {random sample size picked from terminal population of day contains only types and whites, and type reds}.

The conditional distribution of such random samples given that event holds can be shown (using the multinomial properties and that the probabilities ignoring those frequencies are equal to ) to be a multinomial , where one extracts a random sample of size from the restricted population, defined by = {subpopulation consisting of only the white-, white-, and red- cells existing at the terminal stage of day just before dilution}.

The size of the population RestPop is given by:


where is a Poissonian random variable which corresponds to the random fluctuations of the population due to mutations


In particular, describes how many cells has left the population RestPop due to mutations. Moreover, if we define random variables which correspond to changes in each sub-population as


then and within the sub-population RestPop the frequencies before the dilution of white-1, white-, and red-1 genotype are given by


Note, that , , and are , where is the mutation rate and are growth factors. In particular, we can compute the conditional means and variances of random variables in (13). The conditional means are


where we used . The conditional variance of , given , can be computed easily since mutation events are independent and conditional variances of can be computed explicitly. Therefore,


where we used and the fact that . Cases for , , and are similar.

Therefore, frequencies in (14) can be approximated by their first-order expansions


and the neglected terms in the expansion above are much smaller than the leading order terms due to the small mutation rates.

Please note that in the discussion above , , , , and are defined in terms of frequencies and mutants on day . Therefore, can be treated as a constant in conditional expectation given , and expected values of mutation terms on day can be computed explicitly in terms of frequency information on day .

Expected frequencies on day with :
Given and that event holds, the mean conditional frequencies after selection are identical to the existing frequencies before selection, i.e. and for . Therefore, if we define



where we’ve taken into account that . Similarly, can be expressed as


where we also used that .

Note, that is the only unknown parameter in equations (18) and (19) since the frequency of the type-1 and type- white cells can only be observed together, i.e. . However, we can approximate the unknown frequency of type-1 white cells on day by its expectation, and arrive at the approximation of


where is the observed frequency of white cells on day , and is the expected frequency of type-1 white cells computed form the previous iteration, and we approximated as


Similarly, we can develop an approximation for


Please note, that the expressions above are general and do not make an assumption that the mutations are not reversible ().

3.2 Most-likely trajectory given that there two events

In this section we develop formulas for the most-likely trajectory with two mutation events. To follow the notation in the previous section, we assume that the first event is described by the triple , where is the time of the mutation, is the genotype after the mutation, is the number of mutants. Without the loss of generality, we can assume that the first mutation event occurs in the white sub-population (if this is not the case, just rename the colors). However, the second event can occur in the same sub-population (white) or in the opposite sub-population (red). We will develop expressions for the most-likely trajectory for both cases.

The second mutation event is described by the parameters , where is the time of the second mutation, is the mutation genotype, and is the number of mutants. To detect significant second events, the genotype in the second mutation should be stronger that the genotype of the first mutation, otherwise, the second event is not detectable with a significant level of confidence. Therefore, we assume that .

The analysis of the two-event trajectory is quite similar to the analysis presented in section 3.1 and we present here only the final expressions for the most-likely trajectory with two events.

3.2.1 Same Colors

Here we present formulas for the most-likely trajectory with two mutations in the white sub-population. Then, on day (for ), conditioned on frequencies (all other frequencies are zero) we obtain


where . Since and , becomes


Therefore, we can compute formulas in (23), (24), (25) compute iteratively by approximating , and substituting into the expressions above.

Different Colors

Here we develop formulas for the most likely trajectory when the second mutation event occurs in the red sub-population. Then, on day conditioned on frequencies we obtain


where . Since and , expression for becomes


Similar to the previous sections, we approximate and to compute the expressions above interatively.

4 Simulated Experiments

Given the expressions in (19), (20), and (22), we can formulate the optimization problem to estimate parameters of the underlying population evolution stochastic model. In particular, we assume that the mutation rate, , the mutation matrix , and the fitness of the ancestor genotype, , are known. The goal of our approach is to estimate the selective advantage given by (1). In addition, parameters (the time of mutation) and (number of emerging mutants) are also unknown and should be optimized.

We use the simulated data to illustrate our approach. We use the following parameters in our simulations and consequent reconstruction of the model’s fitnesses -

  • , the number of allowed genotypes,

  • , the number of simulated trajectories,

  • , initial population size,

  • , the selective advantages.

We considered the mutation matrix

with mutation rates . These parameters are consistent with numerical values reported in past and ongoing experiments (see e.g. [9]).

After simulating the experiments (trajectories), day-to-day frequency data was stored. It consisted of the white and red population frequencies along with individual mutant frequencies of each color, at the days’ beginnings. Examples of trajectories with one mutation and two mutations are presented in Figures 1 and 2, respectively. We considered that a trajectory reached a fixation if (or ) and stopped the simulation at . We allowed the maximum length of the trajectory in the absence of fixation to be days. We also considered that a mutation event has occurred when a mutant frequency reached of the population size. Approximately 70% and 23% of trajectories had 1 and 2 mutation event, respectively. Approximately 0.5% of trajectories had no mutation events and the average number of events before fixation was approximately 1.34.

We would like to emphasize that in experimental data only the frequency of the overall white population (solid line ) can be observed. The overall frequency of the red sub-population can also be observed, but does not carry any additional information. Therefore, the goal of this work is determine the time of emergence of white or red mutants only from the overall frequency of the white sub-population.

Figure 1: Examples of two trajectories with one mutation showing the total frequency of white-cell population and a mutant sub-population. Black line (labeled ) denotes the total size of the white-cell population. Left part - emergence of the red genotype- mutant (red line labeled ), Right part - emergence of the white genotype- mutant (blue line labeled ).

Figure 2: Examples of two trajectories with two mutations showing the total frequency of white-cell population and mutant sub-populations. Black line (labeled ) denotes the total size of the white-cell population. Left part - emergence of the red genotype- mutant (red line labeled ) and later emergence of the white genotype- mutant (blue line labeled ), Right part - emergence of the red genotype- mutant (red line labeled ) and later emergence of white genotype- mutant (blue line labeled ).

5 Fitness Estimation

In this section we formulate an optimization problem for computing the selective advantage of the emergent genotype. We will also develop an automatic test to verify whether each trajectory has one or two mutation events.

5.1 First Mutation Event

We will first apply expressions (20) and (22) for one mutation occurrence to define the optimization problem given that only one mutation occurred. We would like to point out that these expressions depend on four parameters - (i) the time of emergence, , (ii) number of mutants emerging on the day , , (iii) the genotype , and (iv) selective advantage of the mutant, . The choice of in (32) impacts which entries of the mutation matrix and are used in computing + in (20) and (22), respectively. We would like to note that the expected frequency of the white cells during the mutation can be computed as , , , . For they can be computed using recursive formulas (20) and (22).

Therefore, we define the following first-event optimization problem


where . We select the initial and final times for optimization as and , where is the threshold value to determine when the genotype sufficiently developed in the population. Typical values we used are . We use the value of the threshold in order to use the rest of the trajectory to define the optimization problem for the second-mutation event.

It is possible to implement a number of strategies for adjusting the final time of estimation, . In particular, one can use varying and estimate the goodness of fit using the cost function in the right-hand side of (32). It is also possible to include into the optimization problem in (32), i.e. optimize the cost function with respect to the final time . However, we expect that more refined procedures for determining would produce similar results, since the optimization procedure in (32) already produces a fairly narrow histograms of possible selective advantages, , as demonstrated later in this section.

5.2 Second Mutation Event

After the optimization problem for the first mutation event has been computed, we can use the rest of the trajectory data (i.e. with ) to define an optimization problem for the second mutation event. However, first, we need define an explicit criteria for the occurrence of the second mutation event. To this end, we define that a second mutation occurred if there is a time