# Emergence of clones in sexual populations

###### Abstract

In sexual population, recombination reshuffles genetic variation and produces novel combinations of existing alleles, while selection amplifies the fittest genotypes in the population. If recombination is more rapid than selection, populations consist of a diverse mixture of many genotypes, as is observed in many populations. In the opposite regime, which is realized for example in the facultatively sexual populations that outcross in only a fraction of reproductive cycles, selection can amplify individual genotypes into large clones. Such clones emerge when the fitness advantage of some of the genotypes is large enough that they grow to a significant fraction of the population despite being broken down by recombination. The occurrence of this “clonal condensation” depends, in addition to the outcrossing rate, on the heritability of fitness. Clonal condensation leads to a strong genetic heterogeneity of the population which is not adequately described by traditional population genetics measures, such as Linkage Disequilibrium. Here we point out the similarity between clonal condensation and the freezing transition in the Random Energy Model of spin glasses. Guided by this analogy we explicitly calculate the probability, , that two individuals are genetically identical as a function of the key parameters of the model. While is the analog of the spin-glass order parameter, it is also closely related to rate of coalescence in population genetics: Two individuals that are part of the same clone have a recent common ancestor.

Genetic diversity is the fodder for natural selection and the fuel of evolution. It is generated by mutations and by recombination, which reshuffles genomes and thereby accelerates the exploration of the space of genotypes. The latter consists of all of the possible combinations of the genetic variants, a.k.a. alleles, present at (biallelic) polymorphic loci. Of course the number of polymorphic loci, , itself changes as new mutations arise forming new polymorphisms while “older” polymorphisms disappear from the population. The population itself consists of individuals which sample only a small fraction of the possible genotypes, i.e. . The dynamics in genotype space is therefore highly stochastic.

Of particular importance are those genetic polymorphisms that affect the fitness of individuals, the fitness being defined as the expected number of offspring in the next generation. Selection on its own would amplify the number of high fitness individuals and condense the population into a few “clones” comprising a large fraction of the population. In populations of sexually reproducing organisms, the growth of such clones and the subsequent decline of genetic diversity are checked by recombination. Two parents, if chosen from different clones, produce offspring that are distinct from either parent. In obligate sexually reproducing species, the formation of clones is prevented since reproduction is coupled to recombination and no parent can produce genetically identical offspring. Many species, in particular microbial species and plants, can reproduce both by clonal reproduction (e.g., budding in yeast, selfing or vegetative reproduction in plants) or by sexual propagation. Such facultatively sexual species display a great variety in their mode of propagation, the frequency of outcrossing, and the heritability of fitness. The latter is very important in sexual populations, as it determines to what extend recombinant offspring benefit from the same fitness advantages that made their parents successful.

The aim of this article is to describe quantitatively the competing tendencies of natural selection and recombination with regard to genetic diversity, focusing on facultatively sexual organisms. The competition of natural selection with recombination is the dominant mechanism of evolution on relatively short time scales, on which mutational input is negligible compared to diversification by recombination. This situation is particularly relevant to adaptation following a major outcrossing event, or within a so called hybrid zone, where diverged genotypes have come together to generate a hybrid population. As this hybrid population continues to breed within itself, it can give rise to a bout of rapid adaptation, as beneficial alleles from both original populations are combined to form novel fit genotypes that spread within the hybrid population (Ortíz-Barrientos et al., 2002).

We will focus on the probability, , that two random individuals sampled from the population have the same genotype, i.e., are clones of each other. This quantity is important for population genetics, since it characterizes genetic diversity (its inverse is a measure of the number of dominant clones) as well as the dynamics of coalescence. Whenever two individuals are part of the same clone, they share a recent common ancestor, such that is proportional to the rate of pair coalescence. In the canonical theory of neutral coalescence, this rate is equal to the inverse population size. We will find here that , and with it the rate of coalescence, is determined by the clonal structure rather than the population size at low outcrossing rates.

Much of our analysis is presented in the context of a facultatively sexual population, where the evolving entities are individuals and their genotypes. Note, however, that some of our considerations also hold for contiguous segments of chromosomal DNA that are short enough to undergo only infrequent recombination even in obligatory sexual reproduction. In that case, we would be interested in the probability of a given chromosomal segment to be identical for a random pair of individuals drawn from the population. In this context, is the homozygosity of the population at this extended locus at which many different alleles segregate (this is of course a much weaker condition than clonal relation of whole genomes). A complementary view of this probability relates it to the “haplotype diversity”, i.e., the number and population distribution of distinct genomic sequences for the chromosomal segment in question. We shall return to this important question in the discussion.

In an earlier publication (Neher and Shraiman, 2009), we have shown that clones are absent in the so called Quasi Linkage Equilibrium (QLE) “phase” corresponding to frequent outcrossing limit but appear in a regime of small out-crossing frequency , with depending on the complexity of the fitness landscape (i.e. the extent of fitness additivity) and weakly dependent on . We will review this finding and present a more detailed analysis of the time dependence of this condensation phenomenon, as well as its quantitative dependence on fitness additivity and hence heritability. Furthermore, we also study the extend of clonal condensation in a steady state where the fitness distribution is moving towards higher fitness at a constant velocity. We will put these results into the context of the Random Energy Model (REM) of Statistical Physics, introduced and solved by Bernard Derrida (Derrida, 1981; Gross and Mezard, 1984; Mézard and Montanari, 2009). In fact, is closely related to the Parisi order parameter and the onset of clonality is closely related to the spin-glass transition observed in simple models of disordered media such as the REM.

Below, we will first draw the analogy between the dynamics of selection in finite populations and the REM. This analogy is particularly simple for . Whereas the condensation transition in the REM occurs below a certain critical temperature, the transition to clonal population structure occurs beyond a certain critical time, . Hence the population genetic analog of temperature will be the inverse time. We shall then generalize the model in order to include (facultative) recombination and fitness landscapes with varying degrees of epistasis, i.e., genetic interactions. The results for mixed epistatic and additive models enables us to set the analysis into the context of the “traveling wave” approximation, which has recently emerged as a powerful representation of adaptive population dynamics in genetically diverse populations (Tsimring et al., 1996; Rouzine et al., 2003; Cohen et al., 2005a; Desai and Fisher, 2007; Neher et al., 2010; Hallatschek, 2011). Our results therefore provide insight into how recombination and epistasis affect the dynamics and structure of adapting population waves and define conditions under which genetic diversity is maintained or lost.

## I Natural selection and the Random Energy Model.

In the absence of recombination or mutation, the frequency of any individual genotype increases if its fitness lies above the population mean and decreases otherwise. Identifying fitness with relative growth rate and ignoring stochastic effects, the expected number of individuals with genotype and fitness obeys:

(1) |

where is the time derivative of (we assume ) and the mean fitness is defined by averaging over the whole population. Defining the rate of growth of clones by differential fitness (relative to the population mean) ensures constant population size: . Since the mean fitness term is shared by every genotype in the population, the frequency of a particular genotype is given by

(2) |

where is a time dependent normalization constant known as partition function. Hence the distribution of clones in the population has the Boltzmann form, where inverse time plays the role of temperature. The dynamics of the population will now depend on the initial distribution of fitness values . In the generic case where fitness depends on many loci, each one giving a small contributions to fitness, the density of fitnesses is approximately Gaussian (with zero average and variance ) and the statistics of clones in the populations is identical to that of the REM (Derrida, 1981). For small population averages are dominated by the vicinity of the peak of , with many individual genotypes contributing. However, for , the dominant contribution shifts all the way to the leading edge , which corresponds to the maximum fitness sampled from the Gaussian in a population of size . This means that with increasing time, i.e., decreasing “effective temperature”, the population shifts to fitter and fitter genotypes and eventually, for , “condenses” into the fittest. This condensation phenomenon manifests itself in a non-negligible probability that two randomly chosen individuals from the population have the same genotype. The latter is equal to the average squared genotype frequency in the population at time . The average participation ratio is obtained by averaging over the fitness values of the initial genotypes

(3) |

where we have used the integral representation of and the fact that the are the i.i.d. random variables sampled from . This calculation is carried out explicitly in (Mézard et al., 1985; Mézard and Montanari, 2009) and leads to the following quantitative result: In the limit of large populations, is given by

(4) |

with . Fig. 1 shows how , measured in a computer simulation (see below), increases in time and compares to the REM prediction. Note that the sharp transition exhibited in 4 is realized only in the limit of large , which is hard to achieve both in reality and in a numerical simulation. In both cases one expects to find a crossover rather than a sharp transition. Nevertheless, considering the idealized limit provides a very useful scaffold for the analysis, and the REM also allows us to compute the corrections and therefore determine the detailed nature of the crossover.

### Recombination and heritability

Sexual reproduction mixes the genetic material from two parents and thereby produces new genotypes from existing genetic variation. To account for recombination, we modify the equation describing the evolution of genotypes as follows:

(5) |

where is the number of individuals with genome , and accounts for the probability that genotype is assembled by recombination from the parental genotypes and . In the absence of recombination, the only relevant characteristic of a genotype is its fitness, and we could study the evolution of the population along the fitness coordinate instead of on the hypercube of possible genotypes. To achieve a similar simplification with recombination we need to know how the fitness of recombinant offspring relates to that of the parents to map Eq. (5). The correlation between parental and offspring traits is known as heritability and depends on the underlying genetic architecture. If a trait depends on many loci in the genome in an additive manner, i.e., different loci make independent contributions to the trait, the trait value of offspring will be approximately Gaussian distributed around the parental mean. Such traits are called highly heritable since the correlation between trait values of parents and offspring is high. Conversely, if a trait depends on specific combinations of alleles at many loci (epistasis), these combinations will be disrupted with high probability in sexual reproduction. Such traits have a low heritability in sexual reproduction since the correlation between parents and offspring is low.

The trait we are mainly interested in here is fitness, which in general involves many phenotypes and depends on the environment. We shall set aside the issues associated with fluctuating environment and possible time-dependence of selection, and focus on how fitness depends on the genotype. As already noted, the genotype space is a high dimensional hypercube, , and for present purposes, the fitness is a complicated (but fixed) function of the genotype parameterized as

(6) |

where the terms define the additive contribution of locus to fitness, while higher order terms correspond to epistatic interactions. The relation of the fitness of an offspring relative to that of its parents and the density of states, i.e., the distribution of fitness over all possible genomes, is illustrated in Fig. 2.

The higher the order of the interactions, the less likely it is that a particular set of loci that contributes to one parent is inherited uninterupted. As a consequence, the heritability of interaction terms goes down with increasing order. Interaction terms of high-order are essentially independent of the parents. This is reminiscent of high-order spin glass models, where the energies of any two configurations that differ at a macroscopic number of spins are uncorrelated. The large limit of such -spin glasses is the random energy model, where the energy of each configuration is an independent draw from a Gaussian distribution (Derrida, 1981; Gross and Mezard, 1984).

## Ii The Model

In general the genetic architecture of fitness is expected to be complex with additive contributions as well as epistatic contributions of various orders. It is very instructive, though, to consider a simplified model that includes a heritable component and a random epistatic component. Specifically, let us assume that fitness can be decomposed into an additive component and an epistatic component . If two individuals with fitness and produce an offspring, its additive component, , is drawn from a Gaussian with variance around the parental mean , while its epistatic component, , is independent from that of the parents and is drawn from a Gaussian centered around with variance . Hence the recombination kernel of this model depends only on the additive fitness of the parents

(7) |

For much of the analysis below, we will simplify this model even further and assume that the additive fitness of recombinant genotypes is independent of that of the parents and simply drawn from a Gaussian with variance centered around the population mean . This distribution is expected if the distribution of in the population is approximately Gaussian. This simplified model is easier to analyze, while displaying the same qualitative behavior, as has been checked by numerical simulations. The joint distribution of and in the population then evolves according to

(8) |

The ratio of and define the extent of fitness heritability in the process of recombination. We define “heritability”, , which will be one of the key independent parameters of the model, as:

(9) |

To illustrate the behavior of the model at different recombination rates and different heritabilities, we implemented it as a computer simulation. The computer program keeps track of clones with fitness and and population size . At each generation, the size of a clone is updated by a Poisson distributed number with mean , where is a density regulating term. A clone is deleted if its size is 0. In addition, at each generation, new clones of size 1 are seeded, with additive fitness drawn from a Gaussian and epistatic fitness drawn from . Alternatively, we can impose a “velocity” of the additive fitness by drawing its value from instead of .

Due to the stochastic nature of reproduction, the majority of all initial genotypes will rapidly die out, even if very fit. Of the initial genotypes, only a fraction remains, where is a measure of the typical strength of selection. Similarly, of the clones produced by recombination, only a fraction “establishes”. Those clones that do establish are on average larger than the deterministic expectation by a factor of . We will neglect them in most of the formulae below. The dominant effect of this stochasticity can be accounted for by rescaling to inside logarithms. We will reinstantiate this correction in comparisons to simulations when necessary. This stochastic effects are of minor importance for the phenomena we study here since they are overshadowed by the randomness inherent in the fitness and seeding time of new clones.

## Iii Results: Structure of adapting populations

Depending on the rate of recombination and the degree of fitness heritability , the model formulated above exhibits very different behaviors. Before we present a formal characterization of the population structure and the dynamics of evolution, it is instructive to discuss the dynamics of the model as observed in simulations.

Fig. 3 shows the distribution of additive and epistatic fitness at two different times for scenarios where additive (left) or epistatic (right) fitness dominates. If fitness is predominantly additive, the population adapts and moves towards high additive fitness as long a new genotypes are generated by recombination. The velocity is given by the variance of the population along the additive direction. If most of the variance is along the epistatic direction as in the right panel, the adaptation of the population is much slower. In addition, large fractions of the population tend to be condensed into a small number of clones, as we will discuss at greater length below. As the variance along the additive direction decreases to zero, so does the velocity.

The population structure and dynamics at different recombination rates are illustrated in Fig. 4. The figure shows the composition of the population as a function of time for different recombination rates and different heritabilities. In each panel, the population was initialized as a diverse sample from the density of states and was subsequently allowed to evolve via selection and recombination. Hence each panel shows a transient, which gives way to a steady state behavior. Each genotype in the population is assigned a specific color, and individuals are ordered according to fitness in each time slice with the fittest individuals at the bottom.

The left column of Fig. 4 shows evolution governed by an all additive fitness function (, ) for different recombination rates. In this case, the population moves steadily to higher fitness since new genotypes, fitter than their parents, are constantly produced. At low recombination rates, the population consists of a small number of clones that arise at the high fitness edge (bottom part of the panel), grow as time progresses, and shrink again once they fall behind in fitness (i.e., when they move towards the center of the panel). As the recombination rate is increased to values comparable to , large clones cease to exist. Most of the population is made from nearly unique genotypes that are short lived. This pattern becomes even more pronounced at larger recombination rates. The high recombination limit of this dynamics can be understood in mean field theory (MFT) described in Section III.1. The low recombination limit is described in greater detail in Section III.4.

The right column of Fig. 4 shows the opposite limit, when fitness is not
heritable but completely epistatic, corresponding to a fitness function with
random high order interactions. In this case, we do not expect the population to
move towards higher fitness indefinitely since the epistatic fitness is
non-heritable. At low recombination rates, we expect that the fittest genotype
in the population will grow while generating new recombinants distributed around
zero fitness. With the growing genotypes, the mean epistatic fitness will
increase until the selection on the fittest genotype, , equals
the rate at which it produces recombinants. This behavior is clearly seen in all
four panels with on the right. Since the recombination rate increases
from top to bottom, the size at which the fittest genotype stabilizes decreases,
while the “dust”-like fraction of the population that consists of short-lived
unfit recombinants increases. Occasionally, recombination generates exceptionally fit genotypes,
which have a chance of replacing the previous record
holder. As time progresses, these records become rarer and rarer according to
the well known results on record dynamics^{1}^{1}1Note that a slightly
different dynamics is observed if recombination of individuals with the same
genotype do not produce a novel genotype. In this case the effective outcrossing
rate of large clones goes down and the population quickly condenses into a
unique clone.. The clone structure and its dynamics for the fully epistatic
case will be discussed in Section III.2. The high recombination limit
can again be understood within mean field theory; see Section III.1.

The center column of Fig. 4 shows the clone structure and dynamics of a case of intermediate heritability. At low recombination rate, the population is dominated by clones, and clones exist even when the additive population is only dust. However, unlike the case, no clone dominates forever, but new clones are continuously established. The velocity in this regime will depend critically on how often such new clones are produced and on how much they advance the additive fitness. We will investigate this case below in Section III.4.

### iii.1 Quasi-Linkage Equilibrium (QLE) and its breakdown

At large , this model admits a factorized solution with

(10) |

derived in (Neher and Shraiman, 2009). This solution travels toward higher additive fitness with a velocity equal to the variance of the additive fitness of recombinant offspring, while the epistatic fitness has steady distribution where adjusts itself to normalize the distribution. This factorization is a hallmark of Quasi-Linkage Equilibrium (Kimura, 1965; Turelli and Barton, 1994; Neher and Shraiman, 2011a), where additive components evolve independently, while epistatic components are in a quasi-steady balance between selection and recombination. However, this factorized solution breaks down as soon as there are individuals with epistatic fitness larger than . In that case this solution is no longer normalizable, and additive and epistatic components cease to be independent.

Fig. 5 illustrates this condensation behavior. The smooth distribution is a deformed Gaussian which diverges at . Beyond , the population consists of growing clones whose size depends on when and where they were seeded.

In the following we will characterize the population structure and study how the velocity in the direction of increasing additive fitness depends on parameters. We will begin by studying the purely epistatic case with recombination, which extends the REM by continuous seeding of novel recombinant genotypes. We will then study the condensation phenomenon in presence of additive fitness, where the population forms a traveling pulse in fitness.

### iii.2 Clonal condensation for zero heritability

Suppose we start with a diverse sample of size from the distribution of epistatic fitness and inject new recombinant genotypes with rate . After a time , the population consists of initial clones and a Poisson distributed number of new clones. A clone of age and fitness will have approximately size

(11) |

where the mean fitness can be thought of as a Lagrange parameter that keeps the overall population size constant. To calculate we have to average over the fitness of the initial clones and over the fitnesses and seeding times of all subsequently produced recombinant genotypes:

(12) |

where is the average of over the of initial genotypes (note that “initial” means that ). is the corresponding average over recombinant clones, which in addition are averaged over their age ; see Appendix A for derivation. Similarly, and are averages of over the and . The sum over – the number of recombinants generated in time – can be performed easily. These integrals are evaluated in the Appendix B in the limit of large . One obtains an approximation

(13) |

valid for , with

(14) |

We observe that the participation ratio substantially deviates from zero only for and only upon reaching , where for

(15) |

This provides us with estimates of the critical time of condensation and the critical below which condensation occurs. Condensation is delayed compared to the case of and the critical time (analogous to inverse critical temperature) diverges as the recombination rate approaches its critical value . Of course, this divergence only occurs in the limit of going to infinity. In practice, with realistic population sizes , one observes not a transition, but merely a crossover between different regimes of behavior. More elaborate expressions for and at finite can be found in Appendix B. Note that the values of and themselves scale with the square root of the logarithm of .

is shown in Fig. 6A for different recombination rates. With increasing recombination rate, the early plateau of is reduced and condensation is delayed. Each individual run condenses rapidly once the dominant clone is large enough that it often recombines with itself, which leaves the genotype intact. Also, the lack of condensation for is clearly seen. Fig. 6B shows the inverse time to condensation for different recombination rates and population sizes, confirming Eq. (15) for . This transition with increasing was identified in (Neher and Shraiman, 2009). Intuitively, the divergence of the time to condensation is due to the fact that the growth rate of best genotypes decreases with increasing recombination load . As soon as is smaller than zero for all existing genotypes, all clones are short-lived and no condensation can occur. Similar behavior has been observed in populations with heterozygote advantage or disadvantage (Franklin and Lewontin, 1970; Barton, 1983).

Population dynamics for , including , has the nature of a “records process” (Krug and Jain, 2005; Sire et al., 2006). As time progresses, more and more genotypes are sampled from the density of states and tested by selection. As a consequence, the population will come across fitter and fitter genotypes, resulting in a record process with trials. Even if initially no genotype with is present, such a genotype will eventually be found and result in condensation. On the other hand, any finite population will eventually reach a final “record” (with fitness ), giving rise to the clone that will eventually take over the whole population. This is because a clone rapidly fixates once it is large enough to frequently recombine with itself. To prevent its fixation a new record would have to be created with the fitness advantage within the time delay of , which is very unlikely beyond certain .

### iii.3 Traveling solutions for additive fitness

In the opposite limit when , the population moves towards higher fitness with a velocity that is given by the additive variance for sufficiently large . It will be useful to parameterize the ratio between the velocity and the scale of selection as . At high recombination rates, we have , while we generally expect at low recombination rates. In contrast to the case , no aging dynamics is observed for . Instead, old genotypes are constantly replaced, and dominant genotypes in the population have a finite characteristic age. Hence the initial genotypes are rapidly forgotten and we can restrict the analysis to recombinant genotypes.

Genotypes seeded at the high fitness nose of the population distribution can grow large and dominate the participation ratio . Since is closely related to the rate of coalescence, it is instructive to calculate explicitly assuming before considering the case at low or .

In the steady state, a clone is specified by its age and its initial fitness . Assuming , we then find

(16) |

To evaluate , we have to evaluate the integrals and that appear in Eq. (12) and are defined in the Appendix A, see Eq. (32). The integrals involve averaging over the initial fitness and all possible ages . For , one finds because relatively young and small clones dominate. In contrast, has a significant contribution from rare old clones, which dominate because of their large size (through the factor). The evaluation of the integrals is detailed in Appendix C with the result Eq. (55). We obtain

(17) |

The first term is of order 1 and corresponds to the contribution of young clones. Its exact value depends on the details of the stochastic dynamics, such as the offspring distribution. The second term is the contribution from old large clones. The participation ratio therefore becomes

(18) |

For small , does not scale as . In other words, the larger the population is, the larger are the clones it is composed of and those largest clones dominate . This result is in agreement with arguments made for rapid coalescence in facultatively sexual populations in (Neher and Shraiman, 2011b; Rouzine and Coffin, 2010, 2007). Fig. 7A shows for as a function of . As soon as , increases and no longer scales with , as predicted by the mean field calculation above. The figure also shows the explicit expression in Eq. (18) as dashed lines. Note, however, that the calculation of involved several approximations where was assumed large. As a result, the accuracy of the prefactor is low, and we have dropped all non-exponential parts, while enforcing at the cross-over.

The derivation above is valid for . For smaller , the velocity drops below the high recombination limit, , and the fitness distribution in the population stops being Gaussian. In this regime, the velocity has to be calculated self-consistently by matching the rate of production of new extremely fit clones to the overall velocity of the population. This problem has been studied by Rouzine and Coffin (2005, 2010) in the context of HIV evolution.

A simplified version of the arguments of Rouzine and Coffin (2005) is given in Appendix D. With , we find

(19) |

This implicit equation for can either be solved numerically or, in the case of very small , iteratively. The result is similar to that obtained by (Rouzine and Coffin, 2005); differences are due to the difference in the model used. The numerical solution is compared to simulation data in Fig. 7B. Convergence with increasing is slow and Eq. (19) has a solution only for small , for which we have little data.

The velocity of the traveling wave falls below as soon as a few large clones dominate the population. In this case, not only but also are dominated by those large clones, and behaves differently. We show in the appendix that is of order 1 (see Eq. (57)). This is in agreement with Rouzine and Coffin (2005), who found that the typical number of large clones is .

### iii.4 Traveling wave solutions at intermediate heritabilities

In the high recombination limit, the population moves towards higher fitness with velocity regardless of any epistatic fitness contributions. However, we have seen above that a clonal structure can emerge even in the absence of epistasis if recombination rates are low enough. Intuition and the simulation in Fig. 4 show us that this clonal structure is more pronounced in the presence of epistasis and persists at high recombination rates.

The reason for this behavior is the fact that the fitness variance of the recombinant offspring is larger than the velocity . In this case, fit genotype grow above the traveling Gaussian envelope and generate macroscopic clones.

Fig. 4 shows that, at low recombination rate and heritability, the population is always dominated by a few large clones with high non-heritable fitness components, which produce a large number of (on average) non-fit offspring. Rarely, such offspring are very fit, and replace their predecessors. The probability that offspring are fitter than their parents, and with it the velocity, increases dramatically with and . Conversely, the size of the fit clones and the participation fraction decreases with and . Simulation results for and in the steady state are shown in Fig. 8 for different values of and . In the following, we will rationalize and quantify the observed behavior. To begin with, we will assume a constant velocity and calculate . Again, these calculations are detailed in Appendix C.

After successful establishment, a clone will grow approximately deterministically according to

(20) |

where is the sum of the additive fitness , measured relative to when the clone was born, and is the epistatic fitness. The advancing mean additive fitness is accounted for by , while is the mean epistatic fitness. The size of the clone peaks at age (where we have defined ) and then decreases until the clone disappears. The maximal size of the clone is . Since the fittest genotypes in the population have a fitness above the mean, they grow larger than if

(21) |

which gives us a first indication of the breakdown of the mean field solution, where . We confirm this again by calculating the integrals and in Appendix C; see Eq. (54).

Furthermore, we calculate in Appendix C and find that starts to be larger than as soon as

(22) |

These two conditions on define two critical recombination rates and at which different features of the mean field solution break down. Note, however, that this expression is not valid close to , since we have assumed a traveling wave with clones of finite lifetime.

Both of these two lines seem to play an important role for the behavior of the population. Between the two lines, we observe a coexistence between condensed populations and non-condensed populations, which results in large fluctuations of . An example trajectory of is shown in Fig. 9. For a limited amount of time, the population is condensed with large and doesn’t move in the additive direction. It then becomes unstuck and adapts for a while before getting stuck again. The time average of is dominated by these intermittent condensates in this meta-stable region and is larger than as soon as . Good simulation evidence for the different transition lines, however, is difficult to obtain because of substantial sub-leading corrections.

To investigate the model in absence of this stick-slip behavior, we simulated a variant of the model where additive fitness is drawn from a distribution that steadily moves towards higher fitness with velocity , as is assumed in the calculations. The time averaged is shown in Fig. 9B and compared to the predicted onset of condensation by Eq. (57) (black lines).

## Iv Discussion

### iv.1 Summary

Correlations between different parts of the genome are usually referred to as linkage disequilibrium, suggesting that due to genetic linkage, i.e., a high probability of coinheritance, the allele frequencies at different loci are not independent. Here, we have used a different measure to characterize correlations in the population. Instead of looking at correlations between individual loci, we have characterized the distribution of clone sizes, or equivalently haplotype frequencies, in adapting populations. Our analysis is not restricted to additive fitness functions, but we have also analyzed a simple model where fitness is partly heritable and partly random.

While macroscopic condensation, , sets in only for in the absence of epistasis, we observe condensation of the population at recombination rates of order in presence of epistasis. Here, is the standard deviation in fitness and sets the strength of selection. The reason for this behavior is the fact that the velocity at which the population adapts is smaller than the fitness variance of the recombinant offspring. The additional epistatic variance allows the seeding of new clones far ahead of the population mean, which is only slowly catching up. Hence fit clones can out-grow any traveling Gaussian. At the same time, condensed clones cause the average epistatic fitness to be significantly greater than 0. Since this epistatic fitness is lost upon outcrossing, the population has a tendency to partition into a few fit clones with high epistatic fitness and a large number of recombinant genotypes with random epistatic fitness. This co-existence between “condensed” and “dust” phases is seen in Fig. 4 in the panels on the right. As long as the heritability, i.e., the fraction of additive variance, is larger than zero, the population seeds new clones even at low recombination rates and the rate of coalescence will be given by times the characteristic turn-over rate of clones.

In the absence of any additive variance, the observed behavior is quite different. In this case, the fitness function is completely random (a.k.a. House-of-cards model (Kingman, 1978), or random epistasis/energy model (Derrida, 1981; Neher and Shraiman, 2009)). The only way the population adapts is by sampling fitter and fitter individuals from the same distribution. In other words, the population dynamics amounts to a record process where the total number of samples taken increases as . Records establish and grow with the rate . One additional complication that is of particular importance in the case is the fact that whenever a clone recombines with itself, it does not generate a novel genotype. This has the tendency to shut off recombination and stabilize clones as soon as they grow large, resulting in a rapid loss of genetic diversity. In a previous publication (Neher and Shraiman, 2009), some of us have studied the onset of condensation in a more descriptive manner. Here, we have extended this work by explicitly calculating , both during an initial transient as well as in a steady state where variance is maintained.

The model we have used is extremely simplistic, and one might wonder about its relation to real world populations. Nevertheless, it accounts for a number of features of real populations such as heritabilities between and , outcrossing, and mimics a large number of loci in the sense of quantitative genetics. These features give rise to qualitatively different dynamical regimes, which will also be observable in more realistic models. Some facultatively sexual populations are in fact remarkably close to this simple “ and ” model. Many plants and microbial populations are facultative outcrossers. In the event of outcrossing, a large number of crossovers on many chromosomes produces many independently inherited genomic segments. HIV, for example, recombines via template switching following coinfection at an outcrossing rate of a few percent (Neher and Leitner, 2010; Batorsky et al., 2011). In each of these outcrossing events, roughly 10 crossovers are observed (Levy et al., 2004). If populations are polymorphic at many loci, the resulting off-spring distribution is approximately described by Eq. (7). We have made a further simplification by assuming that the fitness of a recombinant offspring is independent of its parents and simply drawn from a comoving distribution. This assumption is expected to approximate recombination processes where the offspring and parent fitness decorrelate rapidly over a few out-crossing events, as for example in Eq. (7) (Neher et al., 2010). Note, however, that loci close together on the chromosome decorrelate only slowly.

The other dramatic simplification made above was the partitioning of fitness into additive and random components corresponding to high order epistatic interactions. More generally we expect to find epistatic interactions of various orders, which are heritable to different extents. However, the “ and ” model should not be thought of as a parameterization of the genotype-fitness map, but rather as a partitioning into variance that can be explained by the best fitting additive model, and the remaining variance (Fisher, 1930; Neher and Shraiman, 2011a). The best-fitting additive model is in general time dependent, and the heritability can change as the allele frequencies change (Turelli and Barton, 2006). We know rather little about the genetic architecture of fitness, which justifies the the use of such simple models. In specific scenarios where the genotype-phenotype map is known, more detailed modeling should be guided by the general conclusions drawn from the “ and ” model.

The variation that we assume is always present among the recombinant off-spring, it is ultimately fueled by mutations. The balance between the influx of beneficial mutations and selection in facultatively sexual population has been investigated in (Cohen et al., 2005b; Neher et al., 2010). Even in sexual populations, the dynamics of mutation can be strongly affected by the selection through the clonal structure of the population.

### iv.2 Why is important?

The participation ratio, , is exactly the probability for two individuals to be genetically identical. Therefore, it immediately gives a measure of the clonal structure of the population. If the two genotypes are identical, they have had a common ancestor in the recent past. Hence, is proportional to the rate of coalescence, and as soon as is no longer proportional to , coalescence is greatly accelerated relative to a neutral model. It is well known that selection accelerates coalescence since fit individuals have more offspring and dominate future generations. Recombination tends to reduce the effects of linked selection since it decouples different regions of the genome. We have calculated the rate of coalescence in our model, which is set by a balance between selection and recombination. We have shown that there is a critical recombination rate, where recombination is overwhelmed by selection and the population structure changes qualitatively.

In the case of additive fitness functions, we have found that , which is in agreement with earlier work (Rouzine and Coffin, 2010; Neher and Shraiman, 2011b). In this case, the population consists of clones apparent as little “bubbles” in the representation of Fig. 4. Any such bubble originates from a common ancestor generations in the past, and is the probability that two individuals belong to the same “bubble”. This bubble-coalescent is similar to ideas developed for structured populations (Nordborg, 1997) or the fitness class coalescent (Walczak et al., 2012). Note, however, that the clone-size distribution is very long tailed and the bubble coalescent is not in the universality class of the Kingman coalescent (Kingman, 1982), but possibly of Bolthausen-Sznitman type (Brunet et al., 2007; Neher and Shraiman, 2011b; Mohle and Sagitov, 2001).

Genetic identity between some of the individuals reduces the effect of outcrossing, since identical parents produce identical offspring. Since the probability of such an occurrence is equal to , the effective rate of recombination in the partly clonal population is . Hence, strictly speaking our calculations of deep in the clonal condensation regime must be taken through a self-consistency step, which replaces that was hereto an independent variable by a dependent variable . This however would not change our estimates for and , which are defined by the point of first emergence of clones (when rises above ). Going beyond the MFT description of recombination, one may define a participation ratio density in terms of which which picks up the fitness dependence of : individuals with relatively high fitness are much more likely to be clonally related.

The significance of is not limited to the case of exact genetic identity within clones. In particular, mutations would introduce additional polymorphic loci into the “clones” that were the focus of our study. Yet the genetic structure of the population introduced by clonal condensation still survives: one only needs to distinguish between high-frequency polymorphisms, which are being reshuffled by recombination as approximated by our model and the low frequency new polymorphisms due to recent (on the time scale of clonal growth) mutations. The latter would appear on the background that is clonal with respect to the high-frequency alleles. Thus the “clones” emerging at low should be thought of as haplotypes (McVean and Cardin, 2005) and the “clonal condensation” is the process that suppresses the number of haplotypes on small length scales.

The participation ratio can be readily generalized to allow for a degree of genetic distance within a pair of individuals. As currently defined (where stands for the genetic distance between and ). This is immediately recognizable as a special case of the Parisi order parameter (Mézard and Montanari, 2009). The latter therefore provides an interesting representation of the haplotypic structure of populations. It would be interesting to understand whether more realistic models of fitness landscapes (with low order, rather than random epistasis) would generate more complex hierarchical structure of than the simple “dust/clone” dichotomy found in our simply model. The relation between the REM and Sherrington-Kirkpatrick models of spin glasses (Sherrington and Kirkpatrick, 1975) could provide useful guidance and ultimately yield better understanding of haplotype distributions and recombinant coalescents (Hudson and Kaplan, 1995; Stephens et al., 2001; Myers and Griffiths, 2003).

### iv.3 Future Directions

The analysis of “clonal condensation” presented above can and should be extended in several ways. Within the confines of the model considered, one may want to obtain a better understanding of the “mixed phase” lying between the two transition lines identified in Fig. 8. This phase is characterized by large fluctuations in clone size distribution, and hence in , even in the approximation imposing a fixed traveling wave velocity . In reality, the population sets its own instantaneous rate of change of average fitness, which depends sensitively on the fitness of the leading clones and changes with time as new fitness “records” are established by fresh recombinants. We have already described in Fig. 9 the “stick-slip” dynamics, which is characteristic of the mixed phase regime. (Needless to say, the existence of the mixed phase region corresponds to the 1st order nature of the clonal condensation transition for . ) A fully quantitative description of this behavior would require going beyond MFT. So far, attempts to describe fluctuations in the dynamics of adaptation have been few and far apart (Brunet et al., 2007; Hallatschek, 2011; Neher and Shraiman, 2012): a quantitative description of the “stick-slip” progress of adaptation would represent a major step forward.

Another necessary extension of the model involves mutational influx. A non-zero mutation rate would provide an influx of genetic variation and define true statistically stationary states corresponding to adaptive traveling waves (Tsimring et al., 1996; Rouzine et al., 2003; Desai and Fisher, 2007; Cohen et al., 2005a; Neher et al., 2010) or, in the presence of both deleterious and beneficial mutations, to a dynamic mutation selection balance (Goyal et al., 2011). In that case, emergent clones become “fuzzy” as they accumulate mutations, and the participation ratio should be replaced by a more general Parisi order parameter as explained above. The result should provide interesting quantitative insight into the expected genetic structure of facultatively sexual populations.

Perhaps the most interesting and important extension of the present work would involve a generalization of the model and its analysis to linear genomes and obligatory sexual populations. In contrast to the current model where recombination freely re-assorts the loci, a more realistic linear chromosome model would describe recombination in terms of relatively infrequent crossover. This would naturally tie the frequency of recombination to the length of the segment considered. We expect that on sufficiently short scales, where recombination is infrequent, a tendency for haplotype condensation would be manifest if the population is diverse enough. Whether epistasis plays a significant role in the condensation process will depend on the distribution of epistatic interactions along the genome (Neher and Shraiman, 2011a). If there is a strong tendency of mutations to interact with mutations nearby (Callahan et al., 2011), the heritability decreases as smaller and smaller segments are considered. This could result in condensation of mutations into “super-alleles”. However, the embedding of the haplotype in question into a larger genome gives rise to complications related to Hill-Robertson interference (Hill and Robertson, 1966; Hudson and Kaplan, 1995; Barton, 1994). Transient associations with other genomic regions will either boost (the hitch-hiking process) or suppress (background selection) the spread of a haplotype in the population. This reduces the efficacy of selection and gives rise to a stochastic dynamics with rather different properties than genetic drift (Neher and Shraiman, 2011b). Bridging the different scales and resulting dynamical regimes represent an important challenge for the future.

### iv.4 Conclusion.

In conclusion, we stress the distinction between the QLE and “clonal condensation” regimes. In the QLE regime, recombination is sufficiently rapid to overwhelm any clonal amplification due to selection, and correlations between alleles at different loci are relatively weak. In this regime, the correlations between loci are well described by linkage disequilibrium, which measures population averaged pair correlation of loci. By contrast, clonal condensation is a non-perturbative, large deviation from linkage equilibrium (under which loci are completely uncorrelated), which in particular results in a stratification of the population depending on its fitness: clones appear predominantly in the upper reaches of the fitness distribution. Strong heterogeneity among individuals along the fitness axis is not captured by measuring linkage disequilibrium and other traditional approaches. Understanding strong interactions in multi-locus systems requires new ideas and tools. We have found simple models such as the REM to be a very useful source of insight into these non-trivial aspects of population genetics.

###### Acknowledgements.

We are grateful for stimulating discussions with H. Teotonio, A. Dayarian, I. Rouzine, D. Fisher, M. Desai and M. Lynch and would like to thank T. Kessinger for careful reading of the manuscript. RAN is supported by an ERC-starting grant 260686 and BIS acknowledges support of the HFSP RFG0045/2010.## Appendix A Participation ratio for Clonal Condensation.

Following its definition for the REM we define the participation ratio for a set of clones as the sum of squared frequencies

(23) |

where

(24) |

describes the size of clone created a time in the past with fitness . Due to stochastic effects, most clones go extinct early on, and the fraction that remains is on average larger by a factor of . We will ignore this correction and reinstantiate it later when comparing to simulations. The rate of growth of the clone is controlled by the “replication rate”: the fitness relative to the time dependent average fitness of the population minus the recombination rate. The average over clones implies averaging over and . It can be decomposed into the average over the initial population (of individuals present at with ) and the average over subsequent recombinants that are generated via Poisson process with rate . Hence

(25) | |||||

(27) | |||||

(28) |

where we have defined

(29) |

which expresses the average over the initial population and

(30) |

which gives the contribution of all recombinants. If we further define

(31) |

(32) |

we arrive at the following expression for the participation fraction:

(33) |

At sufficiently long times, , the population is dominated by new recombinants so that and may be neglected in comparison with and .

The growth and decay of clones, Eq. (24), depends critically on the dynamics of the mean fitness, which in turn depends on the heritability of fitness. We have discussed the different cases ( and intermediate heritabilites) in the main text and will present the calculations pertaining to them in separate appendices below. To streamline the calculation, we will rescale all rates with (, , , ) and multiply times with (, ). Consequently, now equals . We will reinstantiate in the final results to facilitate comparison with formulae in the main text. In addition, the following integrals will prove useful:

(34) | |||||

where the latter holds only for .

## Appendix B Participation ratio for the fully epistatic case

The fully epistatic case with recombination or mutation is the closest relative of the REM in population genetics, known in the population genetics literature as the “house-of-cards” model (Kingman, 1978). In this model, every new genotype is drawn from the same distribution which we will take to be the standard normal distribution, that is, fitness is non-heritable.

Let us consider so that enough recombinants are produced to dominate over the initial set of genotypes and evaluate and . Furthermore, let us assume that the mean fitness is constant (we will revisit this assumption later). The size of a clone with age is then simply , and is given by

(36) |

Our strategy in approximating this integral will be to identify the region where and expand the exponential in that region; outside that region we shall neglect the exponential compared to (in the square bracket). Defining , we have

(37) | |||||

(38) | |||||

(39) | |||||

(40) | |||||

(41) |

where and we have along the way assumed that , which is realized for . As we shall see below, the dominant contribution to comes from the region , so that in the low regime () we have . At the end of the calculation we shall check that condition is satisfied.

At short times the second term in our expression for is clearly smaller than the first, so that the integration in the expression for is controlled by the factor, which set the scale for at . Let us next estimate the time for which the second term in the expression for becomes larger than the first provided that . We obtain

(42) |

Let us define and ; then

(43) |

For this can only be satisfied if , which diverges in the large limit. Below , but still close to it, we can approximate the crossover , and the crossover condition becomes

(44) |

which is an approximation valid for .

It remains to calculate the participation ratio as a function of asymptotically for and for . To do so, we need to evaluate the -integral and perform integration in the integral representation for . The -integral can be obtained by differentiating twice with respect to , which evaluates approximately to

(45) |

Hence