# Adaptive walks and extreme value theory

###### Abstract

We study biological evolution in a high-dimensional genotype space in the regime of rare mutations and strong selection. The population performs an uphill walk which terminates at local fitness maxima. Assigning fitness randomly to genotypes, we show that the mean walk length is logarithmic in the number of initially available beneficial mutations, with a prefactor determined by the tail of the fitness distribution. This result is derived analytically in a simplified setting where the mutational neighborhood is fixed during the adaptive process, and confirmed by numerical simulations.

###### pacs:

87.23.Kg,05.40.-a,02.50.-rThe adaptation of a population to a novel environment is a fundamental process of evolutionary biology which continues to attract considerable attention from theoretical Orr2005 () as well as experimental Elena2003 () perspectives. Adaptation is driven by the occurrence of mutations that are beneficial in the new environment and therefore spread in the population, leading to an increase of fitness over time. This process displays a variety of dynamical patterns Park2010 () that depend on the supply of beneficial mutations (governed by the product of population size and mutation rate ) as well as on the structure of the fitness landscape, which encodes how the genetic configuration of an organism (its genotype) affects the number of offspring it will leave in the next generation.

A particularly simple, yet biologically relevant limit of adpative dynamics is the regime of strong selection and weak mutation (SSWM), where mutations are sufficiently rare to be treated as independent events, , and selection is strong enough for deleterious mutations (which decrease fitness) to be unable to spread Gillespie1983 (); Gillespie1984 (); Orr2002 (). In the SSWM regime the population is genetically homogeneous most of the time, and its dynamics can be described by a point in the space of genotypes which performs an adaptive walk towards higher fitness. Because of the low mutation rate such a walk is constrained to move by single mutational steps, and it terminates when a local fitness maximum is reached, where no nearest neighbor genotypes are available that would confer higher fitness. Despite its strongly simplified nature, the adaptive walk model is in principle amenable to quantitative tests in microbial evolution experiments Rokyta2005 (); Schoustra2009 (); Rokyta2009 (); Miller2011 ().

In the present Letter we study the length of such adaptive walks in a simple model of a rugged fitness landscape, where fitness values of genotypes are assumed to be independent random variables drawn from a common probability density . The genotype space is a generalized hypercube formed by sequences of letters drawn from an alphabet of size , such that each genotype has single mutant neighbors Kauffman1987 (). The walk is then specified by the transition probability from genotype to a neighboring genotype of higher fitness, . In the SSWM regime is proportional to the fixation probability of the corresponding beneficial mutation, i.e. the probability that it will become dominant rather than going extinct due to demographic fluctuations Kimura1962 (); Patwa2008 (). When the fitness difference between the initial and final genotype is small in absolute terms, , while still maintaining the strong selection condition , , the fixation probability is proportional to , and normalization leads to the expression Gillespie1983 (); Gillespie1984 (); Orr2002 ()

(1) |

After the transition the population has fitness and encounters a new set of random fitness values (apart from the fitness of the preceding genotype, which is however inaccessible because ).

Assuming that fitter neighboring genotypes are available at the starting point of the adaptive walk, we ask for the mean number of steps that are required to reach a local fitness maximum. Since most mutations available to a viable genotype are expected to be deleterious or neutral Eyre2007 (), we are mainly interested in the behavior of when . Simplified variants of this problem have been considered in previous work. In the random adaptive walk the dependence of the transition probability on fitness differences is ignored, and all available fitter neighbors are chosen with equal probability, which leads to with Kauffman1987 (); Macken1989 (); Flyvbjerg1992 (). On the other hand, for greedy walks which always move to the neighboring genotype of highest fitness, the walk length remains finite for and attains a limiting value of Orr2003 ().

For the full problem defined by the fitness-dependent transition probability (1) we show below that the asymptotic behavior of the mean walk length is generally logarithmic, with a coefficient that depends on the form of the tail of the fitness distribution . According to extreme value theory (EVT), the tail can be represented by the generalized Pareto form Pickands1975 (); Beisel2007 (); Joyce2008 (); Note1 ()

(2) |

where the shape parameter serves to distinguish between the different universality classes of EVT deHaan (). For the density (2) is defined for all and decays as a power law, representing the Fréchet class of EVT, whereas for its support is restricted to the interval and the distribution belongs to the Weibull class. The Gumbel class, comprising distributions of unbounded support that decay faster than a power law, is recovered in the limit . In previous work Joyce2008 () it has been shown that the adaptive walk with fitness distribution (2) reduces to the random (greedy) limit for (). For the density (2) develops a -function singularity at the upper boundary of its support, which implies that all available mutants have the same fitness and (1) reduces to a random choice. On the other hand, for the density (2) becomes extremely broad, such that the fitness of the most fit mutant in a neighborhood is typically much larger than all other fitness values and (1) reduces to the greedy rule.

In terms of the parametrization (2), our main result for the mean walk length reads

(3) |

This expression recovers the random limit () for , and shows that the greedy limit () is attained at , where the density (2) ceases to have a finite first moment. The result for the Gumbel class was previously obtained numerically by Orr Orr2002 () (see below), and analytically by Jain and Seetharaman Jain2011 () using an approach along the lines of Flyvbjerg1992 (). Surprisingly, the expression (3) also appears in the context of a completely different evolution model of quasispecies type, which applies in the limit of infinite populations Krug2003 (); Jain2005 (); Sire2006 (). The reason for this coincidence will be discussed at the end of the paper.

The Gillespie approximation. Our analysis is based on an approximation first introduced by Gillespie Gillespie1983 (). The key idea is to ignore the change in available fitness values that occurs after a jump of the adaptive walk, which implies that the entire adaptive process proceeds in a single, fixed neighborhood (Fig. 1). The expected length of the walk is then equal to the first passage time (or absorption time) of the Markov chain defined by the transition probability (1) for a fixed set of fitness values . For the following discussion it will be convenient to label the fitness values by their rank, such that . The mean absorption time to the final state of maximal fitness , starting from fitness rank , is then given by Gillespie1983 ()

(4) |

where is the th harmonic number, and

(5) |

with and fitness gaps . Because fitness only increases during the process, the absorption time is obviously independent of the fitness values above the starting rank.

Within the Gillespie approximation, the adaptive walk length is obtained by averaging the absorption time (4) with respect to the fitness distribution . Gillespie observed that the problem simplifies significantly if is assumed to fall into the Gumbel universality class of EVT. Taking the limit at fixed , the superior fitness values lie in the tail of the distribution, and it is known that the scaled fitness ranks converge to independent, identically distributed exponential random variables deHaan (). It then follows by symmetry that the average ratios in (4) are , and evaluation of the sum yields the simple result Gillespie1983 (); Orr2002 () , where denotes Euler’s constant. Simulations of the full problem show that the mean walk length differs from this approximate result only by an offset in the constant correction term, which is given by Orr2002 (). A similar calculation for the model with random choice of fitter neighbors yields a mean absorption time of Orr2002 (), which again differs from the mean walk length of the full model Macken1989 (); Flyvbjerg1992 () (quoted above) only by a small shift in the constant term. We will show below that the close agreement between the Gillespie approximation and the full model extends to general fitness distributions, and provide a qualitative explanation for this behavior.

General fitness distributions. We now turn to the approximate evaluation of the absorption time (4) for the other EVT classes. As a representative of the Fréchet class we choose the Pareto distribution , , which is a shifted and rescaled version of (2) with . A straightforward calculation shows that the expected value of the th out of fitness values is given by

(6) |

for . To estimate the fitness gap we take the derivative with respect to Note2 (), . Approximating the sum in (5) by an integral we then find , and hence . Inserting this into (4) and replacing sums by integrals we see that the first sum converges to a constant for , while the second, double sum diverges logarithmically as . Thus to leading order we find , which is identical to (3) with .

The calculation for the Weibull class of distributions with bounded support is similar. We consider distributions on the unit interval of the form with , corresponding to (2) with . The mean of the th out of values drawn from this distribution is given by for , and along the same lines of reasoning used previously we find that . Again, this implies that the first sum on the right hand side of (4) converges, whereas the second double sum diverges logarithmically, leading finally to , in agreement with (3). The result for the uniform distribution () was also obtained in Jain2011 ().

Simulations. Next we compare the prediction (3) to simulations, using both the full adaptive walk model and the simplified Gillespie model in a fixed mutational neighborhood. In the simulations of the full model, we avoided an explicit representation of the genotype space by creating the fitness values encountered during the walk ’on the fly’. This ignores the possibility of the same genotype being encountered more than once during the walk, which is however negligible for large Flyvbjerg1992 (). The total size of the neighborhood was in all cases, the starting rank was varied from to in factors of 2, and results were averaged over 1000 independent realizations. As can be seen in Fig. 2, the asymptotic prediction (3) is well satisfied in both kinds of simulations.

To rationalize the observed close agreement between the Gillespie approximation and the full adaptive walk, we analyze the effect that the two processes involved in a single step of the walk have on the rank of the current genotype (Fig. 1). In the first process, the choice of a fitter neighbor according to the transition probability , the rank of the genotype changes by an amount that is proportional to the initial rank; to be specific, the expected new rank conditioned on the original rank is given by for Joyce2008 (). The change of rank due to the subsequent change of the mutational neighborhood (which is omitted in the Gillespie approximation) can be deduced from the classic analysis of the number of exceedances Rokyta2006 (); Gumbel1950 (), which shows that the expected new rank conditioned on the old rank is , with a variance of order . Thus for the change in rank due to the change in neighborhood is a small perturbation (of relative size ) of the change that occurs in the first process, which explains the quantitative accuracy of the Gillespie approximation. The fact that the change of neighborhood on average increases the rank is consistent with the numerical observation that the adaptive walks in the full model are always slightly longer than in the Gillespie approximation (Fig. 2).

Relation to quasispecies models. The quasispecies approach to evolution assumes very large populations, , such that demographic fluctuations are absent and the adaptive process is completely deterministic Jain2007a (). In an uncorrelated random fitness landscape the most populated genotype then performs a kind of ‘adaptive flight’, which is essentially constrained to move between local fitness maxima and terminates only when the global fitness maximum is reached Krug2003 (); Jain2005 (). In the simple case of a one-dimensional genotype space, the length of such an adaptive flight depends logarithmically on the number of genotypes with a prefactor given precisely by the expression in (3), a behavior that was first observed numerically Krug2003 () and subsequently derived analytically in Sire2006 (). The formal relation to the adaptive walk problem can be traced back to the fact that the transition probability of the adaptive flight, which describes the rate at which the most populated genotype jumps from one fitness peak to the next, depends linearly on the fitness difference between the two peaks in the same way as the fixation probability (1) Sire2006 (). This structure also appears in the analysis of the collision statistics of a one–dimensional gas with quenched random velocities Bena2007 ().

Employing a completely different mathematical approach, Sire et al. Sire2006 () computed the mean length of the adaptive flights as well as the corresponding variance (see also Bena2007 ()). Using their result one finds that the index of dispersion (defined as the ratio of the variance to the mean) depends on the EVT parameter according to the simple expression , which takes its minimal value for the Gumbel class () and approaches unity for as well as for . This formula reproduces the results obtained in Jain2011 () for and , and we have checked numerically that it applies to the full adaptive walks problem for general . Thus, while the walk length has a Poisson distribution in the case of random dynamics Flyvbjerg1992 (), in general the fluctuations are sub-Poissonian.

Conclusions. We have analyzed a simple, paradigmatic model for the evolution of populations subject to rare mutations and strong selection, and derived a precise asymptotic relation between the length of adaptive walks and the tail of the underlying fitness distribution. While the predicted asymptotics may be difficult to observe in experiments, the EVT shape parameter can be estimated experimentally Beisel2007 (), and examples with Kassen2006 (), Rokyta2008 () and Schenk () have been identified.

An important restriction of our model is the assumption that fitness values of different genotypes are uncorrelated. Indeed, a recent study comparing the distributions of beneficial fitness effects encountered during the first and second steps of an adaptive walk found strong evidence for fitness correlations between neighboring genotypes Miller2011 (). Such correlations are likely to significantly affect the results presented here, and will be addressed in the future.

###### Acknowledgements.

This work was supported by DFG within SFB 680 and the Bonn Cologne Graduate School of Physics and Astronomy. We thank Kavita Jain and Henrik Flyvbjerg for useful correspondence.## References

- (1) H.A. Orr, Nat. Rev. Genet. 6, 119 (2005).
- (2) S.F. Elena, R.E. Lenski, Nat. Rev. Genet. 4, 457 (2003).
- (3) S.-C. Park, D. Simon and J. Krug, J. Stat. Phys. 138, 381 (2010).
- (4) J.H. Gillespie, Theor. Pop. Biol. 23, 202 (1983)
- (5) J.H. Gillespie, Evolution 38, 1116 (1984).
- (6) H.A. Orr, Evolution 56, 1317 (2002).
- (7) D.R. Rokyta, P. Joyce, S.B. Caudle and H.A. Wichman, Nat. Genet. 37, 441 (2005).
- (8) S.J. Schoustra, T. Bataillon, D.R. Gifford and R. Kassen, PLoS Biology 7, e1000250 (2009).
- (9) D.R. Rokyta, Z. Abdo and H.A. Wichman, J. Mol. Evol. 69, 229 (2009).
- (10) C.R. Miller, P. Joyce and H.A. Wichman, Genetics 187, 185 (2011).
- (11) S.A. Kauffman and S. Levin, J. Theor. Biol. 128, 11 (1987).
- (12) M. Kimura, Genetics 47, 713 (1962).
- (13) Z. Patwa and L.M. Wahl, J. R. Soc. Interface 5, 1279 (2008).
- (14) A. Eyre-Walker and P.D. Keightley, Nat. Rev. Genet. 8, 610 (2007).
- (15) C.A. Macken and A.S. Perelson, Proc. Natl. Acad. Sci. USA 86, 6191 (1989).
- (16) H. Flyvbjerg and B. Lautrup, Phys. Rev. A 46, 6714 (1992).
- (17) H.A. Orr, J. Theor. Biol. 220, 241 (2003).
- (18) J. Pickands III, Ann. Stat. 3, 119 (1975).
- (19) C.J. Beisel, D.R. Rokyta, H.A. Wichman and P. Joyce, Genetics 176, 2441 (2007).
- (20) P. Joyce, D.R. Rokyta, C.J. Beisel and H.A. Orr, Genetics 180, 1627 (2008).
- (21) An overall fitness scale is omitted in (2), since the transition probability (1) is independent of scale.
- (22) L. de Haan and A. Ferreira, Extreme Value Theory (Springer, Berlin 2006)
- (23) K. Jain and S. Seetharaman, preprint arXiv:1104.5583.
- (24) J. Krug and C. Karl, Physica A 318, 137 (2003).
- (25) K. Jain and J. Krug, J. Stat. Mech. P04008 (2005).
- (26) C. Sire, S.N. Majumdar and D.S. Dean, J. Stat. Mech. L07001 (2006).
- (27) These approximations can be corroborated by a more careful analysis, which will be presented elsewhere.
- (28) D.R. Rokyta, C.J. Beisel and P. Joyce, J. Theor. Biol. 243, 114 (2006).
- (29) E.J. Gumbel and H. von Schelling, Ann. Math. Stat. 21, 247 (1950).
- (30) K. Jain and J. Krug, in Structural approaches to sequence evolution, ed. by U. Bastolla, M. Porto, H.E. Romano and M. Vendruscolo (Springer, Berlin 2007), p.299.
- (31) I. Bena and S.N. Majumdar, Phys. Rev. E 75, 051103 (2007).
- (32) R. Kassen and T. Bataillon, Nat. Genet. 38 484 (2006).
- (33) D.R. Rokyta et al., J. Mol. Evol. 67, 368 (2008).
- (34) M.F. Schenk, I.G. Szendro, J. Krug and J.A.G.M. de Visser (unpublished).