Greedy adaptive walks on a correlated fitness landscape
We study adaptation of a haploid asexual population on a fitness landscape defined over binary genotype sequences of length $L$. We consider greedy adaptive walks in which the population moves to the fittest among all single mutant neighbors of the current genotype until a local fitness maximum is reached. The landscape is of the rough mount Fuji type, which means that the fitness value assigned to a sequence is the sum of a random and a deterministic component. The random components are independent and identically distributed random variables, and the deterministic component varies linearly with the distance to a reference sequence. The deterministic fitness gradient $c$ is a parameter that interpolates between the limits of an uncorrelated random landscape ($c = 0$) and an effectively additive landscape ($c \to \infty$). When the random fitness component is chosen from the Gumbel distribution, explicit expressions for the distribution of the number of steps taken by the greedy walk are obtained, and it is shown that the walk length varies non-monotonically with the strength of the fitness gradient when the starting point is sufficiently close to the reference sequence. Asymptotic results for general distributions of the random fitness component are obtained using extreme value theory, and it is found that the walk length attains a non-trivial limit for $L \to \infty$, different from its values for $c = 0$ and $c = \infty$, if $c$ is scaled with $L$ in an appropriate combination.
Keywords: Adaptation, Genotype space, Extreme value theory, Asexual population
Corresponding author. Tel: +82-2-2164-4524, Fax: +82-2-2164-4764
Ever since the concept of the fitness landscape was introduced by Sewall Wright (1932), it has played a central role in evolutionary biology (de Visser and Krug, 2014). Among the different variants of the concept used in the literature, we here restrict ourselves to fitness landscapes that map the genotype space into the real numbers by assigning a fitness value to every genotype. With this definition, the fitness landscape provides an intuitive picture of evolution as a hill-climbing process. A convenient choice for the genotype space is the $L$-dimensional hypercube $\{0,1\}^L$, which contains all binary sequences of length $L$. Rather than specifying the genome on the level of DNA base pairs, the binary sequences keep track of the presence or absence of mutations compared to a wild-type genome, or (in a more coarse-grained representation) the presence or absence of entire genes.
In addition to the underlying fitness landscape, the dynamics of adaptation is governed by the population size $N$ and the mutation rate per genome $U$, both of which are to be compared to the scale of fitness differences summarized in a typical selection coefficient $s$. In the strong selection / weak mutation (SSWM) regime characterized by the conditions $NU \ll 1$ and $Ns \gg 1$, the population is monomorphic for most of the time, and the adaptive process is guided by the landscape structure in a simple way (Gillespie, 1983, 1984). If a mutation to a fitter genotype occurs it has a nonzero probability of fixing in the population, whereas a mutation to a sequence with lower fitness is certain to go extinct. The low mutation rate makes it very unlikely for double mutations to occur. Accordingly, in this regime the population behaves as a point in sequence space that moves uphill in the fitness landscape by single mutational steps, a process referred to as an adaptive walk (Gillespie, 1983, 1984; Kauffman and Levin, 1987). An obvious feature of adaptive walks is that they end on a fitness maximum, that is, a genotype without fitter one-mutant neighbors. This makes the walk length, the number of steps until a maximum is reached, a property of interest.
A simplified version of the adaptive walk problem, in which the effect of mutant fitness on the fixation probability of beneficial mutations is neglected and any neighboring genotype of higher fitness can fix with equal probability, was studied by Macken and Perelson (1989) and Flyvbjerg and Lautrup (1992). For rugged landscapes without fitness correlations the mean number of steps of such ‘random’ adaptive walks was found to be of the order of $\ln L$. When the effect of the fixation probability is incorporated, the mean walk length is still logarithmic in the number of loci, but the coefficient of $\ln L$ becomes dependent on the distribution of fitness values (Gillespie, 1983; Orr, 2002; Neidhart and Krug, 2011; Jain, 2011; Seetharaman and Jain, 2011). If the infinite $L$ limit is taken, the walks no longer terminate and adaptation can be studied through the unbounded increase of the mean fitness of the population (Park and Krug, 2008).
When the population size is increased beyond the SSWM regime, the number of segregating sites becomes larger than two. In asexual populations this implies that two beneficial mutations compete with each other for fixation and the one with the larger fitness will be fixed preferentially. This phenomenon is connected to the Hill-Robertson effect (Hill and Robertson, 1966) and is commonly known as clonal interference (Gerrish and Lenski, 1998; Wilke, 2004; Park and Krug, 2007; Desai and Fisher, 2007; Park et al., 2010). A rough criterion for the clonal interference regime is provided by the condition . If we denote the mean fixation time in this regime by (which depends on , , and ), almost all beneficial single mutant neighbors of the most populated genotype will be present during the fixation process if . To model this regime by an adaptive walk, we use a deterministic rule for the next step: the walker chooses the genotype with the largest fitness among the sequences that are one mutation away. This kind of adaptive walk was called a ‘perfect’ or ‘gradient’ adaptive walk by Orr (2002, 2003), but here we follow Kauffman and Levin (1987) in referring to it as a greedy walk. Orr (2003) calculated the length of a greedy adaptive walk on an uncorrelated fitness landscape using an order statistics approach that is independent of the fitness distribution, provided it is continuous. In the limit the mean walk length is given by , which was suggested to be a lower bound on the mean number of steps for any adaptive walk [see also Rosenberg (2005)]. Note that for this description to faithfully represent adaptation under strong clonal interference, the mutation rate has to be small enough such that the creation of double mutants can be neglected (Szendro et al., 2013a).
The studies of adaptive walks mentioned above were based on the assumption of an uncorrelated random fitness landscape with maximal ruggedness, which is not supported by empirical evidence (Miller et al., 2011; Szendro et al., 2013b; de Visser and Krug, 2014). The effect of fitness correlations on adaptive walks has so far been addressed mostly in the context of ‘block model’ landscapes in which the genotype is subdivided into independent modules, each of which is assigned a random fitness, and the mean walk length is additive over modules (Perelson and Macken, 1995; Orr, 2006; Seetharaman and Jain, 2014; Nowak and Krug, 2015). Here we consider greedy adaptive walks on another class of tunably rugged fitness landscapes, the rough mount Fuji (RMF) model, which was originally introduced in the context of protein evolution (Aita et al., 2000). In the RMF model an uncorrelated random fitness landscape is superimposed on a linear fitness gradient, and the slope of this gradient serves as a tuning parameter controlling the ruggedness of the landscape.
The RMF model has recently been found to provide a convenient parametrization of many empirical fitness data sets (Franke et al., 2011; Szendro et al., 2013b; Neidhart et al., 2013), while at the same time allowing for detailed mathematical analysis of a wide range of landscape properties (Neidhart et al., 2014; Park et al., 2015). Of particular interest for our work are the results on the existence of selectively accessible mutational pathways, defined here as pathways to the global fitness maximum along which fitness increases monotonically and which are moreover directed, in the sense that the distance to the global optimum decreases in each step (Weinreich et al., 2005; Franke et al., 2011). Hegarty and Martinsson (2014) have shown that such pathways exist in the RMF model with a probability approaching unity for , whereas this probability tends to zero for uncorrelated landscapes. A population following a directed accessible pathway would perform an adaptive walk of steps, much longer than the walks on uncorrelated landscapes. However, the biological significance of accessible paths is not evident, because an evolving population may not find them even if they exist (Szendro et al., 2013a; Park et al., 2015).
In this paper, we study greedy adaptive walks on the RMF fitness landscape, focusing on the mean number of steps when $L$ is very large. For a specific choice of the distribution of the random fitness component in the RMF model we obtain an analytic solution for the full distribution of walk lengths and show that it attains a non-degenerate limit for $L \to \infty$, similar to Orr’s analysis of the uncorrelated case (Orr, 2003). We also consider the dependence of the walk length on the distance of the initial genotype from the reference state, and show that in a range of distances the walk length varies non-monotonically with the strength of the fitness gradient.
Arbitrary distributions of the random fitness component can be treated in the limit $L \to \infty$ by exploiting the convergence of the maximum of random variables to one of the universal distributions of extreme value theory (EVT) (de Haan and Ferreira, 2006). The EVT approach to adaptation was pioneered by Gillespie (1984) and Orr (2002) and has meanwhile become an established conceptual framework for organizing and quantifying the relation between the distribution of mutational effects and the corresponding adaptive behavior (Joyce et al., 2008; Orr, 2010; Rokyta et al., 2008; Schenk et al., 2012; Bank et al., 2014). Similar to the analysis of fitness landscape properties for the RMF model presented by Neidhart et al. (2014), we find that the behavior of the walk length is governed by the interplay between the ruggedness parameter and the tail properties of the distribution of the random fitness component. Specifically, if the tail of the distribution is fatter than exponential, the walk length reverts to the behavior found by Orr for uncorrelated landscapes for any fixed value of the fitness gradient. On the other hand, for tails thinner than exponential the effective strength of the fitness gradient increases without bound with increasing $L$, such that the greedy walks traverse the entire landscape with high probability for $L \to \infty$. A non-trivial limit of the walk length is attained only when $c$ and $L$ are scaled together in a particular combination.
The RMF fitness landscape is constructed from an additive ‘mount Fuji’ fitness landscape by adding an independent and identically distributed (i.i.d.) random variable to the fitness of every genotype. By we denote a binary sequence of length which represents the genotype. In particular, we will call the sequence the reference sequence which has the largest fitness in the purely additive landscape. Its antipodal point on the hypercube, the sequence with all elements 0, will be denoted by . The fitness of a sequence in the RMF fitness landscape is then assigned as
where is the Hamming distance between and the reference sequence , is a positive real number, and are i.i.d. random variables with probability density and cumulative distribution function , defined as
The definition (1) should be interpreted in the Malthusian sense, where fitness values can be positive or negative. What Hegarty and Martinsson (2014) proved is that for $c > 0$ in the limit $L \to \infty$ there is almost surely a directed path from the antipode to the reference sequence along which fitness is monotonically increasing, irrespective of the actual form of the distribution of the random component, whereas for $c = 0$ such paths almost surely do not exist.
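The construction above can be made concrete with a minimal simulation sketch. It assumes that the fitness is the random component minus the gradient times the Hamming distance to the reference sequence, that the random component is Gumbel distributed, and that the reference sequence is the all-ones genotype. The names `gumbel` and `greedy_walk_length` are illustrative and do not come from the paper, whose own simulation method may differ. Fitness values are drawn lazily and cached, so that only genotypes actually visited or inspected are generated.

```python
import math
import random


def gumbel(rng):
    """Draw one standard Gumbel variate by inverse-transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))


def greedy_walk_length(L, c, start, rng):
    """Number of steps of one greedy walk on a fresh RMF landscape.

    Genotypes are integers used as bit masks of length L. The reference
    sequence is assumed to be the all-ones mask, so its antipode is 0.
    """
    cache = {}  # fitness values are drawn on demand and then remembered

    def fitness(g):
        if g not in cache:
            distance = L - bin(g).count("1")  # Hamming distance to reference
            cache[g] = -c * distance + gumbel(rng)
        return cache[g]

    current, steps = start, 0
    while True:
        # Fittest of the L single-mutant neighbors of the current genotype.
        best = max((current ^ (1 << i) for i in range(L)), key=fitness)
        if fitness(best) <= fitness(current):
            return steps  # the current genotype is a local fitness maximum
        current, steps = best, steps + 1
```

With a gradient that dominates the random component, for example `greedy_walk_length(12, 1000.0, 0, random.Random(1))`, the walk started at the antipode takes exactly 12 steps, since every step must be uphill and the reference sequence is then a maximum.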
Since we are interested in greedy walks, the statistics of the maximal value among groups of i.i.d. random variables will play an important role. For this reason we introduce the probability that the largest value among () i.i.d. ’s is smaller than , which is
with the corresponding density
The reason for considering variables rather than variables will become clear in Sec. 3.
As has been noted previously (Franke et al., 2010, 2011; Neidhart et al., 2014), many properties of the RMF model take on a particularly simple form when the random variables are drawn from the Gumbel distribution , and we will adopt this choice in Sec. 3. For the Gumbel distribution, and become
The Gumbel distribution is one of the three universal limiting distributions that arise in extreme value theory (de Haan and Ferreira, 2006), and we will exploit this connection in Sec. 4 where we study the properties of greedy adaptive walks for general choices of the distribution .
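The simplification that the Gumbel choice brings can be checked numerically. The sketch below assumes the standard Gumbel distribution function $\exp(-e^{-x})$ and uses illustrative function names that are not taken from the paper. It evaluates the distribution and density of the largest of $n$ i.i.d. variables and shows that, for the Gumbel case, they coincide with the original distribution and density shifted by $\ln n$.

```python
import math


def gumbel_cdf(x):
    """Standard Gumbel distribution function exp(-exp(-x))."""
    return math.exp(-math.exp(-x))


def gumbel_pdf(x):
    """Standard Gumbel density exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))


def max_cdf(x, n):
    """P(largest of n i.i.d. Gumbel variables < x) = F(x)**n."""
    return gumbel_cdf(x) ** n


def max_pdf(x, n):
    """Density of the maximum: n * F(x)**(n - 1) * f(x)."""
    return n * gumbel_cdf(x) ** (n - 1) * gumbel_pdf(x)


# The maximum of n Gumbel variables is again Gumbel, shifted by ln(n).
for n in (1, 5, 100):
    for x in (-1.0, 0.0, 2.5):
        shifted = x - math.log(n)
        assert math.isclose(max_cdf(x, n), gumbel_cdf(shifted), rel_tol=1e-9)
        assert math.isclose(max_pdf(x, n), gumbel_pdf(shifted), rel_tol=1e-9)
```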
3 Gumbel-distributed random fitness component
3.1 Greedy walks starting from the antipodal sequence
Our analysis begins with the greedy walk starting from the antipodal sequence . As mentioned before, the probability that at least one accessible path from to exists converges to unity as for any finite (Hegarty and Martinsson, 2014). If the greedy walker takes such a path with probability of , the mean number of steps will be . On the other hand, the RMF with is identical to the uncorrelated rugged landscape or the House-of-Cards model (Kingman, 1978) and the mean number of steps of greedy walks is in the limit of infinite (Orr, 2003). Thus the first question to address is whether the greedy walk length remains finite for when .
To find the mean walk distance, we consider the probability that the walker takes at least steps. For convenience, we denote the sequence at the -th step by with . The fitness of is the largest among the single mutant neighbors of . To find , we make the assumption that is a decreasing function in , that is, the walker only takes steps in the direction towards the reference sequence , referred to as the uphill direction in the following. This assumption is plausible if , because a downhill step is possible only if the largest among the random fitness components of the downhill neighbors exceeds the largest among the random fitness components of the uphill neighbors by at least . Obviously, for reasonably large and a setting with rather short walks, this probability is negligible. The validity of this assumption will be ascertained later in a self-consistent way. Once the have been determined, it follows that the greedy walk takes exactly steps with probability and, in turn, the mean number of steps is
where is set to .
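The step from the probabilities of taking at least a given number of steps to the mean walk length is the usual tail-sum identity. As a small illustration, the sketch below stores these probabilities in a list `H` with `H[0] = 1` (an assumed convention, as are the function names) and computes the mean in both ways: from the probabilities of taking exactly a given number of steps, and from the sum of the tail probabilities.

```python
def mean_from_survival(H):
    """Mean walk length as the sum of H[l] over l >= 1."""
    return sum(H[1:])


def mean_from_exact_probabilities(H):
    """Mean walk length from P(exactly l steps) = H[l] - H[l + 1].

    The entry following the last element of H is taken to be zero.
    """
    padded = list(H) + [0.0]
    return sum(l * (padded[l] - padded[l + 1]) for l in range(len(H)))


# Toy values, chosen only to illustrate that both routes agree.
H = [1.0, 1.0, 0.5, 0.125]
assert mean_from_survival(H) == mean_from_exact_probabilities(H) == 1.625
```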
Let be the probability that the walker takes at least steps with and let (). Obviously,
A recursion relation for can be derived immediately from the definition:
with . Since has nearest neighbors in the uphill direction, we have considered defined in Eq. (4) in the recursion relation.
which satisfies the recursion relation with , we can write
for , which can be proved straightforwardly by mathematical induction. Thus, we get
as an exact expression for the distribution of walk length. Note that in the above derivation, the sign of does not play any role, which implies that the case of negative can be studied within the same scheme and Eq. (12) is valid for any . By symmetry, a greedy walk with negative can be interpreted as a walk starting from the reference sequence (see Sec. 3.2 for further discussion).
Since it does not appear feasible to extract simple analytic formulae from (12) for arbitrary and , below we will present approximate calculations for certain limiting cases. Before delving into detail, we derive a simple upper bound on . Since for and ,
This upper bound clearly shows that as for any when is drawn from the Gumbel distribution. That is, it is highly unlikely that a greedy walk can follow an accessible path all the way to the reference state, although such paths exist with probability 1 as shown by Hegarty and Martinsson (2014).
The limit $L \to \infty$ at finite $c$
Since Eq. (13) is valid for any , should be exponentially small for once . This self-consistently affirms the validity of the assumption used in writing down . In order to extract the limit of from (12) , we use to obtain
This expression has an appealing interpretation in terms of so-called -analogues (Koekoek et al., 2010). Recall that the -analogue of a number can be defined by , which satisfies the basic property that . Defining the -factorial as , we see that which reduces to Orr’s result in the limit , . Moreover, the mean walk length is given by
where is the -exponential function, defined as
We note for later reference that the expression (15) has been derived previously for the probability that random variables are ascendingly ordered, , where the are drawn independently from a Gumbel distribution (Franke et al., 2010). The reason for this coincidence will become clear below in Sec. 4.1.
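The $q$-analogue form lends itself to direct numerical evaluation. The following sketch assumes that the probability of taking at least $l$ steps is the inverse $q$-factorial $1/[l]_q!$ with $q = e^{-c}$, which is the form consistent with the ordering probability of Gumbel variables and with Orr's value $e - 1$ at $c = 0$, and that the mean walk length is the sum of these probabilities over $l \geq 1$. The function names and the truncation tolerance are illustrative choices.

```python
import math


def q_number(k, q):
    """q-analogue [k]_q = (1 - q**k) / (1 - q), which tends to k as q -> 1."""
    if q == 1.0:
        return float(k)
    return (1.0 - q ** k) / (1.0 - q)


def survival_probabilities(c, tol=1e-14, max_terms=1_000_000):
    """List of H_l = 1 / [l]_q! for l = 0, 1, 2, ... with q = exp(-c).

    The list is truncated once a term drops below tol (assumed cutoff).
    """
    q = math.exp(-c)
    H = [1.0]
    while H[-1] > tol and len(H) < max_terms:
        H.append(H[-1] / q_number(len(H), q))
    return H


def mean_walk_length(c):
    """Sum of H_l over l >= 1 in the limit of infinite sequence length."""
    return sum(survival_probabilities(c)[1:])
```

For example, `mean_walk_length(0.0)` agrees with $e - 1$ to within the truncation tolerance, and the value grows with increasing positive `c` while it shrinks for negative `c`.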
Approximations for large and small $c$
We next evaluate (15) for large and small , respectively. If , can be approximated as
for and . In the above approximation, we have kept terms up to in the denominator. Hence the mean distance becomes
which is close to the upper bound of Eq. (14).
For , we expand up to , which yields
Accordingly, is approximated as
where the Pochhammer symbol has been used. Since , the mean distance becomes
which reproduces the result by Orr (2003) when $c = 0$. The fact that the leading order correction is linear in $c$ implies that walks starting at the reference sequence ($c < 0$) are shorter than $e - 1$ when $|c|$ is small. We will see below in Sec. 3.2 how this result generalizes to walks starting close to, but not at, the reference sequence.
If and , and . Hence, to keep terms up to order , it is enough to consider only , which gives
Note that even if , the walker takes at least one step. This is because we take limit before limit and under this order of limits the probability that the reference sequence is a local maximum is zero for any . For later purposes we recall that the probability for a sequence at distance from the reference sequence to be a local fitness maximum is given by (Neidhart et al., 2014)
which vanishes when the limit is taken for and fixed . Thus the walker needs to take at least one step to reach a maximum.
In Fig. 1, we compare obtained from simulations of independent realizations with sequence length to the approximations Eqs. (18), (21), and (22) together with the upper bound of Eq. (14). The simulation method is explained in Appendix A. As a rule of thumb, the large approximations work well for and the approximation for becomes accurate for .
The limit $c \to \infty$ at finite $L$
For finite , it is clear that the mean walk length should approach as . This limit can be attained when is much larger than the (typical) largest value among i.i.d. random variables. For the Gumbel case, this corresponds to or . To find an approximate solution of under this condition, we go back to Eq. (12) and expand in terms of as
for and , where we have kept terms up to . Hence
As anticipated, appears as an expansion parameter and approaches as . Thus, it is quite plausible to assume a scaling form such that
where is a scaling function with asymptotic behavior for sufficiently small . That is, if we plot as a function of for sufficiently large , the data obtained for different combinations of and should collapse onto a single curve. To confirm this, we performed simulations for ranging from to . Figure 2, which is the result of independent realizations for each data point, indeed confirms the existence of such a scaling function.
3.2 Greedy walks with arbitrary starting point
In this section, we relax the assumption that the walk always starts at the antipodal sequence and calculate the mean number of steps in the case that the initial genotype has Hamming distance from the reference sequence . Note that the case treated in the previous section corresponds to and the case with in the previous section can be understood as a greedy walk starting at with positive . We consider the limit with kept finite. Since the RMF landscape is symmetric under the simultaneous transformations and , we can set to be non-negative without loss of generality.
Exact asymptotic solution
Unlike in the previous section, the initial genotype has neighbors in both the uphill and downhill directions, and we cannot exclude the possibility that the walker takes a downhill step. Assume that the walker arrives at the sequence at the -th step and . Note that need not be the same as . Now, we introduce the function
which is interpreted as the probability density that the largest fitness among the uphill (downhill) neighbors has the random contribution and all downhill (uphill) neighbors have smaller fitness when ().
As in Sec. 3.1, the probability of taking at least steps is denoted by . Since the walker may move in the uphill or downhill direction with non-negligible probability, we have to take into account all possible combinations of directions. If is the change in the distance from the antipodal sequence at the -th step, then the change in over a path is stored in an ordered set . Defining , the Hamming distance from after steps is . We assume (and will subsequently verify) that the probability that the walker takes steps is exponentially small for large . Accordingly, the scaled distance and therefore the function in (29) do not change significantly during the walk. Within this assumption, we can approximate (28) in the form
where , , , and , which is independent of .
Let be the probability that a walk has moved according to and the fitness of the sequence at the th step is smaller than . With we then have, in analogy to (8),
where the summation is over all possible combinations of . Similar to (9) one can construct a recursion relation for , which reads
where we have approximated because the relevant fitness values reside far in the tail of the distribution when is large.
If we neglect the effect of the change in on as assumed above, we get
where in the second line signifies an index-ordered product in descending order of , which should be interpreted as 1 if . The solvability of the nested chain of integrals in (33) is specific to the Gumbel distribution; see Appendix B. From Eqs. (30) and (33), we arrive at our central result
which reduces to Eq. (15) when .
Dependence of the walk length on and
Since and , the expression (34) is bounded from below and above by its values for and , respectively
In fact, using and for any real , one can easily see that , where the equality holds only when . That is, is an increasing function of , and correspondingly the mean walk length (7) decreases monotonically as the position of the starting point approaches the reference sequence, which is easily conceivable.
By contrast, the dependence of the mean walk length on is more complex. We have seen above in Sec. 3.1.3 that the walk length decreases with increasing when the walk starts at the reference sequence (), and we will now show that such an initial decrease occurs whenever . On the other hand, for very large the walk length must approach for any , and we must therefore expect a non-monotonic dependence on for . Such a behavior was already reported by Neidhart et al. (2014) on the basis of numerical simulations.
When , we can approximate up to as (see Appendix B for the derivation)
where . Accordingly, the mean number of steps becomes
where we have also expanded up to . Hence is an increasing function of for when is small enough, while for the mean walk length initially decreases with for small . Since the walk length is known to increase at large , it follows that there must be a turning point which, in the quadratic approximation (37), is given by
A comparison of Eq. (37) with simulations is shown in Fig. 3, which illustrates the accuracy of the analytic expression (37) for small . As predicted, it also confirms the absence of a turning point for . As decreases, the position of the turning point found in the simulations moves to larger , which makes the small approximation inaccurate for precisely pinpointing .
From Fig. 3, the position of the turning point seems to diverge as . When , the mean walk length decreases as for sufficiently large as shown in Sec. 3.1.3. When is very small, should therefore first decrease as , but eventually increase with for sufficiently large . As in the case of and in Sec. 3.1.3, when this quantity is expected to be well approximated by
where . In Fig. 4, we compare simulation results for small () with Eq. (39), which shows an excellent agreement as long as . Hence the turning point can be found by investigating the minimum of , which gives
where we have only kept the leading order of . Note that indeed diverges as . When , the mean walk length is well approximated by which is the result for with .
To put these results into perspective and provide an intuitive explanation of the observed non-monotonic behavior as a function of , it is instructive to compare the mean walk length to the density of local fitness maxima. Since the walk is trapped at local maxima, one generally expects an inverse relationship between the two quantities (Weinberger, 1991; Nowak and Krug, 2015). According to (23), the density of local fitness maxima at distance from the reference sequence becomes in the limit when at fixed , where we recall that . It is straightforward to check that decreases monotonically with increasing but displays a maximum as a function of for . The maximum is located at which is similar to (40) and also diverges for . We may thus conclude that, at least qualitatively, the behavior of the greedy walk length reflects that of the density of local maxima.
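The comparison with the density of local maxima can be illustrated numerically. The sketch below does not reproduce the paper's formula verbatim. It uses the expression that follows from the standard exponential-race property of Gumbel variables, under the assumption that a genotype at Hamming distance `d` has `d` uphill neighbors with fitness advantage `c` and `L - d` downhill neighbors with disadvantage `c`. The names `alpha` (the distance scaled by the sequence length), `scaled_density`, and `density_peak` are illustrative.

```python
import math


def local_max_probability(d, L, c):
    """P(a genotype at distance d from the reference is a local maximum).

    Gumbel noise is assumed: the focal genotype has to beat d uphill
    neighbors (shifted by +c) and L - d downhill neighbors (shifted by -c).
    """
    return 1.0 / (1.0 + d * math.exp(c) + (L - d) * math.exp(-c))


def scaled_density(alpha, c):
    """Large-L limit of L * local_max_probability(alpha * L, L, c)."""
    return 1.0 / (alpha * math.exp(c) + (1.0 - alpha) * math.exp(-c))


def density_peak(alpha):
    """Gradient that maximizes scaled_density for 0 < alpha < 1/2."""
    return 0.5 * math.log((1.0 - alpha) / alpha)
```

In this form the scaled density decreases with `alpha` for positive `c`, has a maximum in `c` only for `alpha` below one half, and the position of that maximum grows without bound as `alpha` tends to zero, in line with the qualitative statements above.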
4 General distribution of the random fitness component
4.1 Reformulation of the problem
Up to now, we have presented a detailed analysis of greedy adaptive walks for the case of Gumbel-distributed random fitness components. In this section, we will generalize our findings to arbitrary probability distribution functions , focusing on the limit . As in Sec. 3.2, the initial genotype from which the walker starts has the Hamming distance from the reference sequence and we take at fixed . Under these conditions the walker takes both uphill and downhill steps. As long as the number of steps taken is much smaller than , the walk dynamics can be formulated in terms of the following game:
At each round (), one generates two random variables and , where is drawn from the distribution and from . Then choose the larger one between and . Assuming that the larger one is where can be either 1 or , this number is compared to , with . If is larger than the game is over. Otherwise, we set and go to the next round. Then the mean number of steps in the greedy walk is the same as the mean number of rounds up to the end of the game.
For convenience, we introduce an event
where is defined as above. With this notation, we can write down the probability that the game persists at least up to rounds as
where the summation is over all possible sequences of ’s of length .
For all steps are in the uphill direction and (42) reduces to a single term with , which can be written as
that is, the probability that the sequence of random variables is ascendingly ordered. This quantity was studied by Franke et al. (2010) who showed that it is given by (15) when the ’s are drawn from the Gumbel distribution. To see why this result applies in the present context, we note that the distribution function of the maximum among i.i.d. Gumbel random variables is given by
which is identical to the original distribution up to an overall shift that does not affect the ordering probability (43).
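The coincidence between the walk and the ordering problem can be checked by a short Monte Carlo sketch. It assumes that the variable compared in round $k$ is the largest of $L$ Gumbel variables plus a deterministic advantage $kc$, and that the ordering probability has the product form $1/[l]_q!$ with $q = e^{-c}$. The function names, sample size, and parameter values are illustrative only.

```python
import math
import random


def gumbel(rng):
    """Draw one standard Gumbel variate by inverse-transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def ordering_probability_exact(l, c):
    """Assumed product form 1 / ([1]_q [2]_q ... [l]_q) with q = exp(-c)."""
    q = math.exp(-c)
    p = 1.0
    for k in range(1, l + 1):
        p /= sum(q ** j for j in range(k))  # [k]_q = 1 + q + ... + q**(k-1)
    return p


def ordering_probability_mc(l, c, L, samples, rng):
    """Estimate P(y_1 < ... < y_l) with y_k = k*c + max of L Gumbel draws."""
    shift = math.log(L)  # the max of L Gumbel variables is Gumbel + ln(L)
    hits = 0
    for _ in range(samples):
        y = [k * c + shift + gumbel(rng) for k in range(1, l + 1)]
        hits += all(a < b for a, b in zip(y, y[1:]))
    return hits / samples
```

Because the shift $\ln L$ is common to all rounds, the estimate does not depend on `L`, which is the point made in the text. For three rounds and `c = 0.5` the product form gives approximately 0.315.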
4.2 Extreme value classes
In order to analyze the problem for general choices of the distribution function , we exploit the fact that and converge to one of the extreme value distributions when the limit is combined with a suitable rescaling of (de Haan and Ferreira, 2006). Specifically, we introduce random variables such that , where is an integer, and are parameters that depend on but not on , and . The parameters and have to be chosen such that the distribution of has a well defined limit as , that is, such that
exists and is non-degenerate.
In terms of the transformed random variables, the event can be recast as
where and . In the following we apply this approach to the three classes of extreme value distributions.
Gumbel class As a representative of the Gumbel class of extreme value theory we choose the Weibull distribution . Setting
the limit (45) becomes the Gumbel distribution
with support , as can be seen using the approximation . Accordingly,
For the case of an exponential distribution () it follows that , and we conclude that the results derived in Sec. 3 for Gumbel-distributed random fitness components in fact apply asymptotically to all distributions with exponential tails. On the other hand, when the tail of the distribution is fatter () or thinner () than exponential, asymptotically scales to zero or infinity, respectively, when the limit is taken at fixed . This implies that greedy adaptive walks on the RMF landscape behave asymptotically like those on an uncorrelated landscape in the first case, their length approaching , whereas in the second case the walks move all the way to the reference sequence and . Because of the logarithmic dependence of on , corrections to this asymptotic behavior are however expected to be important, and can be obtained from the results of Sec. 3 by replacing with .
Fréchet class This class comprises distributions with a power law tail and can be represented by with and . Choosing and , the limit (45) becomes
with the support . Accordingly, . Assuming that remains finite when taking the limit, approaches zero and the problem becomes identical to the greedy walk on an uncorrelated landscape.
Weibull class Lastly, we consider distributions with bounded support, as represented by the distribution function with . Setting and , the limiting distribution is
with the support . Hence, in this case .
For finite , is effectively infinite so that
To summarize the results of this section, we have shown that it is only for distributions with exponential tails that the mean greedy walk length displays a non-trivial dependence on , and in this case the results of Sec. 3 carry over without modification. In all other cases a non-trivial asymptotic behavior requires that the strength of the fitness gradient is scaled with in such a way that has a finite limit for .
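The summary can be illustrated by computing the fitness gradient in units of the extreme-value scale for one representative of each class. The sketch assumes the standard norming constants for the distribution functions $1 - \exp(-x^{\beta})$ (Gumbel class), $1 - x^{-\mu}$ (Fréchet class), and $1 - (1-x)^{\nu}$ (Weibull class). The exponent names `beta`, `mu`, `nu` and the function names are illustrative and need not match the paper's notation.

```python
import math


def scale_gumbel_class(L, beta):
    """Scale for F(x) = 1 - exp(-x**beta): (ln L)**(1/beta - 1) / beta."""
    return math.log(L) ** (1.0 / beta - 1.0) / beta


def scale_frechet_class(L, mu):
    """Scale for the power-law tail F(x) = 1 - x**(-mu): L**(1/mu)."""
    return L ** (1.0 / mu)


def scale_weibull_class(L, nu):
    """Scale for bounded support F(x) = 1 - (1 - x)**nu: L**(-1/nu)."""
    return L ** (-1.0 / nu)


def effective_gradient(c, scale):
    """Fitness gradient measured in units of the extreme-value scale."""
    return c / scale


# At fixed c = 1 the effective gradient stays constant only for exponential tails.
for L in (10**2, 10**4, 10**6):
    print(
        L,
        effective_gradient(1.0, scale_gumbel_class(L, 1.0)),   # constant
        effective_gradient(1.0, scale_gumbel_class(L, 0.5)),   # shrinks
        effective_gradient(1.0, scale_gumbel_class(L, 2.0)),   # grows slowly
        effective_gradient(1.0, scale_frechet_class(L, 2.0)),  # shrinks
        effective_gradient(1.0, scale_weibull_class(L, 2.0)),  # grows
    )
```

Keeping the effective gradient fixed while the sequence length increases then amounts to scaling `c` proportionally to the chosen scale function.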
For the non-Gumbel extreme value classes characterized by the limiting distributions (50) and (51) a closed-form solution analogous to that obtained in Sec. 3 for the Gumbel class appears to be out of reach, with the exception of the Weibull class with , where the explicit formula
can be derived for (see Appendix C). In the general case we therefore resort to approximations that are valid for small and large , respectively. Apart from their intrinsic interest, these results can be used to compute corrections to the asymptotic walk length when and are both finite. Throughout we assume a general limiting distribution function with the corresponding probability density .