Genotypic complexity of Fisher’s geometric model
Fisher’s geometric model was originally introduced to argue that complex adaptations must occur in small steps because of pleiotropic constraints. When supplemented with the assumption of additivity of mutational effects on phenotypic traits, it provides a simple mechanism for the emergence of genotypic epistasis from the nonlinear mapping of phenotypes to fitness. Of particular interest is the occurrence of reciprocal sign epistasis, which is a necessary condition for multipeaked genotypic fitness landscapes. Here we compute the probability that a pair of randomly chosen mutations interacts sign epistatically, which is found to decrease with increasing phenotypic dimension , and varies nonmonotonically with the distance from the phenotypic optimum. We then derive expressions for the mean number of fitness maxima in genotypic landscapes comprised of all combinations of random mutations. This number increases exponentially with , and the corresponding growth rate is used as a measure of the complexity of the landscape. The dependence of the complexity on the model parameters is found to be surprisingly rich, and three distinct phases characterized by different landscape structures are identified. Our analysis shows that the phenotypic dimension, which is often referred to as phenotypic complexity, does not generally correlate with the complexity of fitness landscapes and that even organisms with a single phenotypic trait can have complex landscapes. Our results further inform the interpretation of experiments where the parameters of Fisher’s model have been inferred from data, and help to elucidate which features of empirical fitness landscapes can be described by this model.
inv \correspondingauthorS. Hwang, S.-C. Park, and J. Krug
Corresponding author: Department of Physics, The Catholic University of Korea, 43 Jibong-ro, Wonmi-gu, Bucheon 14662, Republic of Korea. E-mail: firstname.lastname@example.org
[lines=2]A fundamental question in the theory of evolutionary adaptation concerns the distribution of mutational effect sizes and the relative roles of mutations of small vs. large effects in the adaptive process (Orr, 2005). In his seminal 1930 monograph, Ronald Fisher devised a simple geometric model of adaptation in which an organism is described by phenotypic traits and mutations are random displacements in the trait space (Fisher, 1930). Each trait has a unique optimal value and the combination of these values defines a single phenotypic fitness optimum that constitutes the target of adaptation. Because random mutations act pleiotropically on multiple traits, the probability that a given mutation brings the phenotype closer to the target decreases with increasing . Fisher’s analysis showed that, for large , the mutational step size in units of the distance to the optimum must be smaller than for the mutation to be beneficial with an appreciable probability. He thus concluded that the evolution of complex adaptations involving a large number of traits must rely on mutations of small effect. This conclusion was subsequently qualified by the realization that small effect mutations are likely to be lost by genetic drift, and therefore mutations of intermediate size contribute most effectively to adaptation (Kimura, 1983; Orr, 1998, 2000).
During the past decade, Fisher’s geometric model (FGM) has become a standard reference point for theoretical and experimental work on fundamental aspects of evolutionary adaptation (Tenaillon, 2014). In particular, it has been found that FGM provides a versatile and conceptually simple mechanism for the emergence of epistatic interactions between genetic mutations in their effect on fitness (Martin et al., 2007; Gros et al., 2009; Blanquart et al., 2014). For this purpose, two extensions of Fisher’s original formulation of the model have been suggested. First, phenotypes are assigned an explicit fitness value, which is usually taken to be a smooth function on the trait space with a single maximum at the optimal phenotype. Second, and more importantly, mutational effects on the phenotypes are assumed to be additive. As a consequence, any deviations from additivity that arise on the level of fitness are solely due to the nonlinear mapping from phenotype to fitness, or, in mathematical terms, due to the curvature of the fitness function. Because the curvature is largest around the phenotypic optimum, epistasis generally increases upon approaching the optimal phenotype and is weak far away from the optimum. Several recent studies have made use of the framework of FGM to interpret experimental results on pairwise epistastic interactions and to estimate the parameters of the model from data (Martin et al., 2007; Velenich and Gore, 2013; Weinreich and Knies, 2013; Perfeito et al., 2014; Schoustra et al., 2016).
A particularly important form of epistatic interaction is sign epistasis, where a given mutation is beneficial or deleterious depending on the genetic background (Weinreich et al., 2005). Two types of sign epistasis are distinguished depending on whether one of the mutations affects the effect sign of the other, but the reverse is not true [simple sign epistasis (SSE)]; or whether the interaction is reciprocal [reciprocal sign epistasis (RSE)]. For a pictorial representation of the two kinds of sign epistasis, see, for example, Poelwijk et al. (2007). Sign epistasis can arise in FGM either between large effect beneficial mutations that in combination overshoot the fitness optimum, or between mutations of small fitness effect that display antagonistic pleiotropy (Blanquart et al., 2014). The presence of sign epistasis is a defining feature of genotypic fitness landscapes that are complex, in the sense that not all mutational pathways are accessible through simple hill climbing, and multiple genotypic fitness peaks may exist (Weinreich et al., 2005; Franke et al., 2011; de Visser and Krug, 2014). Specifically, RSE is a necessary condition for the existence of multiple fitness peaks (Poelwijk et al., 2011; Crona et al., 2013). Following common practice, here a genotypic fitness landscape is understood to consist of the assignment of fitness values to all combinations of haploid, biallelic loci that together constitute the -dimensional genotype space. A peak in such a landscape is a genotype that has higher fitness than all its neighbors that can be reached by a single point mutation (Kauffman and Levin, 1987). Note that, in contrast to the continuous phenotypic space on which FGM is defined, the space of genotypes is discrete.
Blanquart et al. (2014) showed that an ensemble of -dimensional genotypic landscapes can be constructed from FGM by combining subsets of randomly chosen mutational displacements. Each sample of mutations defines another realization of the landscape ensemble, and the exploratory simulations reported by Blanquart et al. (2014) indicate a large variability among the realized landscapes. Nevertheless, some general trends in the properties of the genotypic landscapes were identified. In particular, as expected on the basis of the considerations outlined above, the genotypic landscapes are essentially additive when the focal phenotype representing the unmutated wild type is far away from the optimum and become increasingly rugged as the optimal phenotype is approached.
In this article we present a detailed and largely analytic study of the properties of genotypic landscapes generated under FGM. The focus is on two types of measures of landscape complexity, that is, the fraction of sign-epistatic pairs of random mutations and the number of fitness maxima in the genotypic landscape. A central motivation for our investigation is to assess the potential of FGM and related phenotypic models to explain the properties of empirical genotypic fitness landscapes of the kind that have been recently reported in the literature (Szendro et al., 2013; Weinreich et al., 2013; de Visser and Krug, 2014). The ability of nonlinear phenotype-fitness maps to explain epistatic interactions among multiple loci has been demonstrated for a virus (Rokyta et al., 2011) and for an antibiotic resistance enzyme (Schenk et al., 2013), but a comparative study of several different data sets using approximate Bayesian computation (ABC) has questioned the broader applicability of phenotype-based models (Blanquart and Bataillon, 2016). It is thus important to develop a better understanding of the structure of genotypic landscapes generated by phenotypic models such as FGM.
In the next section we describe the mathematical setting and introduce the relevant model parameters: the phenotypic and genotypic dimensionalities and , the distance of the focal phenotype to the optimum, and the standard deviation (SD) of mutational displacements. As in previous studies of FGM, specific scaling relations among these parameters have to be imposed to arrive at meaningful results for large and . We then present analytic results for the probability of sign epistasis and the behavior of the number of fitness maxima for large , both in the case of fixed phenotypic dimension and for a situation where the joint limit is taken at constant ratio .
Similar to other probabilistic models of genotypic fitness landscapes (Kauffman and Levin, 1987; Weinberger, 1991; Evans and Steinsaltz, 2002; Durrett and Limic, 2003; Limic and Pemantle, 2004; Neidhart et al., 2014), the number of maxima generally increases exponentially with , and we use the exponential growth rate as a measure of genotypic complexity. We find that this quantity displays several phase transitions as a function of the parameters of FGM which separate parameter regimes characterized by qualitatively different landscape structures. Depending on the regime, the genotypic landscapes induced by FGM become more or less rugged with increasing phenotypic dimension. This indicates that the role of the number of phenotypic traits in shaping the fitness landscapes of FGM is much more subtle than has been previously appreciated, and that the sweeping designation of as (phenotypic) “complexity” can be misleading. Further implications of our study for the theory of adaptation and the interpretation of empirical data will be elaborated in the Discussion.
Basic properties of FGM
In FGM, the phenotype of an organism is modeled as a set of real-valued traits and represented by a vector in the dimensional Cartesian space, . The fitness is assumed to be a smooth, single-peaked function of the phenotype . By choosing an appropriate coordinate system, the optimum phenotype, i.e., the combination of phenotypic traits with the highest fitness value, can be placed at the origin in . We also assume that the fitness depends on the distance to the optimum but not on the direction of , which can be justified by arguments based on random matrix theory (Martin, 2014). The uniqueness of the phenotypic optimum at the origin implies that is a decreasing function of . The form of the fitness function will be specified below when needed. Most of the results presented in this article are, however, independent of the explicit shape of , as they rely solely on the relative ordering of different genotypes with respect to their fitness.
When a mutation arises the phenotype of the mutant becomes , where is the parental phenotype and the mutational vector corresponds to the change of traits due to the mutation. The key result derived by Fisher (1930) concerns the fraction of beneficial mutations arising from a wild-type phenotype located at distance from the optimum. Assuming that mutational displacements have a fixed length and random directions, he showed that for
where denotes the complementary error function and . Thus, for large the mutational step size has to be much smaller than the distance to the optimum, , for the mutation to have a chance of increasing fitness.
As has become customary in the field, we here assume that the mutational displacements are independent and identically distributed random variables drawn from an -dimensional Gaussian distribution with zero mean. The covariance matrix can be taken to be of diagonal form , where is the -dimensional identity matrix and is the variance of a single trait (Blanquart et al., 2014). In the limit , the form of the distribution of the mutational displacements becomes irrelevant owing to the central limit theorem (CLT), and therefore Fisher’s result of Equation 1 also holds in the present setting of Gaussian mutational displacements of mean size (Waxman and Welch, 2005; Ram and Hadany, 2015); an explicit derivation will be provided below. Because lengths in the phenotype space can be naturally measured in units of , the parameters and should always appear as the ratio , as can be seen in Equation 1. Thus, without loss of generality, we can set . In the following we denote the (scaled) wild type phenotype by , its distance to the optimum by
and draw the displacement vectors from the -dimensional Gaussian density with unit covariance matrix.
By normalizing phenotypic distances to the SD of the mutational effect on a single trait, we are adopting a particular pleiotropic scaling that has been referred to as the “Euclidean superposition model” (Wagner et al., 2008; Hermisson and McGregor, 2008). An alternative choice which is closer to Fisher’s original formulation but appears to have less empirical support is the “total effect model,” wherein the total length of the mutational displacements is taken to be independent of . Since , this implies that the single trait effect size decreases with as . As a consequence, the parameter defined by Equation 2 becomes dependent and increases as , provided does not depend on (Orr, 2000). The results presented below will always be given in terms of ratios of the basic parameters of FGM, such that their translation to the total effect model is in principle straightforward. We will nevertheless explicitly point out instances where the two settings give rise to qualitatively different behaviors.
The genotypic fitness landscape induced by FGM
To study epistasis within FGM, Fisher’s original definition has to be supplemented with a rule for how the effects of multiple mutations are combined. Based on earlier work (Lande, 1980) in quantitative genetics, Martin et al. (2007) introduced the assumption that mutations act additively on the level of the phenotype. Thus the phenotype arising from two mutations , applied to the wild-type is simply given by . This definition suffices to associate an -dimensional genotypic fitness landscape to any set of mutational displacements (Blanquart et al., 2014). For this purpose the haploid genotype is represented by a binary sequence with length , with () in the presence (absence) of the th mutation. For the wild type for all , and in general the phenotype vector associated with the genotype reads
Two examples illustrating this genotype-phenotype map and the resulting genotypic fitness landscapes with and are shown in Figure 1.
As can be seen from the figure, the projection of the discrete genotype space onto the continuous phenotype space can give rise to multiple genotypic fitness maxima, although the phenotypic landscape is single peaked. It is the assumption of a finite (and hence discrete) set of phenotypic mutation vectors that distinguishes our setting from much of the earlier work on FGM, where mutations are drawn from a continuum of alleles (Fisher, 1930; Orr, 1998, 2000, 2005) and the probability of further improvement (as given by Equation 1) vanishes only strictly at the phenotypic optimum. Remarkably, our analysis shows that the conventional setting is not simply recovered by taking the number of mutational vectors to infinity; rather, the number of genotypic fitness maxima is found to increase exponentially with .
Since fitness decreases monotonically with the distance to the optimum phenotype, a natural proxy for fitness is the negative squared magnitude of the phenotype vector
where denotes the scalar product between two vectors and . This quantity is thus seen to consist of a part that is additive across loci with coefficients given by the scalar products , and a pairwise epistatic part with coefficients .
The first term on the right-hand side contains both additive and epistatic contributions associated with displacements along the direction. The second term is dominated by the diagonal contributions with and is of order because on average.
We now show how the first term on the right-hand side of Equation 5 can be made to vanish for a range of . For this purpose, consider the subset of phenotypic displacement vectors for which the component in the direction of is negative. There are on average such mutations, and the expected value of each component is
where the factor 2 in front of the integral arises from conditioning on . Setting for out of these vectors and for all other mutations, the sum inside the brackets in Equation 5 becomes approximately equal to , which cancels the term for . Since can be at most in a typical realization, such genotypes can be constructed with a probability approaching unity provided .
We will see below that the structure of the genotypic fitness landscapes induced by FGM depends crucially on whether or not the phenotypes of multiple mutants are able to closely approach the phenotypic optimum. Assuming that the contributions from the perpendicular displacements in Equation 5 can be neglected, which will be justified shortly, the simple argument given above shows that a close approach to the optimum is facile when , but becomes unlikely when . This observation hints at a possible transition between different types of landscape topographies at some value of which is proportional to . The existence and nature of this transition is a central theme of this article.
Since we are interested in describing complex organisms with large phenotypic and genotypic dimensions, appropriate scaling relations have to be imposed to arrive at meaningful asymptotic results. Three distinct scaling limits will be considered.
Fisher’s classic result (Equation 1) shows that the distance of the wild type from the phenotypic optimum has to be increased with increasing to maintain a nonzero fraction of beneficial mutations for . In our notation Fisher’s parameter is
and hence Fisher scaling implies taking at fixed ratio . We will extend Fisher’s analysis by computing the probability of sign epistasis between pairs of mutations for fixed and large , which amounts to characterizing the shape of genotypic fitness landscapes of size .
We have argued above that the distance toward the phenotypic optimum that can be covered by typical multiple mutations is of order , and hence the limit is naturally accompanied by a limit at fixed ratio
From a biological point of view, one expects that , which motivates considering the limit at constant phenotypic dimension . Under this scaling, the first term on the right-hand side of Equation 5 is of order , whereas the contribution from the perpendicular displacements is only . Thus in this regime the topography of the fitness landscape is determined mainly by the one-dimensional mutational displacements in the direction, which is reflected by the fact that the genotypic complexity is independent of to leading order and coincides with its value for the case , in which the perpendicular contribution in Equation 5 does not exist (see Results).
By contrast, the perpendicular displacements play an important role when both the phenotypic and genotypic dimensions are taken to infinity at fixed ratio
|Number of phenotypic traits, also referred to as phenotypic dimension|
|Length of the binary genetic sequence, also referred to as genotypic dimension|
|Ratio of phenotypic to genotypic dimension ()|
|,||-dimensional vector representing the wild-type phenotype and its magnitude|
|,||Wild-type phenotype vector in units of () and its magnitude ()|
|Genotype represented by a binary sequence of length|
|Binary number indicating absence (0) or presence (1) of a mutation at site () of the genotype|
|,||Phenotype vector corresponding to the genotype (Equation 3) and its magnitude|
|,||Random phenotypic displacement vector representing a mutation and its magnitude|
|Probability density of the random vector . In this article, the density is Gaussian with unit covariance matrix|
|Fisher’s scaling parameter; in our notation|
|Total number of fitness maxima in a genotypic fitness landscape averaged over all realizations of sets of ’s|
|Genotypic complexity defined as the ratio of to for ; see Equation 13|
|Transition point of that separates regimes I and II|
|Half of the average mutational displacement of a single trait conditioned on being positive ()|
|Fraction of mutations that are present in a genotype ,|
|Mean value of of a local maximum, also referred to as mean genotypic distance from the wild type|
|Mean value of of a local maximum, also referred to as mean phenotypic distance from the optimum|
Preliminary considerations about genotypic fitness maxima
To set the stage for the detailed investigation of the number of genotypic fitness maxima in Results, it is useful to develop some intuition for the behavior of this quantity based on the elementary properties of FGM that have been described so far. For this purpose we consider the probability for the wild type to be a local fitness maximum, which is equal to the probability that all the mutations are deleterious. Since mutations are statistically independent, we have
where is the error function. Under the (highly questionable) assumption that this estimate can be applied to all genotypes in the landscape, we arrive at the expression
for the expected number of genotypic fitness maxima.
Consider first the scaling limit 2, where . Expanding the error function for small arguments as we obtain
for , where was defined in Equation 6. We will show below that this expression correctly captures the asymptotic behavior for very large but generally grossly underestimates the number of maxima. The reason for this is that for moderate values of (in particular for ), the relevant mutant phenotypes are much closer to the origin than the wild type, which entails a mechanism for generating a large number of fitness maxima that grows exponentially with .
Such an exponential dependence on is expected from Equation 11 in the scaling limit 3, where is a nonzero constant and the expression in the square brackets is . Although this general prediction is confirmed by the detailed analysis for this case, the behavior of the number of maxima predicted by Equation 11 will again turn out to be valid only when is very large. In particular, whereas Equation 11 is an increasing function of for any , we will see below that the expected number of maxima actually decreases with increasing phenotypic dimension (hence increasing ) in a substantial range of . In qualitative terms, this can be attributed to the effect of the perpendicular displacements in Equation 5, which grows with and makes it increasingly more difficult for the mutant phenotypes to closely approach the origin.
The observation that the number of genotypic fitness maxima grows exponentially with in most cases motivates us to make use of the corresponding growth rate as a measure of the ruggedness of the landscape. We therefore define the genotypic complexity through the limiting relation
where is the average number of genotypic fitness maxima and is the sequence length. Since the total number of binary genotypes is , the complexity is bounded from above by . If any genotype had the same probability of being a fitness maximum (which is in fact not the case for FGM), we could write and hence .
The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. All numerical calculations including simulations described in this work were implemented in Mathematica and C++. When counting the number of local genotypic maxima, we checked all genotypes and counted the exact number for a randomly realized landscape, then took an average. All relevant source codes are available upon request.
In the following sections our results on the structure of genotypic fitness landscapes induced by FGM are stated in precise mathematical terms and the key steps of their derivation are outlined, with some technical details relegated to the appendices. To facilitate the navigation through the inevitable mathematical formalism, we display the definitions of the most commonly used mathematical symbols in Table 1. Moreover, we provide numbered summaries at the end of each subsection which state the main results without resorting to mathematical expressions.
We first study the local topography of the fitness landscape around the wild type, focusing on the epistasis between two random mutations with phenotypic displacements and . Since fitness is determined by the magnitude of a phenotypic vector, i.e., the distance of the phenotype from the origin, the epistatic effect of the two mutations can be understood by analyzing how the magnitudes of the four vectors , , and are ordered. To this end, we introduce the quantities
where division by guarantees the existence of a finite limit for . The sign of these quantities determines whether a mutation is beneficial or deleterious. For example, if , the mutation is beneficial; if the two mutations combined together confer a deleterious effect; and so on. We will see later that and are actually closely related to the selection coefficients of the respective mutations.
We proceed to express the different types of pairwise epistasis defined by Weinreich et al. (2005) and Poelwijk et al. (2007) in terms of conditions on the quantities defined in Equation Random mutations:. Without loss of generality we assume and consider first the case where both mutations are beneficial, . Then magnitude epistasis (ME), the absence of sign epistasis, applies when the fitness of the double mutant is higher than that of each of the single mutants, i.e., . Similarly, for two deleterious mutations the condition for ME reads . When one mutant is deleterious and the other beneficial, in the case of ME, the double mutant fitness has to be intermediate between the two single mutants, which implies that when .
The condition for RSE reads when both single mutants are beneficial and when both are deleterious, and the remaining possibility corresponds to SSE between two mutations of the same sign. If the two single mutant effects are of different signs, RSE is impossible and SSE applies when or . Figure 2 depicts the different categories of epistasis as regions in the plane. Note that the corresponding picture for is obtained by exchanging .
To find the probability of each epistasis, we require the joint probability density . In Appendix A it is shown that
which can be obtained rather easily by resorting to the CLT. The applicability of the CLT follows from the fact that and are sums of a large number of independent terms for (Waxman and Welch, 2005; Ram and Hadany, 2015). According to the CLT, it is sufficient to determine the first and second cumulants of these quantities. Denoting averages by angular brackets, we find the mean , the variance , and the covariance (). Similarly, the corresponding quantitites evaluated for are , , and (). With an appropriate normalization constant, this leads directly to Equation 15.
As a first application, we rederive Fisher’s Equation 1 by integrating over the region for all and , which indeed yields
An immediate conclusion from the form of is that it is unlikely to observe sign epistasis for large , because becomes concentrated along the line as increases. As can be seen in Figure 2, this line touches the region of SSE in one point for , whereas it maintains a finite distance to the region of RSE everywhere. This indicates that the probability of RSE decays more rapidly with increasing than the probability of SSE. Moreover, one expects the latter probability to be proportional to the width of the region around the line , where the joint probability in Equation 15 has appreciable weight, which is of order .
for the probabilities of RSE () and SSE (). Due to the nonlinearity of the phenotype-fitness map, FGM does not allow for strictly nonepistatic combination of fitness effects. The probability of ME, therefore, is given by . Interestingly, the probability of sign epistasis varies nonmonotonically with . To confirm our analytic results, we compare our results with simulations in Figure 3, which shows an excellent agreement.
Similarly, we can calculate the probabilities of sign epistasis conditioned on both mutations being beneficial, which in our setting means . The conditioning requires normalization by the unconditional probability of two random mutations being beneficial, which is given by the square of in Equation 1. Hence
As anticipated from the form of Equation 15, the fraction of sign-epistatic pairs of mutations decreases with increasing phenotypic dimension , and this decay is faster for RSE () than for SSE (). At first glance this might seem to suggest that FGM has little potential for generating rugged genotypic fitness landscapes. However, as we will see below, the results obtained in this section apply only to the immediate neighborhood of the wild-type phenotype. They are modified qualitatively in the presence of a large number of mutations that are able to substantially displace the phenotype and allow it to approach the phenotypic optimum.
Mutations of fixed effect size:
As a slight variation to the previous setting, one may consider the fraction of sign epistasis conditioned on the two single mutations to have the same selection strength, as recently investigated by Schoustra et al. (2016). In our notation this implies that , and it is easy to see that sign epistasis is always reciprocal in this case. If the two mutations are beneficial, , and the condition for (reciprocal) sign epistasis is . The corresponding probability is
Following the same procedure for deleterious mutations () one finds that the probability is actually symmetric around and hence depends only on .
To express in terms of the selection coefficient of the single mutations, we introduce a Gaussian phenotypic fitness function of the form
where is a measure for the strength of selection. The selection coefficient of a mutation with phenotypic effect is then given by
To fix the value of we note that the largest possible selection coefficient, which is achieved for mutations that reach the phenotypic optimum, is , and hence is related to the selection coefficient through . With this substitution, the result in Equation 20 becomes
The probability of sign epistasis conditioned on selection strength takes on its maximal value in the neutral limit and decreases monotonically with . Similar to the results of Equations 16, 17, and 18 for unconstrained mutations, it also decreases with increasing phenotypic dimension when and are kept fixed.
In a previous numerical study carried out at finite and , it was found that varies nonmonotonically with for the case of beneficial mutations, and displays a second peak at the maximum selection coefficient (Schoustra et al., 2016). The two peaks were argued to reflect the two distinct mechanisms giving rise to sign epistasis within FGM (Blanquart et al., 2014). Mutations of small effect correspond to phenotypic displacements that proceed almost perpendicularly to the direction of the phenotypic optimum, and sign epistasis is generated through antagonistic pleiotropy. On the other hand, for mutations of large effect, the dominant mechanism for sign epistasis is through overshooting of the phenotypic optimum. Because of the Fisher scaling implemented in this section with at fixed , the second class of mutations cannot be captured by our approach and only the peak at small remains. Figure 4A shows the full two-peak structure for a few representative values of , and Figure 4B illustrates the convergence to the asymptotic expression Equation 23 for the left peak. Using the results of Schoustra et al. (2016), it can be shown that the right peak becomes a step function for , displaying a discontinuous jump from to at .
When the phenotypic dimension is large and the Fisher parameter is moderate, the probability of RSE decays as , while that of SSE decays as . Although these probabilities decrease monotonically with at fixed , they have a nonmonotonic behavior as a function of : For small they increase with and for large they decrease with (see Figure 3). Under the pleiotropic scaling adopted in this work, this implies that the probabilities are nonmonotonic function of the wild-type distance at fixed and vice versa. In contrast, under the total effect model, where both the wild-type distance and scale as , the probabilities decrease monotonically and exponentially with .
Genotypic complexity at a fixed phenotypic dimension
In this section, we are interested in the number of local maxima in the genotypic fitness landscape. We focus on the expected number of maxima, which we denote by , and analyze how this quantity behaves in the limit of large genotypic dimension, , when the phenotypic dimension is fixed. For the sake of clarity, the (unique) maximum of the phenotypic fitness landscape will be referred to as the phenotypic optimum throughout.
The number of local fitness maxima:
Since fitness decreases monotonically with the distance to the phenotypic optimum, a genotype is a local fitness maximum if the corresponding phenotype defined by Equation 3 satisfies
for all . The phenotype vector appearing on the right-hand side of this inequality arises from , either by removing a mutation vector that is already part of the sum in Equation 3 () or by adding a mutation vector that was not previously present (). The condition in Equation 24 is obviously always fulfilled if , that is, if the phenotype is optimal, and we will see that in general the probability for this condition to be satisfied is larger the more closely the phenotype approaches the origin. A graphical illustration of the condition in Equation 24 is shown in Figure 5.
The ability of a phenotype to approach the origin clearly depends on the number of mutant vectors it is composed of, and all phenotypes with the same number of mutations are statistically equivalent. The expected number of fitness maxima can therefore be decomposed as
where is the number of possible combinations of out of mutation vectors and is the probability that a genotype with mutations is a fitness maximum. The latter can be written as
Here and below, stands for the integral over .
Equation 26 can be understood as follows. First, the function constrains to be the phenotype of as defined in Equation 3. Next, the integration domains of the ’s reflect the condition in Equation 24. Assuming, without loss of generality, that the genetic loci are ordered such that for and for , the maximum condition for requires , so the integration domain should be ; whereas for the condition is , corresponding to the integration domain . Using the integral representation of the function
we can write
It was argued on qualitative grounds in Model that phenotypes that approach arbitrarily close to the origin are easily generated when the scaled wild-type distance is small, but they become rare for large . As a consequence, it turns out that the main contribution to the integral over in Equation 29 comes from the region around the origin for small , but shifts to a distance along the direction for large . To account for this possibility, it is necessary to divide the integral domain into two parts, and , where is an arbitrary non-zero number with as . Thus, we write as
and correspondingly define and as
The total number of local maxima is then .
We first consider . Expanding around the origin , we show in Appendix C that
where is the volume of the ball. We need to estimate how small has to be for to be a local fitness maximum with an appreciable probability. Since the random vectors contributing to are statistically equivalent, it is plausible to assume that their average component parallel to is . We further assume that the conditional probability density of these vectors, conditioned on their sum reaching the ball around the origin, can be approximated by a Gaussian, which consequently has the form
For to be a phenotype vector of a local maximum, all these random vectors should lie in the region and the remaining (unconstrained) vectors should lie in . This event happens with probability
Thus, we can estimate the typical value of as the solution of
To find the asymptotic behavior of for large , we use Stirling’s formula in Equation 33 and approximate the summation over by an integral over . This yields
where the exponent is given by
Under the condition , the remaining integral with respect to can be performed by expanding to second order around the saddle point determined by the condition
Performing the resulting Gaussian integral with respect to one finally obtains
where is the solution of Equation 41, which is the (scaled) mean number of mutations in a local maximum. We will call the mean genotypic distance. This solution is not available in closed form, but it can be shown that and for small . Figure 6 compares Equation 42 with the mean number of local maxima obtained by numerical simulations for various ’s with , to show an excellent agreement even for .
It is obvious that will eventually be negative as increases for any value of , and this must be true also for the maximum value . Indeed, we found the threshold , above which is negative. This signals a phase transition in the landscape properties. Inspection of Equation 40 shows that the transition is driven by a competition between the abundance of genotypes with a certain number of mutations and their likelihood to bring the phenotype close to the optimum. The first two terms in the expression for are the standard sequence entropy [see, for example, (Schmitt and Herzel, 1997)] which is maximal at (), whereas the last term represents the statistical cost associated with “stretching” the phenotype toward to origin. With increasing , the genotypes contributing to the formation of local maxima become increasingly atypical, in the sense that they contain more than the typical fraction of mutations, and increases. For the cost can no longer be compensated by the entropy term and becomes negative. In this regime decreases exponentially with , and therefore the total number of fitness maxima , which by construction cannot be , must be dominated by the second contribution .
We defer the detailed derivation of to Appendix C and here only report the final result obtained in the limit , which is independent of and reads
This expression is valid for , but it dominates the contribution for large only when . Figure 7 indeed shows that Equation 43 approximates the mean number of local maxima for , that is, converges to for large . This figure also shows, as is clear by Equation 43, that is a increasing (decreasing) function of () for a fixed value of (). The expected number of maxima is small in absolute terms in this regime, which can be attributed to the fact that the expression inside the parentheses in Equation 43 takes the value at , and decreases rapidly toward unity for larger .
To understand the appearance of , we refer to Model, where it was argued that is the maximal distance toward the origin, which can be covered by a phenotype made up of typical mutation vectors. Correspondingly, the analysis in Appendix C shows that the main contribution to comes from phenotypes located at a distance from the origin, i.e., at a distance from the wild type. The sum over in Equation 33 is dominated by typical genotypes with , and therefore the main contribution to comes from phenotypes at a distance from the origin. The seeming divergence of as is an artifact of the approximation scheme, which assumes that the main contribution comes from the region where ; clearly this assumption becomes invalid when . We note that for very large and large , Equation 43 reduces to the expression obtained in Equation 11 on the basis of Fisher’s formula for the fraction of beneficial mutations from the wild-type phenotype.
To sum up, the leading behavior of is
with and given by Equation 42 and Equation 43, respectively. Since decreases to zero with in a power-law fashion at , the dominant contribution at this value is . At , the mean genotypic distance jumps discontinuously from to ; and the mean phenotypic distance , which is defined as the averaged magnitude of phenotype vectors for local maxima, jumps from to . The genotypic complexity defined in Equation 13 is given by