Genotypic complexity of Fisher’s geometric model

A fundamental question in the theory of evolutionary adaptation concerns the distribution of mutational effect sizes and the relative roles of mutations of small vs. large effects in the adaptive process [32]. In his seminal 1930 monograph, Ronald Fisher devised a simple geometric model of adaptation in which an organism is described by $n$ phenotypic traits and mutations are random displacements in the trait space [12]. Each trait has a unique optimal value and the combination of these values defines a single phenotypic fitness optimum that constitutes the target of adaptation. Because random mutations act pleiotropically on multiple traits, the probability that a given mutation brings the phenotype closer to the target decreases with increasing $n$. Fisher’s analysis showed that, for large $n$, the mutational step size in units of the distance to the optimum must be smaller than $1/\sqrt{n}$ for the mutation to be beneficial with an appreciable probability. He thus concluded that the evolution of complex adaptations involving a large number of traits must rely on mutations of small effect. This conclusion was subsequently qualified by the realization that small effect mutations are likely to be lost by genetic drift, and therefore mutations of intermediate size contribute most effectively to adaptation [21].

During the past decade, Fisher’s geometric model (FGM) has become a standard reference point for theoretical and experimental work on fundamental aspects of evolutionary adaptation [49]. In particular, it has been found that FGM provides a versatile and conceptually simple mechanism for the emergence of epistatic interactions between genetic mutations in their effect on fitness [25]. For this purpose, two extensions of Fisher’s original formulation of the model have been suggested. First, phenotypes are assigned an explicit fitness value, which is usually taken to be a smooth function on the trait space with a single maximum at the optimal phenotype. Second, and more importantly, mutational effects on the phenotypes are assumed to be additive. As a consequence, any deviations from additivity that arise on the level of fitness are solely due to the nonlinear mapping from phenotype to fitness, or, in mathematical terms, due to the curvature of the fitness function. Because the curvature is largest around the phenotypic optimum, epistasis generally increases upon approaching the optimal phenotype and is weak far away from the optimum. Several recent studies have made use of the framework of FGM to interpret experimental results on pairwise epistatic interactions and to estimate the parameters of the model from data [25].

A particularly important form of epistatic interaction is sign epistasis, where a given mutation is beneficial or deleterious depending on the genetic background [56]. Two types of sign epistasis are distinguished depending on whether one of the mutations affects the effect sign of the other, but the reverse is not true [simple sign epistasis (SSE)]; or whether the interaction is reciprocal [reciprocal sign epistasis (RSE)]. For a pictorial representation of the two kinds of sign epistasis, see, for example, [38]. Sign epistasis can arise in FGM either between large effect beneficial mutations that in combination overshoot the fitness optimum, or between mutations of small fitness effect that display antagonistic pleiotropy [3]. The presence of sign epistasis is a defining feature of genotypic fitness landscapes that are complex, in the sense that not all mutational pathways are accessible through simple hill climbing, and multiple genotypic fitness peaks may exist [56]. Specifically, RSE is a necessary condition for the existence of multiple fitness peaks [39]. Following common practice, here a genotypic fitness landscape is understood to consist of the assignment of fitness values to all combinations of $L$ haploid, biallelic loci that together constitute the $L$-dimensional genotype space. A peak in such a landscape is a genotype that has higher fitness than all its neighbors that can be reached by a single point mutation [20]. Note that, in contrast to the continuous phenotypic space on which FGM is defined, the space of genotypes is discrete.

[3] showed that an ensemble of $L$-dimensional genotypic landscapes can be constructed from FGM by combining subsets of $L$ randomly chosen mutational displacements. Each sample of mutations defines another realization of the landscape ensemble, and the exploratory simulations reported by [3] indicate a large variability among the realized landscapes. Nevertheless, some general trends in the properties of the genotypic landscapes were identified. In particular, as expected on the basis of the considerations outlined above, the genotypic landscapes are essentially additive when the focal phenotype representing the unmutated wild type is far away from the optimum and become increasingly rugged as the optimal phenotype is approached.

In this article we present a detailed and largely analytic study of the properties of genotypic landscapes generated under FGM. The focus is on two types of measures of landscape complexity, that is, the fraction of sign-epistatic pairs of random mutations and the number of fitness maxima in the genotypic landscape. A central motivation for our investigation is to assess the potential of FGM and related phenotypic models to explain the properties of empirical genotypic fitness landscapes of the kind that have been recently reported in the literature [47]. The ability of nonlinear phenotype-fitness maps to explain epistatic interactions among multiple loci has been demonstrated for a virus [41] and for an antibiotic resistance enzyme [43], but a comparative study of several different data sets using approximate Bayesian computation (ABC) has questioned the broader applicability of phenotype-based models [4]. It is thus important to develop a better understanding of the structure of genotypic landscapes generated by phenotypic models such as FGM.

In the next section we describe the mathematical setting and introduce the relevant model parameters: the phenotypic and genotypic dimensionalities $n$ and $L$, the distance $Q$ of the focal phenotype to the optimum, and the standard deviation (SD) of mutational displacements. As in previous studies of FGM, specific scaling relations among these parameters have to be imposed to arrive at meaningful results for large $n$ and $L$. We then present analytic results for the probability of sign epistasis and the behavior of the number of fitness maxima for large $L$, both in the case of fixed phenotypic dimension $n$ and for a situation where the joint limit $n, L \to \infty$ is taken at constant ratio $n/L$.

Similar to other probabilistic models of genotypic fitness landscapes [20], the number of maxima generally increases exponentially with $L$, and we use the exponential growth rate as a measure of genotypic complexity. We find that this quantity displays several phase transitions as a function of the parameters of FGM which separate parameter regimes characterized by qualitatively different landscape structures. Depending on the regime, the genotypic landscapes induced by FGM become more or less rugged with increasing phenotypic dimension. This indicates that the role of the number of phenotypic traits in shaping the fitness landscapes of FGM is much more subtle than has been previously appreciated, and that the sweeping designation of $n$ as (phenotypic) “complexity” can be misleading. Further implications of our study for the theory of adaptation and the interpretation of empirical data will be elaborated in the Discussion.

Model

Basic properties of FGM

In FGM, the phenotype of an organism is modeled as a set of $n$ real-valued traits and represented by a vector $\vec{y}$ in the $n$-dimensional Cartesian space, $\vec{y} \in \mathbb{R}^n$. The fitness $W(\vec{y})$ is assumed to be a smooth, single-peaked function of the phenotype $\vec{y}$. By choosing an appropriate coordinate system, the optimum phenotype, i.e., the combination of phenotypic traits with the highest fitness value, can be placed at the origin in $\mathbb{R}^n$. We also assume that the fitness depends on the distance $|\vec{y}|$ to the optimum but not on the direction of $\vec{y}$, which can be justified by arguments based on random matrix theory [24]. The uniqueness of the phenotypic optimum at the origin implies that $W(\vec{y})$ is a decreasing function of $|\vec{y}|$. The form of the fitness function will be specified below when needed. Most of the results presented in this article are, however, independent of the explicit shape of $W(\vec{y})$, as they rely solely on the relative ordering of different genotypes with respect to their fitness.

When a mutation arises the phenotype of the mutant becomes $\vec{y} + \vec{\xi}$, where $\vec{y}$ is the parental phenotype and the mutational vector $\vec{\xi}$ corresponds to the change of traits due to the mutation. The key result derived by [12] concerns the fraction $P_b$ of beneficial mutations arising from a wild-type phenotype located at distance $d$ from the optimum. Assuming that mutational displacements have a fixed length $|\vec{\xi}| = r$ and random directions, he showed that for $n \gg 1$

$$P_b = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{x}{\sqrt{2}}\right), \tag{1}$$

where $\mathrm{erfc}$ denotes the complementary error function and $x = r\sqrt{n}/(2d)$. Thus, for large $n$ the mutational step size has to be much smaller than the distance to the optimum, $r \lesssim d/\sqrt{n} \ll d$, for the mutation to have a chance of increasing fitness.
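As a numerical illustration of Fisher's result, the following Python sketch compares a direct simulation of fixed-length random displacements with the large-$n$ formula. It assumes the standard form of the result, $P_b = \frac{1}{2}\mathrm{erfc}(x/\sqrt{2})$ with $x = r\sqrt{n}/(2d)$; the function names and parameter values are illustrative choices and are not taken from the original analysis.

```python
import math
import random


def fisher_beneficial_fraction(x):
    """Large-n approximation for the fraction of beneficial mutations."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simulated_beneficial_fraction(n, d, r, trials=20000, seed=1):
    """Fraction of random displacements of fixed length r that bring a
    phenotype at distance d from the optimum closer to the optimum."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Random direction: normalise an isotropic Gaussian vector.
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in g))
        # By symmetry the wild type can be placed at (d, 0, ..., 0), so that
        # |Q + xi|^2 < d^2 is equivalent to 2 * d * xi_1 + r^2 < 0.
        xi_1 = r * g[0] / norm
        if 2.0 * d * xi_1 + r * r < 0.0:
            hits += 1
    return hits / trials


n, d, x = 50, 1.0, 0.5
r = 2.0 * d * x / math.sqrt(n)  # step length corresponding to the chosen x
print(simulated_beneficial_fraction(n, d, r), fisher_beneficial_fraction(x))
```

At moderate $n$ the two numbers should agree up to sampling noise and small finite-$n$ corrections.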

As has become customary in the field, we here assume that the mutational displacements are independent and identically distributed random variables drawn from an $n$-dimensional Gaussian distribution with zero mean. The covariance matrix can be taken to be of diagonal form $\sigma^2 I$, where $I$ is the $n$-dimensional identity matrix and $\sigma^2$ is the variance of a single trait [3]. In the limit $n \to \infty$, the form of the distribution of the mutational displacements becomes irrelevant owing to the central limit theorem (CLT), and therefore Fisher’s result of Equation 1 also holds in the present setting of Gaussian mutational displacements of mean size $r = \sigma\sqrt{n}$ [52]; an explicit derivation will be provided below. Because lengths in the phenotype space can be naturally measured in units of $\sigma$, the parameters $d$ and $\sigma$ should always appear as the ratio $d/\sigma$, as can be seen in Equation 1. Thus, without loss of generality, we can set $\sigma = 1$. In the following we denote the (scaled) wild type phenotype by $\vec{Q}$, its distance to the optimum by

$$Q = |\vec{Q}| = \frac{d}{\sigma}, \tag{2}$$

and draw the displacement vectors from the $n$-dimensional Gaussian density with unit covariance matrix.

By normalizing phenotypic distances to the SD of the mutational effect on a single trait, we are adopting a particular pleiotropic scaling that has been referred to as the “Euclidean superposition model” [51]. An alternative choice which is closer to Fisher’s original formulation but appears to have less empirical support is the “total effect model,” wherein the total length $r$ of the mutational displacements is taken to be independent of $n$. Since $r = \sigma\sqrt{n}$, this implies that the single trait effect size $\sigma$ decreases with $n$ as $\sigma = r/\sqrt{n}$. As a consequence, the parameter $Q$ defined by Equation 2 becomes $n$ dependent and increases as $\sqrt{n}$, provided the unscaled distance $d$ does not depend on $n$ [29]. The results presented below will always be given in terms of ratios of the basic parameters of FGM, such that their translation to the total effect model is in principle straightforward. We will nevertheless explicitly point out instances where the two settings give rise to qualitatively different behaviors.

The genotypic fitness landscape induced by FGM

To study epistasis within FGM, Fisher’s original definition has to be supplemented with a rule for how the effects of multiple mutations are combined. Based on earlier work [22] in quantitative genetics, [25] introduced the assumption that mutations act additively on the level of the phenotype. Thus the phenotype arising from two mutations $\vec{\xi}_1$, $\vec{\xi}_2$ applied to the wild-type $\vec{Q}$ is simply given by $\vec{Q} + \vec{\xi}_1 + \vec{\xi}_2$. This definition suffices to associate an $L$-dimensional genotypic fitness landscape to any set of $L$ mutational displacements [3]. For this purpose the haploid genotype $\tau$ is represented by a binary sequence $\tau = (\tau_1, \ldots, \tau_L)$ with length $L$, with $\tau_i = 1$ ($\tau_i = 0$) in the presence (absence) of the $i$th mutation. For the wild type $\tau_i = 0$ for all $i$, and in general the phenotype vector associated with the genotype $\tau$ reads

$$\vec{z}(\tau) = \vec{Q} + \sum_{i=1}^{L} \tau_i \vec{\xi}_i. \tag{3}$$

Two examples illustrating this genotype-phenotype map and the resulting genotypic fitness landscapes with $L=3$ and $n=2$ are shown in Figure 1.

Figure 1: Examples of three-dimensional genotypic fitness landscapes induced by FGM with two phenotypic dimensions ($L=3$ and $n=2$). The panels show the projection of the discrete genotype space onto the phenotype plane, where the phenotypic optimum is represented by a black symbol. In the left panel, the binary sequence notation for genotypes is indicated. The wild-type genotype 000, marked by a green $\blacktriangle$, is located at distance $Q$ from the phenotypic optimum. The nodes represented by red $\blacksquare$’s are local fitness maxima of the genotypic landscapes, as can be seen from the contour lines of constant fitness. In the right panel the mutant phenotypes overshoot the optimum, whereas in the left panel they do not.

As can be seen from the figure, the projection of the discrete genotype space onto the continuous phenotype space can give rise to multiple genotypic fitness maxima, although the phenotypic landscape is single peaked. It is the assumption of a finite (and hence discrete) set of phenotypic mutation vectors that distinguishes our setting from much of the earlier work on FGM, where mutations are drawn from a continuum of alleles [12] and the probability of further improvement (as given by Equation 1) vanishes strictly only at the phenotypic optimum. Remarkably, our analysis shows that the conventional setting is not simply recovered by taking the number of mutational vectors to infinity; rather, the number of genotypic fitness maxima is found to increase exponentially with $L$.
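The genotype-phenotype map and the definition of a genotypic fitness maximum can be made concrete with a small Python sketch. The wild type and the three displacement vectors below are hypothetical values chosen for illustration (they are not the ones underlying Figure 1), and the negative squared distance to the origin is used as a stand-in for fitness, which is sufficient because only the fitness ordering matters.

```python
from itertools import product


def phenotype(tau, wild_type, displacements):
    """Phenotype of genotype tau: wild type plus all displacements with tau_i = 1."""
    z = list(wild_type)
    for t, xi in zip(tau, displacements):
        if t:
            z = [zc + xc for zc, xc in zip(z, xi)]
    return z


def fitness_proxy(z):
    """Negative squared distance to the phenotypic optimum at the origin."""
    return -sum(c * c for c in z)


def local_maxima(wild_type, displacements):
    """Genotypes that are fitter than all L single-mutation neighbours."""
    L = len(displacements)
    f = {tau: fitness_proxy(phenotype(tau, wild_type, displacements))
         for tau in product((0, 1), repeat=L)}
    maxima = []
    for tau, value in f.items():
        neighbours = (tau[:i] + (1 - tau[i],) + tau[i + 1:] for i in range(L))
        if all(value > f[nb] for nb in neighbours):
            maxima.append(tau)
    return maxima


# Hypothetical example with n = 2 traits and L = 3 mutations.
wild_type = (2.0, 0.0)
displacements = [(-1.5, 1.0), (-1.5, -1.2), (-0.5, 0.3)]
print(local_maxima(wild_type, displacements))  # → [(0, 1, 1), (1, 1, 0)]
```

In this example two double mutants are local maxima even though the underlying phenotypic landscape has a single peak.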

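Since fitness decreases monotonically with the distance to the optimum phenotype, a natural proxy for fitness is the negative squared magnitude of the phenotype vector

$$-|\vec{z}(\tau)|^2 = -Q^2 - 2\sum_{i=1}^{L} \tau_i\, \vec{Q}\cdot\vec{\xi}_i - \sum_{i,j=1}^{L} \tau_i \tau_j\, \vec{\xi}_i\cdot\vec{\xi}_j, \tag{4}$$

where $\vec{a}\cdot\vec{b}$ denotes the scalar product between two vectors $\vec{a}$ and $\vec{b}$. This quantity is thus seen to consist of a part that is additive across loci with coefficients given by the scalar products $\vec{Q}\cdot\vec{\xi}_i$, and a pairwise epistatic part with coefficients $\vec{\xi}_i\cdot\vec{\xi}_j$.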

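It is instructive to decompose Equation 4 into contributions from the mutational displacements parallel and perpendicular to $\vec{Q}$. Writing $\vec{\xi}_i = \xi_i^{\parallel}\,\vec{Q}/Q + \vec{\xi}_i^{\perp}$ with $\vec{\xi}_i^{\perp}\cdot\vec{Q} = 0$, Equation 4 can be recast into the form

$$-|\vec{z}(\tau)|^2 = -\left(Q + \sum_{i=1}^{L} \tau_i\, \xi_i^{\parallel}\right)^2 - \left|\sum_{i=1}^{L} \tau_i\, \vec{\xi}_i^{\perp}\right|^2. \tag{5}$$

The first term on the right-hand side contains both additive and epistatic contributions associated with displacements along the $\vec{Q}$ direction. The second term is dominated by the diagonal contributions with $i = j$ and is of order $nL$ because $|\vec{\xi}_i^{\perp}|^2 = n - 1$ on average.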

We now show how the first term on the right-hand side of Equation 5 can be made to vanish for a range of $Q$. For this purpose, consider the subset of phenotypic displacement vectors for which the component $\xi_i^{\parallel}$ in the direction of $\vec{Q}$ is negative. There are on average $L/2$ such mutations, and the expected value of each component is

$$-\frac{2}{\sqrt{2\pi}} \int_0^{\infty} y\, e^{-y^2/2}\, dy = -\sqrt{\frac{2}{\pi}} \equiv -2 q_0, \tag{6}$$

where the factor 2 in front of the integral arises from conditioning on $\xi_i^{\parallel} < 0$. Setting $\tau_i = 1$ for $s$ out of these vectors and $\tau_i = 0$ for all other mutations, the sum inside the brackets in Equation 5 becomes approximately equal to $Q - 2 q_0 s$, which cancels the term $Q$ for $s = Q/(2q_0)$. Since $s$ can be at most $L/2$ in a typical realization, such genotypes can be constructed with a probability approaching unity provided $Q < q_0 L = L/\sqrt{2\pi}$.
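The reach of the parallel displacements can be checked numerically. The sketch below assumes unit-variance Gaussian parallel components, for which the mean magnitude of a negative component is $\sqrt{2/\pi}$, and sums all negative components of one random sample; the function name and sample size are illustrative.

```python
import math
import random


def max_parallel_reach(L, seed=0):
    """Largest distance toward the optimum that can be covered along the
    wild-type direction: minus the sum of all negative parallel components."""
    rng = random.Random(seed)
    parallel = [rng.gauss(0.0, 1.0) for _ in range(L)]
    return -sum(c for c in parallel if c < 0.0)


L = 10000
# The reach per locus should be close to 1 / sqrt(2 * pi), i.e. about 0.399.
print(max_parallel_reach(L) / L, 1.0 / math.sqrt(2.0 * math.pi))
```

A wild type at a distance well below this reach can therefore be brought close to the optimum along the parallel direction by a suitable subset of mutations, whereas a wild type far beyond it cannot.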

We will see below that the structure of the genotypic fitness landscapes induced by FGM depends crucially on whether or not the phenotypes of multiple mutants are able to closely approach the phenotypic optimum. Assuming that the contributions from the perpendicular displacements in Equation 5 can be neglected, which will be justified shortly, the simple argument given above shows that a close approach to the optimum is facile when $Q < L/\sqrt{2\pi}$, but becomes unlikely when $Q > L/\sqrt{2\pi}$. This observation hints at a possible transition between different types of landscape topographies at some value of $Q$ which is proportional to $L$. The existence and nature of this transition is a central theme of this article.

Scaling limits

Since we are interested in describing complex organisms with large phenotypic and genotypic dimensions, appropriate scaling relations have to be imposed to arrive at meaningful asymptotic results. Three distinct scaling limits will be considered.

  1. Fisher’s classic result (Equation 1) shows that the distance of the wild type from the phenotypic optimum has to be increased with increasing $n$ to maintain a nonzero fraction of beneficial mutations for $n \to \infty$. In our notation Fisher’s parameter is

    $$x = \frac{n}{2Q} \tag{7}$$

    and hence Fisher scaling implies taking $n, Q \to \infty$ at fixed ratio $n/Q$. We will extend Fisher’s analysis by computing the probability of sign epistasis between pairs of mutations for fixed $x$ and large $n$, which amounts to characterizing the shape of genotypic fitness landscapes of size $L = 2$.

  2. We have argued above that the distance toward the phenotypic optimum that can be covered by typical multiple mutations is of order $L$, and hence the limit $L \to \infty$ is naturally accompanied by a limit $Q \to \infty$ at fixed ratio

    $$q = \frac{Q}{L}. \tag{8}$$

    From a biological point of view, one expects that $n \ll L$, which motivates considering the limit $L \to \infty$ at constant phenotypic dimension $n$. Under this scaling, the first term on the right-hand side of Equation 5 is of order $L^2$, whereas the contribution from the perpendicular displacements is only $O(L)$. Thus in this regime the topography of the fitness landscape is determined mainly by the one-dimensional mutational displacements in the $\vec{Q}$ direction, which is reflected by the fact that the genotypic complexity is independent of $n$ to leading order and coincides with its value for the case $n = 1$, in which the perpendicular contribution in Equation 5 does not exist (see Results).

  3. By contrast, the perpendicular displacements play an important role when both the phenotypic and genotypic dimensions are taken to infinity at fixed ratio

    $$\alpha = \frac{n}{L}. \tag{9}$$

    Combining this with the limit $Q \to \infty$ at fixed $q$, both terms on the right-hand side of Equation 5 are of the same order $O(L^2)$. Fisher’s parameter (Equation 7) is then also a constant given by $x = \alpha/(2q)$.

Preliminary considerations about genotypic fitness maxima

To set the stage for the detailed investigation of the number of genotypic fitness maxima in Results, it is useful to develop some intuition for the behavior of this quantity based on the elementary properties of FGM that have been described so far. For this purpose we consider the probability $P_{\mathrm{wt}}$ for the wild type to be a local fitness maximum, which is equal to the probability that all the $L$ mutations are deleterious. Since mutations are statistically independent, we have

$$P_{\mathrm{wt}} = (1 - P_b)^L = \left[\frac{1}{2}\left(1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)\right]^L, \tag{10}$$

where $\mathrm{erf}$ is the error function. Under the (highly questionable) assumption that this estimate can be applied to all genotypes in the landscape, we arrive at the expression

$$\mathcal{N} \approx 2^L P_{\mathrm{wt}} = \left[1 + \mathrm{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]^L \tag{11}$$

for the expected number of genotypic fitness maxima.

Consider first the scaling limit 2, where $x = n/(2qL) \to 0$. Expanding the error function for small arguments as $\mathrm{erf}(y) \approx 2y/\sqrt{\pi}$ we obtain

$$\mathcal{N} \approx \left[1 + \sqrt{\frac{2}{\pi}}\,\frac{n}{2qL}\right]^L \to \exp\!\left(\frac{n q_0}{q}\right) \tag{12}$$

for $L \to \infty$, where $q_0 = 1/\sqrt{2\pi}$ was defined in Equation 6. We will show below that this expression correctly captures the asymptotic behavior for very large $q$ but generally grossly underestimates the number of maxima. The reason for this is that for moderate values of $q$ (in particular for $q < q_0$), the relevant mutant phenotypes are much closer to the origin than the wild type, which entails a mechanism for generating a large number of fitness maxima that grows exponentially with $L$.

Such an exponential dependence on $L$ is expected from Equation 11 in the scaling limit 3, where $x$ is a nonzero constant and the expression in the square brackets is larger than 1. Although this general prediction is confirmed by the detailed analysis for this case, the behavior of the number of maxima predicted by Equation 11 will again turn out to be valid only when $q$ is very large. In particular, whereas Equation 11 is an increasing function of the ratio $\alpha = n/L$ for any $q$, we will see below that the expected number of maxima actually decreases with increasing phenotypic dimension (hence increasing $\alpha$) in a substantial range of $q$. In qualitative terms, this can be attributed to the effect of the perpendicular displacements in Equation 5, which grows with $\alpha$ and makes it increasingly more difficult for the mutant phenotypes to closely approach the origin.
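The naive estimate discussed above is easy to evaluate numerically. The following sketch assumes that the wild type is a local maximum with probability $[(1 + \mathrm{erf}(x/\sqrt{2}))/2]^L$, with $x = n/(2Q)$, and that this probability is applied to all $2^L$ genotypes; it then compares the result with the limiting value $\exp[n/(q\sqrt{2\pi})]$ expected at fixed $n$ and fixed $q = Q/L$. Function names and parameter values are illustrative.

```python
import math


def naive_number_of_maxima(n, Q, L):
    """Naive estimate 2^L * P_wt, written as [1 + erf(x / sqrt(2))]^L
    to avoid overflow of 2^L for large L."""
    x = n / (2.0 * Q)
    return (1.0 + math.erf(x / math.sqrt(2.0))) ** L


def fixed_n_limit(n, q):
    """Large-L limit of the naive estimate at fixed n and fixed q = Q / L."""
    return math.exp(n / (q * math.sqrt(2.0 * math.pi)))


n, q = 2, 0.5
for L in (10, 100, 10000):
    print(L, naive_number_of_maxima(n, q * L, L), fixed_n_limit(n, q))
```

The estimate stays finite as $L$ grows at fixed $n$ and $q$, which is the behavior that the main text identifies as a gross underestimate for moderate $q$.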

The observation that the number of genotypic fitness maxima grows exponentially with $L$ in most cases motivates us to make use of the corresponding growth rate as a measure of the ruggedness of the landscape. We therefore define the genotypic complexity $\Sigma^*$ through the limiting relation

$$\Sigma^* = \lim_{L \to \infty} \frac{\ln \mathcal{N}}{L}, \tag{13}$$

where $\mathcal{N}$ is the average number of genotypic fitness maxima and $L$ is the sequence length. Since the total number of binary genotypes is $2^L$, the complexity is bounded from above by $\ln 2$. If any genotype had the same probability $P_{\max}$ of being a fitness maximum (which is in fact not the case for FGM), we could write $\mathcal{N} = 2^L P_{\max}$ and hence $\Sigma^* = \ln 2 + \lim_{L \to \infty} L^{-1} \ln P_{\max}$.

Data availability

The authors state that all data necessary for confirming the conclusions presented in the article are represented fully within the article. All numerical calculations including simulations described in this work were implemented in Mathematica and C++. When counting the number of local genotypic maxima, we checked all genotypes and counted the exact number for a randomly realized landscape, then took an average. All relevant source codes are available upon request.
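The exhaustive counting procedure described above can be sketched in a few lines of Python. This is not the authors' Mathematica or C++ code; it is a minimal illustration that assumes unit-variance Gaussian displacements, a wild type placed on the first coordinate axis at distance $Q$, and the negative squared distance to the origin as the fitness proxy. All names and parameter values are illustrative.

```python
import math
import random


def count_maxima(n, L, Q, rng):
    """Exact number of local fitness maxima in one random landscape."""
    xi = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(L)]
    size = 1 << L
    # Genotypes are bit masks; each phenotype is built from the genotype
    # obtained by clearing the lowest set bit.
    z = [None] * size
    z[0] = [Q] + [0.0] * (n - 1)
    for g in range(1, size):
        i = (g & -g).bit_length() - 1  # index of the lowest set bit
        z[g] = [a + b for a, b in zip(z[g & (g - 1)], xi[i])]
    dist2 = [sum(c * c for c in v) for v in z]
    # A maximum is closer to the optimum than all L one-mutant neighbours.
    return sum(
        all(dist2[g] < dist2[g ^ (1 << i)] for i in range(L))
        for g in range(size)
    )


def mean_number_of_maxima(n, L, Q, samples=50, seed=0):
    """Average of the exact count over independent random landscapes."""
    rng = random.Random(seed)
    return sum(count_maxima(n, L, Q, rng) for _ in range(samples)) / samples


n, L, Q = 2, 10, 3.0
mean = mean_number_of_maxima(n, L, Q)
print(mean, math.log(mean) / L)  # mean count and finite-L growth rate
```

The second printed number is only a finite-$L$ proxy for the exponential growth rate of the number of maxima; exhaustive enumeration scales as $L\,2^L$ and is therefore limited to small $L$.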

Results

Preliminary note

In the following sections our results on the structure of genotypic fitness landscapes induced by FGM are stated in precise mathematical terms and the key steps of their derivation are outlined, with some technical details relegated to the appendices. To facilitate the navigation through the inevitable mathematical formalism, we display the definitions of the most commonly used mathematical symbols in Table ?. Moreover, we provide numbered summaries at the end of each subsection which state the main results without resorting to mathematical expressions.

Sign epistasis

Random mutations:

We first study the local topography of the fitness landscape around the wild type, focusing on the epistasis between two random mutations with phenotypic displacements $\vec{\xi}_1$ and $\vec{\xi}_2$. Since fitness is determined by the magnitude of a phenotypic vector, i.e., the distance of the phenotype from the origin, the epistatic effect of the two mutations can be understood by analyzing how the magnitudes of the four vectors $\vec{Q}$, $\vec{Q} + \vec{\xi}_1$, $\vec{Q} + \vec{\xi}_2$, and $\vec{Q} + \vec{\xi}_1 + \vec{\xi}_2$ are ordered. To this end, we introduce the quantities $R_1$, $R_2$, and $R$

where division by $n$ guarantees the existence of a finite limit for $n \to \infty$. The sign of these quantities determines whether a mutation is beneficial or deleterious. For example, if , the mutation is beneficial; if the two mutations combined together confer a deleterious effect; and so on. We will see later that $R_1$ and $R_2$ are actually closely related to the selection coefficients of the respective mutations.

We proceed to express the different types of pairwise epistasis defined by [56] and [38] in terms of conditions on the quantities defined in Equation 14. Without loss of generality we assume and consider first the case where both mutations are beneficial, . Then magnitude epistasis (ME), the absence of sign epistasis, applies when the fitness of the double mutant is higher than that of each of the single mutants, i.e., . Similarly, for two deleterious mutations the condition for ME reads . When one mutant is deleterious and the other beneficial, in the case of ME, the double mutant fitness has to be intermediate between the two single mutants, which implies that when .

The condition for RSE reads when both single mutants are beneficial and when both are deleterious, and the remaining possibility corresponds to SSE between two mutations of the same sign. If the two single mutant effects are of different signs, RSE is impossible and SSE applies when or . Figure 2 depicts the different categories of epistasis as regions in the plane. Note that the corresponding picture for is obtained by exchanging .

Figure 2: Domains in the $(R_2, R)$ plane contributing to different types of epistasis: ME, SSE, and RSE. The two panels illustrate the two cases: (A) $R_1 > 0$ and (B) $R_1 < 0$. The red solid lines indicate $R = R_1 + R_2$. The labeling of the domains $D_1, \ldots, D_6$ is used in the derivation in the appendix.

To find the probability of each epistasis, we require the joint probability density . In Appendix A it is shown that

which can be obtained rather easily by resorting to the CLT. The applicability of the CLT follows from the fact that and are sums of a large number of independent terms for [52]. According to the CLT, it is sufficient to determine the first and second cumulants of these quantities. Denoting averages by angular brackets, we find the mean , the variance , and the covariance (). Similarly, the corresponding quantities evaluated for are , , and (). With an appropriate normalization constant, this leads directly to Equation 15.

As a first application, we rederive Fisher’s Equation 1 by integrating over the region for all and , which indeed yields

An immediate conclusion from the form of the joint density is that it is unlikely to observe sign epistasis for large $n$, because the density becomes concentrated along the line $R = R_1 + R_2$ as $n$ increases. As can be seen in Figure 2, this line touches the region of SSE in one point for , whereas it maintains a finite distance to the region of RSE everywhere. This indicates that the probability of RSE decays more rapidly with increasing $n$ than the probability of SSE. Moreover, one expects the latter probability to be proportional to the width of the region around the line $R = R_1 + R_2$, where the joint probability in Equation 15 has appreciable weight, which is of order $n^{-1/2}$.

To be more quantitative, we need to integrate over the domains in Figure 2 corresponding to the different categories of epistasis. In ?, we obtain the asymptotic expressions

and

for the probabilities of RSE ($P_\mathrm{RSE}$) and SSE ($P_\mathrm{SSE}$). Due to the nonlinearity of the phenotype-fitness map, FGM does not allow for strictly nonepistatic combination of fitness effects. The probability of ME, therefore, is given by $P_\mathrm{ME} = 1 - P_\mathrm{RSE} - P_\mathrm{SSE}$. Interestingly, the probability of sign epistasis varies nonmonotonically with $x$. The analytic results are compared with simulations in Figure ?, which shows excellent agreement.
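A simulation of this kind can be sketched as follows. The classification works directly on the fitness ordering of the four genotypes (wild type, two single mutants, double mutant): a mutation shows sign epistasis if the sign of its fitness effect differs between the two backgrounds. The sketch assumes unit-variance Gaussian displacements, Fisher scaling with $Q = n/(2x)$, and the negative squared distance to the origin as the fitness proxy; names and parameter values are illustrative and not taken from the original simulations.

```python
import random


def epistasis_type(f00, f10, f01, f11):
    """Classify a pair of mutations from the four fitness values."""
    flip_1 = (f10 - f00) * (f11 - f01) < 0  # mutation 1 changes sign
    flip_2 = (f01 - f00) * (f11 - f10) < 0  # mutation 2 changes sign
    if flip_1 and flip_2:
        return "RSE"
    if flip_1 or flip_2:
        return "SSE"
    return "ME"


def fitness_proxy(Q, *mutations):
    """Negative squared distance to the optimum of the phenotype
    (Q, 0, ..., 0) plus the sum of the given displacement vectors."""
    n = len(mutations[0]) if mutations else 1
    total = 0.0
    for k in range(n):
        c = (Q if k == 0 else 0.0) + sum(m[k] for m in mutations)
        total += c * c
    return -total


def epistasis_fractions(n, x, trials=5000, seed=0):
    """Fractions of ME, SSE, and RSE among random pairs of mutations."""
    rng = random.Random(seed)
    Q = n / (2.0 * x)  # wild-type distance under Fisher scaling
    counts = {"ME": 0, "SSE": 0, "RSE": 0}
    for _ in range(trials):
        xi1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xi2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        kind = epistasis_type(
            fitness_proxy(Q),
            fitness_proxy(Q, xi1),
            fitness_proxy(Q, xi2),
            fitness_proxy(Q, xi1, xi2),
        )
        counts[kind] += 1
    return {k: v / trials for k, v in counts.items()}


print(epistasis_fractions(n=50, x=0.5))
```

Repeating the call for several values of $n$ at fixed $x$ gives a direct check of how quickly the two sign-epistasis fractions decay with the phenotypic dimension.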

Comparison of analytic results for the probability of epistasis with simulations. Depicted are probabilities of SSE ($P_\mathrm{SSE}$) and RSE ($P_\mathrm{RSE}$) between two randomly chosen mutations among nearest neighbor genotypes of the wild type (A) as functions of $n$ for fixed Fisher parameter $x=0.5$ and (B) as functions of $x$ for fixed phenotypic dimension $n=640$. For each parameter set, $10^4$ randomly generated landscapes were analyzed. The asymptotic expressions provide accurate approximations even for moderate $n>10$. The nonmonotonic behavior with respect to $x$ means that the probabilities are nonmonotonic functions of $Q$ for fixed $n$ and vice versa.

Similarly, we can calculate the probabilities of sign epistasis conditioned on both mutations being beneficial, which in our setting means . The conditioning requires normalization by the unconditional probability of two random mutations being beneficial, which is given by the square of in Equation 1. Hence

and

where denotes the integral of the joint probability density over the domain in Figure 2 (see ?).

As anticipated from the form of Equation 15, the fraction of sign-epistatic pairs of mutations decreases with increasing phenotypic dimension $n$, and this decay is faster for RSE ($\sim n^{-1}$) than for SSE ($\sim n^{-1/2}$). At first glance this might seem to suggest that FGM has little potential for generating rugged genotypic fitness landscapes. However, as we will see below, the results obtained in this section apply only to the immediate neighborhood of the wild-type phenotype. They are modified qualitatively in the presence of a large number of mutations that are able to substantially displace the phenotype and allow it to approach the phenotypic optimum.

Mutations of fixed effect size:

As a slight variation to the previous setting, one may consider the fraction of sign epistasis conditioned on the two single mutations to have the same selection strength, as recently investigated by [45]. In our notation this implies that , and it is easy to see that sign epistasis is always reciprocal in this case. If the two mutations are beneficial, , and the condition for (reciprocal) sign epistasis is . The corresponding probability is

Following the same procedure for deleterious mutations () one finds that the probability is actually symmetric around and hence depends only on .

To express the probability of RSE in terms of the selection coefficient of the single mutations, we introduce a Gaussian phenotypic fitness function of the form

$$W(\vec{y}) = W_0 \exp\left(-\lambda |\vec{y}|^2\right), \tag{20}$$

where $\lambda$ is a measure for the strength of selection. The selection coefficient of a mutation with phenotypic effect $\vec{\xi}$ is then given by

$$S = \ln \frac{W(\vec{Q} + \vec{\xi})}{W(\vec{Q})} = \lambda \left(Q^2 - |\vec{Q} + \vec{\xi}|^2\right). \tag{21}$$

To fix the value of $\lambda$ we note that the largest possible selection coefficient, which is achieved for mutations that reach the phenotypic optimum, is $S_0 = \lambda Q^2$, and hence is related to the selection coefficient through . With this substitution, the result in Equation 19 becomes

The probability of sign epistasis conditioned on selection strength takes on its maximal value in the neutral limit $S \to 0$ and decreases monotonically with $S$. Similar to the results of Equations 16, 17, and 18 for unconstrained mutations, it also decreases with increasing phenotypic dimension $n$ when and are kept fixed.

Figure 3: Probability of RSE $\tilde{P}_\mathrm{RSE}$ conditioned on the selection coefficients $S$ of the two single mutations to be equal and positive: (A) for the full range of $S$ on a linear scale and (B) for $S/S_0$ smaller than 0.2 on a semilogarithmic scale. Here, the fitness of a phenotype $\vec{y}$ is assumed to be $W(\vec{y}) = W_0 \exp(-\lambda |\vec{y}|^2)$, where the parameter $\lambda$ is related to the maximal beneficial selection coefficient $S_0$ through the relation $S_0 = \lambda Q^2$. Dashed lines depict the asymptotic expression Equation 22, and solid lines were obtained numerically using the Gaussian approximation for the distribution of epistasis developed in previous work.

In a previous numerical study carried out at finite and , it was found that \tilde{P}_\mathrm{RSE} varies nonmonotonically with S for the case of beneficial mutations, and displays a second peak at the maximum selection coefficient S = S_0 [45]. The two peaks were argued to reflect the two distinct mechanisms giving rise to sign epistasis within FGM [3]. Mutations of small effect correspond to phenotypic displacements that proceed almost perpendicularly to the direction of the phenotypic optimum, and sign epistasis is generated through antagonistic pleiotropy. On the other hand, for mutations of large effect, the dominant mechanism for sign epistasis is overshooting of the phenotypic optimum. Because of the Fisher scaling implemented in this section with at fixed , the second class of mutations cannot be captured by our approach and only the peak at small S remains. Figure 3A shows the full two-peak structure for a few representative values of , and Figure 3B illustrates the convergence to the asymptotic expression in Equation 22 for the left peak. Using the results of [45], it can be shown that the right peak becomes a step function for , displaying a discontinuous jump from to at .
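The overshooting mechanism can be made concrete with a small classifier for a pair of mutations. This is a minimal sketch: it assumes additive phenotypic effects, uses the negative squared distance to the optimum as a stand-in for fitness (any decreasing function of the distance gives the same signs), and adopts the standard definitions of simple and reciprocal sign epistasis. The function name and the example vectors are hypothetical.

```python
import numpy as np

def classify_sign_epistasis(wild_type, xi_a, xi_b):
    """Return 'reciprocal', 'simple', or 'none' for a pair of mutations."""
    q = np.asarray(wild_type, dtype=float)
    a = np.asarray(xi_a, dtype=float)
    b = np.asarray(xi_b, dtype=float)

    def fitness(z):
        # Any monotonically decreasing function of |z| yields the same signs.
        return -float(np.dot(z, z))

    f00, f10, f01, f11 = fitness(q), fitness(q + a), fitness(q + b), fitness(q + a + b)

    # A mutation shows sign epistasis if the sign of its effect depends on
    # whether the other mutation is present in the background.
    sign_change_a = (f10 - f00) * (f11 - f01) < 0
    sign_change_b = (f01 - f00) * (f11 - f10) < 0

    if sign_change_a and sign_change_b:
        return "reciprocal"
    if sign_change_a or sign_change_b:
        return "simple"
    return "none"

# Two large steps toward the optimum: each is beneficial alone, but together
# they overshoot, so both change sign in the presence of the other.
overshoot = classify_sign_epistasis([1.0], [-0.9], [-0.9])

# Two small steps far from the optimum: both remain beneficial on either background.
far_away = classify_sign_epistasis([10.0, 0.0], [-1.0, 0.0], [-1.0, 0.0])
```

In the first example the double mutant lies farther from the optimum than either single mutant, which is the overshooting scenario described above; in the second the fitness function is nearly linear along the path and no sign epistasis arises.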

Summary 1:

When the phenotypic dimension is large and the Fisher parameter is moderate, the probability of RSE decays as , while that of SSE decays as . Although these probabilities decrease monotonically with at fixed , they have a nonmonotonic behavior as a function of : For small they increase with and for large they decrease with (see Figure ?). Under the pleiotropic scaling adopted in this work, this implies that the probabilities are nonmonotonic functions of the wild-type distance at fixed and vice versa. In contrast, under the total effect model, where both the wild-type distance and scale as , the probabilities decrease monotonically and exponentially with .

Genotypic complexity at a fixed phenotypic dimension

In this section, we are interested in the number of local maxima in the genotypic fitness landscape. We focus on the expected number of maxima, which we denote by \mathcal{N}, and analyze how this quantity behaves in the limit of large genotypic dimension, L \to \infty, when the phenotypic dimension n is fixed. For the sake of clarity, the (unique) maximum of the phenotypic fitness landscape will be referred to as the phenotypic optimum throughout.

The number of local fitness maxima:

Since fitness decreases monotonically with the distance to the phenotypic optimum, a genotype is a local fitness maximum if the corresponding phenotype defined by Equation 3 satisfies

for all i. The phenotype vector appearing on the right-hand side of this inequality arises from \vec{z}(\tau), either by removing a mutation vector that is already part of the sum in Equation 3 (\tau_i = 1) or by adding a mutation vector that was not previously present (\tau_i = 0). The condition in Equation 23 is always fulfilled if \vec{z}(\tau) = 0, that is, if the phenotype is optimal, and we will see that in general the probability for this condition to be satisfied is larger the more closely the phenotype approaches the origin. A graphical illustration of the condition in Equation 23 is shown in Figure 4.
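The local-maximum condition translates directly into a test on the phenotypes of the single-mutation neighbors. The sketch below assumes that the phenotype is the wild-type vector plus the sum of the mutation vectors present in the genotype, and that genotypes are binary strings; the function names and the two-locus example are hypothetical.

```python
import numpy as np

def phenotype(tau, wild_type, xi):
    """Wild-type phenotype plus the mutation vectors selected by the binary genotype tau."""
    return np.asarray(wild_type, dtype=float) + np.asarray(tau) @ np.asarray(xi, dtype=float)

def is_local_maximum(tau, wild_type, xi):
    """True if every single-locus change moves the phenotype farther from the origin."""
    tau = np.asarray(tau)
    xi = np.asarray(xi, dtype=float)
    z = phenotype(tau, wild_type, xi)
    # +1 adds a mutation that is absent, -1 removes one that is present.
    signs = 1 - 2 * tau
    neighbors = z + signs[:, None] * xi
    return bool(np.all(np.sum(neighbors ** 2, axis=1) > np.dot(z, z)))

# Hypothetical example with L = 2 loci in n = 2 dimensions.
wild_type = np.array([1.0, 0.0])
xi = np.array([[-1.0, 0.0],   # carries the wild type exactly onto the optimum
               [0.0, 2.0]])   # purely transverse displacement

optimal_is_max = is_local_maximum([1, 0], wild_type, xi)
wild_type_is_max = is_local_maximum([0, 0], wild_type, xi)
```

The genotype carrying only the first mutation sits on the optimum and is therefore a maximum, whereas the wild type is not, because adding the first mutation brings it closer to the origin.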

Figure 4: Illustration of the condition for a genotype to be a local fitness maximum. The circle encloses phenotypes that have higher fitness than the focal phenotype \vec{z}(\tau). For \tau to be a genotypic fitness maximum, both a phenotype with a further mutation (dash-dotted green arrow) and a phenotype without one of the mutations in \tau (red segment and blue dotted arrows) should lie outside the circle.

The ability of a phenotype to approach the origin clearly depends on the number of mutant vectors it is composed of, and all phenotypes with the same number of mutations are statistically equivalent. The expected number of fitness maxima can therefore be decomposed as

where is the number of possible combinations of out of mutation vectors and is the probability that a genotype with mutations is a fitness maximum. The latter can be written as

with

Here and below, stands for the integral over .
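For a single realization, the decomposition of the number of maxima by the number of mutations they carry can be obtained by exhaustive enumeration. The following sketch is only feasible for small numbers of loci and is not the analytical route taken in the text; the function name and the example are hypothetical, and genotypes are encoded as the bits of an integer index.

```python
import numpy as np

def maxima_by_mutation_count(wild_type, xi):
    """Count local maxima of one landscape, grouped by their number of mutations."""
    wild_type = np.asarray(wild_type, dtype=float)
    xi = np.asarray(xi, dtype=float)
    n_loci = xi.shape[0]
    idx = np.arange(2 ** n_loci)
    # Row g holds the binary genotype encoded by the integer g.
    bits = (idx[:, None] >> np.arange(n_loci)) & 1
    z = wild_type + bits @ xi
    d2 = np.sum(z ** 2, axis=1)

    is_max = np.ones(2 ** n_loci, dtype=bool)
    for i in range(n_loci):
        # Flipping bit i gives the neighbor that differs at locus i.
        is_max &= d2 < d2[idx ^ (1 << i)]

    mutation_counts = bits.sum(axis=1)
    return np.bincount(mutation_counts[is_max], minlength=n_loci + 1)

wild_type = np.array([1.0, 0.0])
xi = np.array([[-1.0, 0.0],
               [0.0, 2.0]])
per_class = maxima_by_mutation_count(wild_type, xi)
total_maxima = int(per_class.sum())
```

Averaging the entry for a given number of mutations over many realizations and dividing by the corresponding binomial coefficient gives an empirical estimate of the probability that a genotype with that many mutations is a fitness maximum.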

Equation 25 can be understood as follows. First, the \delta function constrains to be the phenotype of \tau as defined in Equation 3. Next, the integration domains of the ’s reflect the condition in Equation 23. Assuming, without loss of generality, that the genetic loci are ordered such that for and for , the maximum condition for requires , so the integration domain should be ; whereas for the condition is , corresponding to the integration domain . Using the integral representation of the \delta function

we can write

where

It was argued on qualitative grounds in the Model section that phenotypes that approach arbitrarily close to the origin are easily generated when the scaled wild-type distance is small, but they become rare for large . As a consequence, it turns out that the main contribution to the integral over in Equation 27 comes from the region around the origin for small , but shifts to a distance along the direction for large . To account for this possibility, it is necessary to divide the integration domain into two parts, and , where is an arbitrary nonzero number with as . Thus, we write as

where

and correspondingly define and as

The total number of local maxima is then .

Figure 5: Plots of the mean number of local maxima \mathcal{N} as a function of the genotypic dimension L for q = 0, 0.2, 0.4, and 0.6 with n = 1 on a semilogarithmic scale. Data from numerical simulations are represented as dots, and the analytical prediction of Equation 37 is shown as solid lines. Each dot represents the average over 10^5 realizations of landscapes. In this parameter regime, \mathcal{N} grows exponentially with L and the growth rate (i.e., the slopes of the lines) decreases with increasing q.

Regime I:

We first consider . Expanding around the origin , we show in ? that

For an interpretation of Equation 32 it is helpful to refer to Figure 4. Note first that the probability that lies in the ball with radius is

where is the volume of the ball. We need to estimate how small has to be for to be a local fitness maximum with an appreciable probability. Since the random vectors contributing to are statistically equivalent, it is plausible to assume that their average component parallel to is . We further assume that the conditional probability density of these vectors, conditioned on their sum reaching the ball around the origin, can be approximated by a Gaussian, which consequently has the form

For to be a phenotype vector of a local maximum, all these random vectors should lie in the region and the remaining (unconstrained) vectors should lie in . This event happens with probability

Thus, we can estimate the typical value of as the solution of

which, combined with Equation 33, indeed gives Equation 32.

Figure 6: Comparison of simulation results (symbols) of the mean number of local maxima \mathcal{N} with analytic approximations (lines) for q > q_c. Each symbol is the result of averaging over 2 \times 10^6 realizations. (A) \mathcal{N} is shown to increase with n for fixed q. (B) \mathcal{N} is shown to decrease with q for fixed n. (C) Deviation of the analytic expression from the simulation results, defined as 1 - \frac{\mathcal{N}_{\text{data}}}{\mathcal{N}_{\text{theory}}}, is depicted as a function of L on a double logarithmic scale. The phenotypic dimension for this panel is n = 4, where the largest deviations are observed in (A). The deviation decreases inversely with L as indicated by the black dashed line with slope -1.

To find the asymptotic behavior of for large , we use Stirling’s formula in Equation 31 and approximate the summation over by an integral over . This yields

where the exponent is given by

Under the condition , the remaining integral with respect to can be performed by expanding to second order around the saddle point determined by the condition

Performing the resulting Gaussian integral with respect to one finally obtains

where \rho^* is the solution of Equation 36, which is the (scaled) mean number of mutations in a local maximum. We will call \rho^* the mean genotypic distance. This solution is not available in closed form, but it can be shown that and for small . Figure 5 compares Equation 37 with the mean number of local maxima obtained by numerical simulations for various values of q with n = 1, and shows excellent agreement even for .
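A simulation of the kind summarized in Figure 5 can be sketched as follows. The sketch assumes that mutation vectors have independent standard normal components and that the wild type is placed at distance qL from the optimum along the first trait axis; both choices, along with the function names and the sample sizes, are assumptions made for illustration and should be checked against the model definition. Exhaustive enumeration restricts the sketch to small L.

```python
import numpy as np

def count_local_maxima(wild_type, xi):
    """Number of genotypes whose phenotype is closer to the origin than all L neighbors."""
    n_loci = xi.shape[0]
    idx = np.arange(2 ** n_loci)
    bits = (idx[:, None] >> np.arange(n_loci)) & 1
    d2 = np.sum((wild_type + bits @ xi) ** 2, axis=1)
    is_max = np.ones(2 ** n_loci, dtype=bool)
    for i in range(n_loci):
        is_max &= d2 < d2[idx ^ (1 << i)]
    return int(is_max.sum())

def mean_number_of_maxima(n_loci, n_traits, q, realizations, seed=0):
    """Average the number of local maxima over independent random landscapes."""
    rng = np.random.default_rng(seed)
    wild_type = np.zeros(n_traits)
    wild_type[0] = q * n_loci          # assumed scaling: distance = q * L
    total = 0
    for _ in range(realizations):
        xi = rng.standard_normal((n_loci, n_traits))   # assumed unit-variance effects
        total += count_local_maxima(wild_type, xi)
    return total / realizations

if __name__ == "__main__":
    for n_loci in (4, 6, 8, 10):
        estimate = mean_number_of_maxima(n_loci, n_traits=1, q=0.2, realizations=200)
        print(n_loci, estimate)
```

Plotting the logarithm of the estimates against the number of loci gives a rough empirical growth rate; far more realizations than used here would be needed to approach the precision of the published data.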

It is obvious that will eventually be negative as increases for any value of , and this must be true also for the maximum value . Indeed, we found the threshold q_c \approx 0.924809, above which is negative. This signals a phase transition in the landscape properties. Inspection of Equation 35 shows that the transition is driven by a competition between the abundance of genotypes with a certain number of mutations and their likelihood to bring the phenotype close to the optimum. The first two terms in the expression for are the standard sequence entropy [see, for example, [44]], which is maximal at (), whereas the last term represents the statistical cost associated with “stretching” the phenotype toward the origin. With increasing q, the genotypes contributing to the formation of local maxima become increasingly atypical, in the sense that they contain more than the typical fraction of mutations, and \rho^* increases. For q > q_c the cost can no longer be compensated by the entropy term and becomes negative. In this regime decreases exponentially with , and therefore the total number of fitness maxima , which by construction cannot be , must be dominated by the second contribution .
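The sequence entropy mentioned above is the standard large-L limit of the logarithm of a binomial coefficient per locus. As a brief numerical check, assuming the usual binary entropy form (the function names are hypothetical), the exact value can be compared with the Stirling approximation:

```python
import math

def sequence_entropy(rho):
    """Binary entropy -rho ln(rho) - (1 - rho) ln(1 - rho) in natural units."""
    if rho <= 0.0 or rho >= 1.0:
        return 0.0
    return -rho * math.log(rho) - (1.0 - rho) * math.log(1.0 - rho)

def log_binomial_per_locus(n_loci, n_mutations):
    """Exact ln C(L, s) / L, evaluated with log-gamma functions to avoid overflow."""
    log_binom = (math.lgamma(n_loci + 1)
                 - math.lgamma(n_mutations + 1)
                 - math.lgamma(n_loci - n_mutations + 1))
    return log_binom / n_loci

exact = log_binomial_per_locus(1000, 500)
approx = sequence_entropy(0.5)
```

The entropy is largest when half of the loci are mutated, where it equals ln 2, and the exact value differs from it only by a correction that vanishes as the number of loci grows.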

Regime II:

We defer the detailed derivation of to ? and here only report the final result obtained in the limit , which is independent of and reads

This expression is valid for , but it dominates the contribution for large only when q > q_c. Figure 6 indeed shows that Equation 38 approximates the mean number of local maxima for q > q_c, that is, converges to for large . This figure also shows, as is clear from Equation 38, that \mathcal{N} is an increasing (decreasing) function of n (q) for a fixed value of q (n). The expected number of maxima is small in absolute terms in this regime, which can be attributed to the fact that the expression inside the parentheses in Equation 38 takes the value at , and decreases rapidly toward unity for larger .

Figure 7: Plot of the genotypic complexity \Sigma^\ast as a function of the scaled phenotypic wild-type distance q. Here the phenotypic dimension n is kept finite while taking the genotypic dimension L to infinity. The complexity vanishes at the phase transition point q = q_c \approx 0.924809. Inset: Plot of the mean genotypic distance \rho^* of local maxima from the wild type as a function of q. Starting from 1/2, \rho^* increases with q for q<q_c and remains at 1/2 for q>q_c.

To understand the appearance of , we refer to the Model section, where it was argued that is the maximal distance toward the origin that can be covered by a phenotype made up of typical mutation vectors. Correspondingly, the analysis in ? shows that the main contribution to comes from phenotypes located at a distance from the origin, i.e., at a distance from the wild type. The sum over in Equation 31 is dominated by typical genotypes with , and therefore the main contribution to comes from phenotypes at a distance from the origin. The seeming divergence of as is an artifact of the approximation scheme, which assumes that the main contribution comes from the region where ; clearly this assumption becomes invalid when . We note that for very large and large , Equation 38 reduces to the expression obtained in Equation 11 on the basis of Fisher’s formula for the fraction of beneficial mutations from the wild-type phenotype.
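The distance that typical mutation vectors can cover toward the origin can be illustrated with a one-dimensional experiment. The sketch below assumes that the components of the mutation vectors along the wild-type direction are independent standard normal variables and that the phenotype is built from all mutations whose parallel component points toward the optimum. Under these assumptions the covered distance per locus approaches 1/\sqrt{2\pi} \approx 0.399; whether this value coincides with the threshold discussed in the text depends on the model definition and is not verified here. The function name is hypothetical.

```python
import numpy as np

def scaled_reach_toward_origin(n_loci, seed=0):
    """Distance per locus covered by summing all mutations that point toward the optimum."""
    rng = np.random.default_rng(seed)
    # Components of the mutation vectors along the wild-type direction (assumed N(0, 1)).
    parallel = rng.standard_normal(n_loci)
    # Keep only the mutations with a negative parallel component.
    return -parallel[parallel < 0].sum() / n_loci

reach = scaled_reach_toward_origin(200_000)
expected = 1.0 / np.sqrt(2.0 * np.pi)   # mean of the negative part of a standard normal
```

Roughly half of the loci are selected in this construction, which is in line with the statement that the relevant genotypes are typical ones.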

Figure 8: Coexistence of the two mechanisms I and II for q_0 < q < q_c. (A) Two-dimensional histogram of the number of fitness maxima and the average phenotypic distance of the maxima to the optimum within a single realization. Here L=15 and n=2 are used and 10^4 different landscapes are randomly generated for each value of q. Only a small number of realizations have a small average distance but these contribute an exceptionally large number of fitness peaks. (B) Two examples of genotype-phenotype maps selected from realizations with q=0.5, L=6, and n=2. The wild-type phenotype is marked by a green \blacktriangle and local fitness maxima by red \blacksquare’s. When the phenotypes of the local fitness maxima are close to (far away from) the origin, the number of maxima is large (small), which corresponds to mechanism I (II).

Phase transition:

To sum up, the leading behavior of is

with and given by Equation 37 and Equation 38, respectively. Since decreases to zero with in a power-law fashion at q = q_c, the dominant contribution at this value is . At q = q_c, the mean genotypic distance \rho^* jumps discontinuously from to 1/2; and the mean phenotypic distance , which is defined as the averaged magnitude of phenotype vectors for local maxima, jumps from to . The genotypic complexity defined in Equation 13 is given by

where \rho^* is the solution of Equation 36, and hence \Sigma^\ast vanishes continuously at q = q_c. These results are graphically represented in Figure 7. Recall that the value attained at is the largest possible, because the total number of genotypes is . Remarkably, these leading order results are independent of the phenotypic dimension. A dependence on n emerges at the subleading order, and it affects the number of fitness maxima in qualitatively different ways in the two phases. For q < q_c, the preexponential factor in Equation 37 is a power law in with exponent and hence decreases with increasing n; whereas the expression in Equation 38 describing the regime q > q_c increases exponentially with .

Interpretation:

The phase transition reflects a shift between two distinct mechanisms for generating genotypic complexity in FGM, which are analogous to the two origins of pairwise sign epistasis that were identified by [3] and discussed above in the Sign epistasis section. In regime I (q < q_c), the mutant phenotype closely approaches the origin and multiple fitness maxima are generated by overshooting the phenotypic optimum. By contrast, in regime II (q > q_c), the phenotypic optimum cannot be reached and the genotypic complexity arises from the local curvature of the fitness isoclines. These two situations are exemplified by the two panels of Figure 1. For the sake of brevity, in the following discussion we will refer to the two mechanisms as mechanism I and mechanism II, respectively.

The approach to the origin in regime I is a largely one-dimensional phenomenon governed by the components of the mutation vectors along the direction of the wild-type phenotype , which explains why the leading order behavior of the genotypic complexity is independent of n. For q < q_c, the n-dependence of the preexponential factor in Equation 37 arises from the increasing difficulty of the random walk formed by the mutational vectors to locate the origin in high dimensions. By contrast, mechanism II operating for q > q_c relies on the existence of the transverse dimensions, which is the reason why in Equation 38 is an increasing function of n with for .

When q_0 < q < q_c, both mechanisms seem to be present simultaneously. As our analysis is restricted to the average number of local maxima, at this point we cannot decide whether both mechanisms appear in a single realization of the fitness landscape, or if one of them dominates for a given realization. To answer this question, we generated fitness landscapes randomly for given parameter sets and identified all local maxima for each landscape. We then determined the number of local maxima and averaged the phenotypic distance of the local maxima to the optimum for each realization. This mean distance will be denoted by and is itself a random variable; it should not be confused with the mean phenotypic distance , which is calculated by taking an average over all fitness peaks in all realizations, giving the same weight to each peak. The results are depicted as a two-dimensional histogram in Figure 8A.
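The distinction between the per-realization mean distance and the peak-weighted mean distance can be made explicit in a small simulation. The sketch below assumes independent standard normal mutation effects and a wild type at distance qL along the first trait axis, as in an illustrative reading of the model; the function names, the choice L = 10, and the number of realizations are hypothetical and much smaller than the values used for Figure 8A.

```python
import numpy as np

def maxima_statistics(wild_type, xi):
    """Return the number of local maxima and their mean distance to the optimum."""
    n_loci = xi.shape[0]
    idx = np.arange(2 ** n_loci)
    bits = (idx[:, None] >> np.arange(n_loci)) & 1
    d2 = np.sum((wild_type + bits @ xi) ** 2, axis=1)
    is_max = np.ones(2 ** n_loci, dtype=bool)
    for i in range(n_loci):
        is_max &= d2 < d2[idx ^ (1 << i)]
    return int(is_max.sum()), float(np.sqrt(d2[is_max]).mean())

def sample_realizations(n_loci, n_traits, q, realizations, seed=0):
    """Collect (number of maxima, mean distance) pairs over random landscapes."""
    rng = np.random.default_rng(seed)
    wild_type = np.zeros(n_traits)
    wild_type[0] = q * n_loci                      # assumed scaling: distance = q * L
    counts, distances = [], []
    for _ in range(realizations):
        xi = rng.standard_normal((n_loci, n_traits))
        count, mean_distance = maxima_statistics(wild_type, xi)
        counts.append(count)
        distances.append(mean_distance)
    return np.array(counts), np.array(distances)

counts, distances = sample_realizations(n_loci=10, n_traits=2, q=0.5, realizations=200)

# Every landscape carries the same weight.
per_realization_mean = distances.mean()

# Every peak carries the same weight, so landscapes with many peaks dominate.
peak_weighted_mean = np.sum(counts * distances) / np.sum(counts)
```

A two-dimensional histogram of `counts` against `distances` gives a small-scale analogue of Figure 8A. If landscapes with many maxima tend to have maxima close to the optimum, the peak-weighted mean will lie below the per-realization mean.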

The figure shows that the marginal distribution of displays a pronounced peak around , which corresponds to the behavior that is typical of mechanism I. For most realizations, deviates significantly from zero and only a small number of landscapes have local maxima near . However, these landscapes have many more maxima than typical landscapes and therefore dominate the mean number of maxima . This shows that within a single realization the two mechanisms do not operate together and only a single mechanism is present. Since most realizations exhibit mechanism II, whereas the mean number of local maxima grows exponentially as expected for mechanism I, we conclude that mechanism I occurs rarely, but once it does, it generates a huge number of local maxima, which compensates for the low probability of occurrence. We may thus say that both mechanisms coexist for