# Is modularity the reason why recombination is so ubiquitous?

## Abstract

Homologous recombination is an important operator in the evolution of biological organisms. However, there is still no clear, generally accepted understanding of why it exists and under what circumstances it is useful. In this paper we consider its utility in the context of an infinite population haploid model with selection and homologous recombination. We define utility in terms of two metrics - the increase in frequency of fit genotypes, and the increase in average population fitness, relative to those associated with selection only. Explicitly, we explore the full parameter space of a two-locus two-allele system, showing, as a function of the landscape and the initial population, that recombination is beneficial in terms of these metrics in two distinct regimes: a relatively landscape independent regime - the *search* regime - where recombination aids in the search for a fit genotype that is absent or at low frequency in the population; and the *modular* regime, where recombination allows for the juxtaposition of fit “modules” or Building Blocks. Thus, we conclude that the ubiquity and utility of recombination is intimately associated with the existence of modularity and redundancy in biological fitness landscapes.

## 1 Introduction

The existence, prevalence and utility of genetic recombination is an old and enduring puzzle of biology [?]. Seminal works, such as [?] among others, have provided theoretical justifications that add to a long list of putative mechanisms that may account for recombination’s enduring role in most higher species. Classic [?] and more contemporary reviews [?] on the subject summarize many of these candidates. Even though the number of potential explanations is large, none of them has been found compelling enough to have settled the debate. Additionally, some older propositions have come under more scrutiny thanks to improved experimental data [?], and it has even been suggested that the hidden value of sexual recombination might not lie mainly in the improvement of genetic variability or fitness, or in its defining properties. As stated in [?]: “…it is generally accepted that the long-term maintenance and ubiquity of Eukaryotic sex cannot be explained as an approximate consequence of the inherent properties of sex itself.”, a position exemplified in [?], where it is suggested that recombination might serve mainly as a stabilizer of mitosis, and that any drawn benefit regarding genetic inheritance is circumstantial. The plethora of proposed models ranges from simple ones that are case based [?], to sophisticated simulations that incorporate many-locus, multiple allele genotypes, dynamic recombination rates and sites [?], different levels and types of epistasis, mutation, complex and variable fitness landscapes, etc. [?]. Studies typically focus on measuring the effects of recombination on average fitness, but others concentrate on other quantifiable benefits; [?], for example, reports the virtues of recombination regarding the exploration of the fitness landscape, while in [?] the change over generations of the genetic linkage distance between epistatic units is discussed; and [?] focuses on the mean time for a beneficial epistatic group of two alleles to appear on the same gamete with and without recombination. For a review on the experimental backing or counterevidence to theoretical explanations for the prevalence of recombination see [?].

Of course, if we are to understand the benefits of recombination in the context of a mathematical model, a requirement is that the model itself captures the very mechanisms by which it is useful in the first place. This then leads us to ask if the apparent inability to find an agreed universal advantage for recombination is due to the fact that the considered models are incapable of modeling the benefits - a defect of the model - or, rather, that the benefits are not transparent in the analyses of the models that have been studied. If the models themselves are inadequate then new models with new features must be developed. On the contrary, if the analyses themselves are at fault, one must understand why. In this paper we will start with the hypothesis that standard population genetics models are capable of showing universal mechanisms by which recombination is useful. However, by restricting to a simple two-locus two-allele model we will be able to exhaustively study the full parameter space of the model. We will show that the reason why universal mechanisms have been difficult to identify is twofold: that the benefits are more visible in terms of Building Blocks (subsets of loci defined by the recombination distribution) not genotypes, as in standard analyses, and that the benefits of recombination are particularly associated with “modular” landscapes which will be discussed below. Thus, we believe, the results of this paper link two fundamental concepts in biology - the utility and ubiquity of recombination with the existence of modularity.

## 2 Recombination - a Building Block Perspective

In this section we introduce the theoretical framework and the chief diagnostics we will use to examine the utility of recombination. As we are interested here in the interaction of selection and homologous recombination we will omit mutation. We will consider the evolution^{1}

where is the expected frequency of genotype at generation . In the first term on the right-hand side is the selection probability for the genotype . For proportional selection, which is the selection mechanism we will consider here, , where is the “survival” fitness^{2}

where is an indicator function that represents the conditional probability that the offspring genotype is formed given the parental genotypes and and the mask . For example, for two loci, , with binary alleles, and , , while . The contribution of a particular mask depends, as we can see, on all possible parental combinations. In this sense, , in the space of genotypes, is an exceedingly complicated function. In the case of diploids, the SWLD coefficient is equivalent to the functions of Nagylaki [?] and described in [?]. For a given target genotype and mask, is a matrix on the indices and associated with the parents. For binary alleles, for every mask there are possible combinations of parents that need to be checked to see if they give rise to the offspring . Nevertheless, only elements of the matrix are non-zero. The question is: which ones? Although, , or equivalently or , gives a complete summary of the effect of recombination in a given generation it is an exceedingly complicated function to analyze. However, the complication of in terms of genotypes is just an indication of the fact that the latter are not a natural basis for describing the action of recombination.

A more appropriate basis is the Building Block Basis (BBB) [?], wherein only the Building Block (BB) schemata that contribute to the formation of a genotype enter. In this case ^{3}

where is the selection probability of the BB and is the complementary block such that . Both blocks are uniquely specified by the associated recombination mask, . For instance, for three loci, , if and then and , where is the canonical “wildcard” symbol, familiar from Evolutionary Computation, indicating that the corresponding locus has been summed over thus leading to marginal probabilities. Thus, the probability for the schema is . The selection probability for the BB schema is , where the fitness of is and depends on the actual composition of the population. It is important to emphasize that the SWLD is distinct from the well-known linkage disequilibrium coefficient, , which depends only on the allele frequencies and the crossover mask , and *not* on the fitness landscape. In the case of a flat fitness landscape, , but not otherwise. In particular, a population at linkage equilibrium with does not necessarily satisfy . Selection effects generally move the system away from the Geiringer or Robbins manifold [?], which is the set of points in the space of populations defined by . In terms of BBs,

with and being the frequencies, not the selection probabilities, of the BBs and . Therefore, in linkage equilibrium implies , i.e., the probability to find any genotype is the same as the product of the probabilities to find its constituent BBs. Thus, at linkage equilibrium the SWLD coefficient is given by

Note that the structure of is particularly simple when both and are BB schemata. For a given and one unique BB, , is picked out. The second BB then enters as the complement of in . This means that is skew diagonal on the indices and , with only one non-zero element on that skew diagonal for a given and . At a particular locus of the offspring, the associated allele is taken from the first or second parent according to the value of . If it is taken from the first parent, then the corresponding allele in the second parent is immaterial. As seen above, this fact is represented by the normal schema wildcard symbol . It is important to emphasize that the BBs form an alternative basis to that of the genotypes. This means that genetic dynamics can not only potentially be described without any reference to genotypes but also that with the dynamics of the BBs the dynamics of any and all genotypes can be derived. For instance, for two loci with binary alleles, and , the possible genotypes are , , and . The corresponding BBs are , , and , where we arbitrarily chose the genotype as the type around which to develop the BBB. The relationship between the two bases is given by

where

is the coordinate transformation matrix that transforms from one basis to another. As bases, the genotype and BBB have equivalent dynamics. However, the dynamics of recombination is fundamentally simpler in the BBB due to the immense simplification of in the latter. In other words, just as Walsh/Fourier modes [?] are the natural basis for describing mutation, so BB schemata are the natural basis for describing homologous recombination. They are the natural effective degrees of freedom of any genetic system with recombination.
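As a minimal sketch of this change of basis for two loci, the snippet below maps genotype frequencies to Building Block frequencies and back. The ordering of both bases and the names `GENOTYPE_TO_BB`, `to_bb` and `to_genotypes` are assumptions made for illustration; they are not taken from the original equations.

```python
import numpy as np

# Assumed ordering (not fixed by the text):
# genotypes (11, 10, 01, 00), building blocks (11, 1*, *1, **).
GENOTYPE_TO_BB = np.array([
    [1, 0, 0, 0],   # P(11) = P_11
    [1, 1, 0, 0],   # P(1*) = P_11 + P_10
    [1, 0, 1, 0],   # P(*1) = P_11 + P_01
    [1, 1, 1, 1],   # P(**) = sum over all genotypes = 1
], dtype=float)

def to_bb(p_genotypes):
    """Genotype frequencies -> building-block (schema) frequencies."""
    return GENOTYPE_TO_BB @ np.asarray(p_genotypes, dtype=float)

def to_genotypes(p_bb):
    """Building-block frequencies -> genotype frequencies (inverse map)."""
    return np.linalg.solve(GENOTYPE_TO_BB, np.asarray(p_bb, dtype=float))

p = np.array([0.1, 0.2, 0.3, 0.4])   # illustrative population
bb = to_bb(p)
back = to_genotypes(bb)
```

Because the matrix is triangular with unit diagonal it is always invertible, which is one way to see that the two bases carry the same information.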

From Equation (1) for the time evolution of the probability distribution for the system, we may derive the time evolution of any derived quantity, such as the average population fitness, which is given by

### 2.1 Why Recombination?

As mentioned in the introduction, a great amount of work has been done on trying to understand why recombination is ubiquitous. Here, rather than trying to understand the potential benefits of homologous recombination at the most general phenomenological or conceptual level, we will restrict attention to what we may deduce purely from its mathematical representation in equation (1). Of course, it may be that the benefits of recombination are not manifest in this model. However, given that the model is the generally accepted framework for classical population genetics, it behooves us to at least use it as a starting point. Further, we will analyze the model concentrating on two simple metrics for measuring the benefits of recombination, asking: i) under what circumstances can recombination lead to the generation of a higher frequency of a fit offspring than would be the case with only selection? and, relatedly, ii) under what circumstances can recombination lead to a larger increase in the average population fitness relative to selection only? From equations (1) and (5) we see that it is the SWLD coefficient that quantifies the effect in both cases.

From equation (1), we can see that if the SWLD coefficient for a genotype is negative then recombination leads, on average, to a higher frequency of that genotype than in its absence. In other words, in this circumstance, recombination yields more of the genotype than would otherwise be present. Conversely, if the coefficient is positive, recombination provides less of the genotype of interest than would be the case in its absence. With this in mind, as mentioned, we will consider two complementary metrics to evaluate the utility of recombination in time: the change in number of optimal genotypes from one generation to the next and the change in average population fitness. In the infinite population limit, the former is given by

For fitness-proportional selection,

The first term on the right-hand side is the increase in the number of optimal genotypes due to the effect of selection only and the second term the contribution due to recombination. Now passing to the average population fitness, we can consider two reference points for measuring the effect of recombination relative to selection. The first is to consider in the infinite population limit

where, once again, the first term on the right-hand side is the contribution from selection only, and corresponds to Fisher’s Fundamental Theorem, while the second term is the contribution from recombination. In both and , we are considering metrics that measure the relative contribution of recombination generation by generation, *not* the cumulative effect of recombination versus selection. As a measure of the latter we consider

Thus, if is positive then the average fitness of the population evolving in the presence of recombination and selection (s+r) is higher than that of the same population evolving in the presence of selection only (s).

For both generation-by-generation metrics, the qualitative contribution of recombination is purely controlled by the sign of the SWLD coefficient. Recombination increases the frequency of a fit genotype relative to the case of selection only, passing from one generation to the next, if and only if the coefficient for that genotype is negative, with its sign and magnitude fixed completely by the fitness landscape and the actual population. So, whether recombination is beneficial or not passing from one generation to another, in this sense, is equally fixed by the fitness landscape and the actual population. Similarly, the increase in the average population fitness from one generation to the next, relative to selection only, is controlled by the fitness-weighted average of the SWLD coefficient and, hence, once again, by the fitness landscape and the current population. However, in the case of the cumulative measure we see that the potential contribution of recombination is more subtle as besides the explicit term there is also the effect of the difference between and which depends implicitly on .

So, once again, we are led to ask first: when is the SWLD coefficient negative? The answer is when the probability to select the genotype is less than the probability to select its component BB schemata, where the action of recombination is modeled to be such that the blocks are selected independently. There are several distinct regimes in which this holds, which we will explore further and which categorize the different conditions under which homologous recombination can be deemed useful. First, there is the regime in which the genotype is non-existent, or at a very small frequency, in the actual population. In this case the coefficient is negative directly and then, remembering that we are neglecting the effects of mutation, recombination is the only mechanism by which the genotype can be generated. This regime emphasizes the search property of recombination, independent of the fitness landscape.
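The search regime can be illustrated with a short sketch for two loci. It assumes proportional selection and takes the SWLD coefficient to be the selection probability of the target genotype minus the product of the selection probabilities of its two building blocks; the function names, the dictionary layout and the numerical landscapes are hypothetical.

```python
def selection_probabilities(freqs, fitness):
    """Proportional selection: P'_I = f_I * P_I / (mean fitness)."""
    mean_fitness = sum(fitness[g] * freqs[g] for g in freqs)
    return {g: fitness[g] * freqs[g] / mean_fitness for g in freqs}

def swld(freqs, fitness, target="11"):
    """Assumed two-locus SWLD coefficient for `target`."""
    sel = selection_probabilities(freqs, fitness)
    # Building blocks: first locus fixed (e.g. 1*), second locus fixed (e.g. *1).
    first_block = sum(p for g, p in sel.items() if g[0] == target[0])
    second_block = sum(p for g, p in sel.items() if g[1] == target[1])
    return sel[target] - first_block * second_block

# Population in which the target genotype 11 is absent.
population = {"11": 0.0, "10": 0.3, "01": 0.3, "00": 0.4}

# Two very different illustrative landscapes.
additive = {"00": 1.0, "10": 2.0, "01": 2.0, "11": 3.0}
niah = {"00": 1.0, "10": 1.0, "01": 1.0, "11": 5.0}

delta_additive = swld(population, additive)
delta_niah = swld(population, niah)
```

In both cases the coefficient is negative because the first term vanishes when the target is absent, whatever fitness is assigned to it.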

In general though, as emphasized, the effects of recombination depend on the fitness landscape. Taking the classic Muller’s ratchet argument as a reason why recombination exists, it has been shown that modifier genes that lead to higher recombination rates could increase in the presence of negative *multiplicative* epistasis [?]. However, if the epistasis was too great the effect disappeared. Thus, in the parameter space for the landscape the advantage for recombination only appeared in a small region and therefore could not be offered as a generic explanation for the ubiquity of recombination and sex. In other work, [?] have provided evidence that recombination is particularly beneficial in an additive landscape with zero *additive* epistasis and very detrimental in a landscape with high positive *additive* epistasis. A simple way to see this is to eliminate any bias that comes from a particular choice of initial population and assume equal proportions for all genotypes. In this situation, it can be shown that for *any* that does not cut an epistatic link between loci. For instance, for a genotype , if , i.e., the landscape is additive, then for any . This result is also valid when the correspond to multiple loci when recombination does not cut any epistatic link between the loci. This is the case for a modular landscape, where loci divide up into disjoint sets with epistasis between the loci in a set but not between sets. The benefit of recombination in this case is that it efficiently increases the number of fit non-epistatically linked BBs in an offspring genotype relative to the numbers present in the parental types. On the contrary, for a highly additively epistatic fitness landscape, such as “needle-in-a-haystack” (NIAH)^{4}

One may argue, of course, that proving that over one generation for a particular choice of population and in particular fitness landscapes does not correspond to a “universal” mechanism for explaining the benefits of recombination. That is why in this paper we consider the general situation of an arbitrary fitness landscape and an arbitrary population, as well as considering multiple generations. To consider such generality, however, the price we must pay is to restrict to a small number of loci.

So, we would argue that two significant, and potentially related, regimes in which recombination is beneficial are: i) the *search* regime, where recombination searches for fit genotypes that presently either do not exist or are at very low frequency in the population; and ii) the *modular* regime, with either weak positive or negative additive epistasis, where recombination allows for the juxtaposition of distinct fit modules in different parental types into an even fitter offspring. Of course, in the *search* regime the question arises as to whether recombination is more efficient than mutation. This will depend on the Hamming or edit distance between parents and offspring. An example, that we will not consider in more detail, that exhibits the benefits of recombination over mutation in generating innovation, is the development of antibiotic resistance in bacteria through horizontal gene transfer. Generically, it will be the case that the Hamming or edit distance between the original parental sequences, say bacterium and virus, and the offspring sequence, bacterium with viral gene, will be potentially large. In other words, the difference between the initial and final sequences is not a single-nucleotide, or even a small number of them. In this sense, recombination-like^{5}

## 3 Modularity and Fitness Landscapes

Before considering our explicit model we wish to discuss the concept of modularity in terms of the fitness landscape. For simplicity, we restrict to binary alleles , where refers to the locus. We will consider two representations of the fitness function, a direct one where we use the directly and another one where the fitness function can be written as an expansion of the form

where represents an epistatic interaction between alleles located at loci and . The advantage of this latter representation is that the degree of epistasis between different loci and alleles can be simply deduced.

Any landscape that contains only Fourier components of is said to be an elementary landscape of order . For instance, a completely additive landscape has a fitness function of the form

and is therefore an elementary landscape of order one, as all Fourier components other than order one are zero. This is a consequence of the fact that there are no epistatic interactions between loci. Similarly, a multiplicative landscape, where

is an elementary landscape of order , as all Fourier components other than order are zero, there being epistatic interactions of order between the loci but no others. Other landscapes will be intermediate between these extremes. Once again, we emphasize here that we are measuring epistasis relative to the additive limit not the multiplicative one as has been the norm in most papers on recombination and population genetics.

A particularly interesting class of landscapes in terms of their relevance for recombination are those of “modular” type, where the loci of a genotype partition into disjoint subsets^{6}

the sum of the fitnesses of its constituent modules. This modularity will obviously leave an imprint in the expansion (9). For instance, if each module consists of loci and there is no epistasis between the modules then in (9) we will have for . In the second case, our notion of modularity is associated with the idea of genetic redundancy, whereby the fitness of a genotype is similar in the presence of different copy numbers of a given gene. The extreme limit of this is when the landscape is associated with an “OR” function, so that the fitness of a type is the same whether there is one or multiple copies of a gene. The intuition of a module in this context is that in the presence of redundancy with multiple copy number one, or maybe more, genes can be removed or mutated without affecting the fitness of the type. Thus, a gene acts as a module as it can be changed independently without affecting the fitness of the type. As we will see, this corresponds to a system with a maximal degree of negative epistasis.

As mentioned previously, a full analysis for loci with arbitrary landscape and population is prohibitively difficult, so here we will focus on the case of two loci, as in this case we can study in the context of an exactly solvable model the different regimes under which recombination can be beneficial. So, restricting ourselves to the case of two loci, , we have

For an additive (modular) landscape . For a multiplicative landscape . For a redundant (modular) landscape which, as mentioned, can be understood in terms of a Boolean “OR”, fitness being the same if either one or both alleles are optimal. For a NIAH landscape which, in contrast to the redundant landscape corresponds to a Boolean “AND” as fitness is only different if both alleles are optimal.
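The four two-locus landscape classes can be compared numerically. In the sketch below the fitness values are illustrative choices only, and the two measures, `additive_epistasis` (deviation from additivity) and `multiplicative_epistasis` (deviation from multiplicativity), are assumed definitions introduced here for the comparison.

```python
def additive_epistasis(f):
    """Deviation from additivity: f_11 + f_00 - f_10 - f_01."""
    return f["11"] + f["00"] - f["10"] - f["01"]

def multiplicative_epistasis(f):
    """Deviation from multiplicativity: f_11 * f_00 - f_10 * f_01."""
    return f["11"] * f["00"] - f["10"] * f["01"]

# Illustrative landscapes (values are assumptions, not taken from the paper).
LANDSCAPES = {
    "additive":       {"00": 1.0, "10": 2.0, "01": 2.0, "11": 3.0},
    "multiplicative": {"00": 1.0, "10": 2.0, "01": 2.0, "11": 4.0},
    "redundant (OR)": {"00": 1.0, "10": 2.0, "01": 2.0, "11": 2.0},
    "NIAH (AND)":     {"00": 1.0, "10": 1.0, "01": 1.0, "11": 2.0},
}

summary = {
    name: (additive_epistasis(f), multiplicative_epistasis(f))
    for name, f in LANDSCAPES.items()
}

for name, (add_eps, mult_eps) in summary.items():
    print(f"{name:15s} additive: {add_eps:+.1f}  multiplicative: {mult_eps:+.1f}")
```

With these values the additive landscape has zero additive but negative multiplicative epistasis, the OR landscape is negative on both measures, and the NIAH landscape is positive on both.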

## 4 Recombination in an exact two-locus model

### 4.1 Analytic results

Clearly, trying to characterize the efficacy of recombination quantitatively, and in detail, is prohibitively complicated. As we saw in Section 2, however, within the confines of the model we are considering, in a given generation, it can be characterized using only one fundamental function: the SWLD coefficient. The SWLD coefficient, though, depends not only on the recombination distribution, but also on the fitness landscape and the current state of the population. In other words it is a function of a large number of parameters. To circumvent this problem we consider the case of two loci and calculate the SWLD coefficient as a function of the fitness landscape and the population. Note that by two loci here we do not necessarily imply that they represent “genes”. They may represent any two structural units, such as exons, introns or other motifs, or nucleotides themselves, that can be separated or recombined by crossover and which can be characterized, as an approximation, by a fitness landscape that is independent of the rest of the genome.

For two loci all genotypes can be characterized by a multi-index , with , where is the cardinality of the alphabet that labels the loci, or alleles in the case of genes. For , there is only one non-trivial mask^{7}

Direct evaluation shows that

and thus the evolution equations in the two-allele, two-locus problem are:

The whole state of this system can be characterized by 3 frequencies that are naturally represented in a three dimensional simplex. Figure ? shows typical population trajectories in the two-locus, two-allele system for a generic landscape, with arbitrarily taken as the optimum genotype and several different initial population ratios.
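A minimal sketch of how such trajectories might be generated is given below. It assumes the standard haploid two-locus recurrence, in which proportional selection is followed by recombination with crossover probability `p_c`, and in which the recombination term is the post-selection disequilibrium; the function names and the example landscape are hypothetical.

```python
def step(p, f, p_c):
    """One generation of selection followed by recombination.

    p, f: dicts keyed by "00", "01", "10", "11" (frequencies, fitnesses).
    p_c:  crossover probability between the two loci (assumed parameter).
    """
    mean_f = sum(f[g] * p[g] for g in p)
    s = {g: f[g] * p[g] / mean_f for g in p}        # selection probabilities
    d = s["11"] * s["00"] - s["10"] * s["01"]        # post-selection disequilibrium
    return {
        "11": s["11"] - p_c * d,
        "00": s["00"] - p_c * d,
        "10": s["10"] + p_c * d,
        "01": s["01"] + p_c * d,
    }

def trajectory(p0, f, p_c, generations):
    """List of populations from generation 0 to `generations`."""
    history = [dict(p0)]
    for _ in range(generations):
        history.append(step(history[-1], f, p_c))
    return history

uniform = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}
additive = {"00": 1.0, "01": 2.0, "10": 2.0, "11": 3.0}   # illustrative landscape
history = trajectory(uniform, additive, p_c=0.5, generations=200)
```

Since the four frequencies always sum to one, any three of them suffice to plot a point of the trajectory in the simplex.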

As far as the fitness landscape is concerned the general parametrized two-locus two allele landscape is

where is the measure of the *additive* epistasis between the two loci. We take the genotype as the wild type, the genotypes and as single mutants and as a double mutant which is the optimal genotype. There are just three main landscape categories for the two-bit, two-locus model:

1. The wild type and the double mutant are the anti-optimum and optimum respectively.
2. One of the single mutants (10 or 01) is the anti-optimum.
3. The two lowest fitness phenotypes are the single mutants.

Any other case can be brought to one of the previous by a relabeling that doesn’t affect the dynamics. In the first two landscape types, a generic population will always eventually evolve towards the global optimum. In the third type, the population may converge to the optimum or the suboptimal wild type depending on the initial population and the recombination probability.^{8}

From Equations (6) and (8) we have

For the optimal genotype

As mentioned, the sign of determines the qualitative effect of recombination in a given generation. To develop some intuition for how the characteristics of the landscape affect our metrics we will set for the moment , i.e., a homogeneous population with no initial bias for one genotype versus another. As the parameter just sets the scale for the landscape we can without loss of generality for fitness proportional selection set . We will also set so that both single mutants have the same fitness. In this case,

For a multiplicative landscape and , as is well known. For an additive landscape and therefore . In this case recombination leads to a higher frequency of the optimal genotype in the next generation than selection alone. For a deceptive landscape, , but and so and recombination in this region of the parameter space leads to a lower frequency of the optimal genotype in the next generation. In terms of BBs, for deceptive landscapes, the marginal fitnesses are such that and , and so the reason why recombination is unfavourable is that the necessary mutant alleles for constructing the optimal genotype are deleterious relative to the corresponding alleles of the genotype . For additive epistasis, such that , we have and recombination once again leads to a lower frequency of the optimal genotype in the next generation than selection alone. Generally, if we take as signifying negative multiplicative epistasis then we see that in such landscapes recombination has a positive effect in terms of our metric, and the converse holds for positive multiplicative epistasis. Note that the additive limit corresponds to negative multiplicative epistasis. Interestingly, equation (15) shows that the greatest benefit from recombination, i.e., the minimum value of the SWLD coefficient, is associated with landscapes with negative *additive* epistasis, i.e., . Maximum negative epistasis is given by the minimum value of , . In this case .

Why would this maximum negative epistasis be associated with the utility of recombination, at least in terms of metric (6)? Examining equation (14) we see that the first term, proportional to , corresponds to elimination of the optimal genotype by recombining it with the suboptimal genotype , whereas the term proportional to corresponds to construction of via recombination of the single mutants and . It is the competition between these two effects that measures the benefits of recombination in terms of (6). Additive landscapes with reduce the impact of destruction without compromising the positive effect of reconstruction. Negative epistasis, on the other hand, does not affect the construction of the optimal genotype by recombining the single mutants, but it does minimize the effect of destruction of the optimal genotype. The maximal effect is when and corresponds to a Boolean “OR” landscape where . This is the situation where there is genetic redundancy, as the fitness of the optimal phenotype requires the presence of only one optimal allele not both. At this naive level we also see that the benefit of recombination is not restricted to small negative multiplicative epistasis but, rather, the larger the additive negative epistasis the larger the benefit conferred by it.
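The competition between destruction and construction can be made explicit for two loci. The following is a sketch in assumed notation: $P_I$ and $f_I$ are the frequency and fitness of genotype $I$, $\bar f$ is the mean fitness, primes denote selection probabilities under proportional selection, and $\Delta_{11}$ is the SWLD coefficient of the optimal genotype, written as the selection probability of the genotype minus the product of the selection probabilities of its two building blocks.

```latex
\begin{aligned}
\Delta_{11} &= P'_{11} - P'_{1*}\,P'_{*1}
             = P'_{11} - (P'_{11}+P'_{10})(P'_{11}+P'_{01}) \\
            &= P'_{11}\,\bigl(1 - P'_{11} - P'_{10} - P'_{01}\bigr) - P'_{10}P'_{01}
             = P'_{11}P'_{00} - P'_{10}P'_{01} \\
            &= \frac{f_{11}f_{00}\,P_{11}P_{00} \;-\; f_{10}f_{01}\,P_{10}P_{01}}{\bar f^{\,2}} .
\end{aligned}
```

Under these assumptions the first term in the numerator is the destruction channel (the optimum crossed with the wild type) and the second is the construction channel (the two single mutants crossed with each other). For equal genotype frequencies the sign of $\Delta_{11}$ reduces to the sign of $f_{11}f_{00} - f_{10}f_{01}$.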

In terms of the metric (7) the contribution from recombination is given by

For this term to give a positive contribution to the average population fitness we require . For this requires , which we will term weak positive additive epistasis. On the other hand, for , and recombination apparently leads to a decrease in the average population fitness, while in the additive limit, , there is no change. Together, a one-generation analysis of our two metrics would indicate that there are benefits to recombination from both of them only for weakly positively additively epistatic landscapes such that and . We will characterize these landscapes as being “modular”, i.e., quasi-additive. It is important, however, to go beyond a single generation, and for that we will consider metric (8) in Section 5.
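The one-generation behavior of both metrics can be checked numerically for a uniform population. The sketch below assumes that recombination contributes minus the crossover probability times the SWLD coefficient to the frequency of the optimum, and minus the crossover probability times the fitness-weighted sum of SWLD coefficients to the mean fitness. The names, the parameter `eps` for additive epistasis and the fitness values are illustrative assumptions.

```python
GENOTYPES = ("00", "01", "10", "11")

def swld_all(p, f):
    """Assumed SWLD coefficient of every genotype in the two-locus model."""
    mean_f = sum(f[g] * p[g] for g in GENOTYPES)
    s = {g: f[g] * p[g] / mean_f for g in GENOTYPES}
    delta = {}
    for g in GENOTYPES:
        first = sum(s[h] for h in GENOTYPES if h[0] == g[0])    # block g[0]*
        second = sum(s[h] for h in GENOTYPES if h[1] == g[1])   # block *g[1]
        delta[g] = s[g] - first * second
    return delta

def recombination_contributions(p, f, p_c, optimum="11"):
    """Recombination's one-generation contribution to (optimum frequency, mean fitness)."""
    delta = swld_all(p, f)
    freq_gain = -p_c * delta[optimum]
    fitness_gain = -p_c * sum(f[g] * delta[g] for g in GENOTYPES)
    return freq_gain, fitness_gain

def landscape(eps):
    """Hypothetical landscape: additive baseline plus additive epistasis eps."""
    return {"00": 1.0, "01": 2.0, "10": 2.0, "11": 3.0 + eps}

uniform = {g: 0.25 for g in GENOTYPES}
results = {
    eps: recombination_contributions(uniform, landscape(eps), p_c=0.5)
    for eps in (-0.5, 0.0, 0.5, 2.0)
}
```

For this baseline the multiplicative landscape sits at `eps = 1`, so `eps = 0.5` is the weakly positive case in which both contributions are positive, while `eps = 2.0` lies beyond it and both are negative.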

#### Muller’s Ratchet.

Muller’s ratchet [?]^{9}

There are two regimes of interest related to Muller’s ratchet, one is that advantageous mutations appear in a population and the second that deleterious mutations appear. The question is: How does recombination affect the dynamics of these mutants? Considering the first case, if we consider the population to be such that the fit double mutant is absent, i.e., ,^{10}

From Equation (17) we see that the number of fit double mutants increases from generation to generation due to the effect of recombination relative to selection only dynamics. This is, in fact, independent of the fitness landscape, being associated with the *search* regime of recombination alluded to in Section 2.1. In contrast, in Equation (18), we see that the average population fitness will increase in the presence of recombination if and only if , which is a direct measure of the degree of *additive* epistasis between the two loci. As noted, for a purely additive landscape, and so recombination is neutral in this setting. For the other genotypes, the fraction of wild types increases due to the effect of recombination, while the frequency of single mutants decreases. What happens in the case where the double mutant is already present will be considered in Section 5, as the benefit from recombination then depends on the actual population as well as the landscape.
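These statements can be checked with a small numerical sketch for a population in which the double mutant is absent. It assumes the standard haploid two-locus recurrence with crossover probability `p_c`; the tuple ordering (11, 10, 01, 00), the function names and the fitness values are hypothetical.

```python
def evolve(p, f, p_c, generations=1):
    """Iterate selection + recombination; tuples are ordered (11, 10, 01, 00)."""
    p11, p10, p01, p00 = p
    f11, f10, f01, f00 = f
    for _ in range(generations):
        mean_f = f11 * p11 + f10 * p10 + f01 * p01 + f00 * p00
        s11, s10, s01, s00 = (f11 * p11 / mean_f, f10 * p10 / mean_f,
                              f01 * p01 / mean_f, f00 * p00 / mean_f)
        d = s11 * s00 - s10 * s01          # post-selection disequilibrium
        p11, p10, p01, p00 = (s11 - p_c * d, s10 + p_c * d,
                              s01 + p_c * d, s00 - p_c * d)
    return (p11, p10, p01, p00)

def mean_fitness(p, f):
    return sum(freq * fit for freq, fit in zip(p, f))

start = (0.0, 0.1, 0.1, 0.8)               # double mutant absent

def fitness_gain(f, p_c=0.5):
    """Mean fitness after one generation with recombination minus without."""
    return mean_fitness(evolve(start, f, p_c), f) - mean_fitness(evolve(start, f, 0.0), f)

gain_positive_eps = fitness_gain((3.5, 2.0, 2.0, 1.0))   # additive epistasis +0.5
gain_additive = fitness_gain((3.0, 2.0, 2.0, 1.0))       # additive epistasis 0
gain_negative_eps = fitness_gain((2.5, 2.0, 2.0, 1.0))   # additive epistasis -0.5

with_recombination = evolve(start, (3.0, 2.0, 2.0, 1.0), 0.5)
without_recombination = evolve(start, (3.0, 2.0, 2.0, 1.0), 0.0, generations=50)
```

Without mutation or recombination the double mutant never appears, whereas a single generation of recombination already produces it and raises the wild-type fraction relative to selection alone.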

Turning now to the case of deleterious mutants: in this case we take the wild type to be the genotype and the types and to be deleterious single mutants and to be an even more deleterious double mutant. In this case, just as for beneficial mutants, and hence the proportion of optimal wild types increases. In terms of average population fitness, the increase from one generation to the next is given by Equation (18). In other words, the change in average population fitness per generation for the case of beneficial versus deleterious mutations is identical if we are considering the same fitness landscape.

#### Asymptotic behavior of the SWLD coefficient

Before going on to consider the full numerical solution of the two-locus model we will consider what can be said analytically about the asymptotic behavior of the system. Although there are 7 parameters that control the dynamics, the asymptotic behavior can be most naturally written in terms of just two parameters

where, for brevity, we use for , and

The one generation evolution equation for is

Without loss of generality we again choose to be the optimal genotype. The evolution of the genotype frequencies, , as given by equation (Equation 1), ensures the eventual dominance of one of the genotypes^{11}*a priori* that the limit

exists, which in turn implies that

and

With these elements in hand we can calculate the putative limit of Equation (19) to find:

Solving this last equation for we obtain:

Finally, since and , we note that the negativity of is equivalent to the condition

which reduces to for . So, we can see that the asymptotic benefit of recombination in terms of increasing the fraction of optimal genotypes relative to selection only, is determined by only 2 parameters - and and is independent of the initial population.

With this formula in hand, we can easily map any fitness landscape to a range of values for and thus determine if recombination will be asymptotically favorable for that particular landscape. We have

To simplify further the visualization of the asymptotic behavior, we again assume that , i.e., that the two mutants have the same fitness. As eventually the optimal genotype dominates for non-deceptive landscapes, recombination is asymptotically neutral. However, how approaches zero depends on . Small values of and correspond to a more neutral fitness landscape, where selection effects are small. For an additive landscape and so recombination is asymptotically beneficial in that tends to zero from negative values. Small values of relative to correspond to highly positively additively epistatic landscapes and in this case and recombination is asymptotically disadvantageous in that approaches zero from positive values. The multiplicative landscape with and, hence, , separates the two classes of behavior. The dependence of the parameter (= as a function of and is shown in the next graph. Values of greater than mean that the iterates must eventually reach negative values of . The sign of is then conserved, although the magnitude approaches zero as the system reaches linkage equilibrium associated with a population dominated by the optimal genotype. The opposite happens when . Note that the locus defined by the intersection of the surfaces and is given by and corresponds to the case of multiplicative landscapes.

## 5 Exact Numerical Results

Turning now to the non-asymptotic behavior, we performed an exploration of the 7-dimensional parameter space of the two-locus, two-allele system to determine under which conditions recombination is beneficial in terms of our two metrics (Equation 12) and (Equation 13). In such a high-dimensional space, visualization of the resulting graphs requires separation into several distinct cases. We set in all the following, as just affects the magnitude of the effects of recombination and not whether it is beneficial, which is controlled by the sign of .^{12}

### 5.1 Recombination as a function of fitness landscape

We first consider graphs for arbitrary fitness landscapes but for a fixed initial population, with a further subdivision into cases made according to the type of initial population. As we have fixed and set we display the graphs as functions of and . The valid region, all fitnesses positive with the genotype as optimum, is given by , and . The deceptive region is given by . For ease of interpretation we also show lines associated with the multiplicative limit (yellow) and the additive limit (green). Note that both the additive and multiplicative limits require . The “needle-in-a-haystack” landscape is given by , and lies on the border that separates non-deceptive and deceptive landscapes. The point , corresponds to a flat fitness landscape where there is no selection pressure.

Two kinds of graphs are provided, one that displays the value of the SWLD coefficient in different generations, and another that displays (Equation (8)), defined as the change in average fitness between generation and generation in a population evolving with both selection and recombination minus the change in average fitness of the same population but evolving with selection only. In the graphs we show four representative time slices - , 2, 6, and 10 generations after the initial one. The plane that separates the recombination advantageous/disadvantageous regimes is displayed (turquoise in the online version). For a given generation, those values of and where are shaded in red (below the plane), while those where correspond to a darker shading (above the plane).
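The two plotted quantities can be sketched for a single landscape and initial population as follows. This is an illustration under stated assumptions: the update rule is the standard haploid selection-then-recombination step, the SWLD coefficient is taken to be the linkage disequilibrium of the post-selection frequencies, and the function names, time slices, fitness values and initial frequencies are all hypothetical.

```python
import numpy as np

# Assumed genotype order for this sketch: [00, 01, 10, 11], with 11 optimal.
SIGNS = np.array([-1.0, 1.0, 1.0, -1.0])

def select(p, f):
    """Proportional selection."""
    w = p * f
    return w / w.sum()

def swld(p, f):
    """p'_11 p'_00 - p'_10 p'_01 on post-selection frequencies."""
    ps = select(p, f)
    return ps[3] * ps[0] - ps[2] * ps[1]

def step(p, f, r):
    """Selection followed by recombination at effective rate r."""
    return select(p, f) + r * swld(p, f) * SIGNS

def metrics(p0, f, r, slices):
    """Return {t: (SWLD, fitness gain)} for the requested generations.

    The fitness gain at generation t compares one further generation with
    recombination against one further generation of selection only, both
    starting from the same population at t."""
    p = np.asarray(p0, float)
    f = np.asarray(f, float)
    out = {}
    for t in range(max(slices) + 1):
        if t in slices:
            gain = f @ step(p, f, r) - f @ select(p, f)
            out[t] = (swld(p, f), gain)
        p = step(p, f, r)
    return out

# Hypothetical example: wild-type dominated population, additive landscape.
p0 = np.array([0.98, 0.01, 0.01, 0.00])
f = np.array([1.0, 1.5, 1.5, 2.0])
for t, (d, g) in metrics(p0, f, r=0.5, slices=(0, 2, 6, 10)).items():
    print(t, d, g)
```

Repeating this calculation over a grid of landscape parameters, one grid per time slice, would give surfaces of the kind described above, with the sign of the first entry marking the advantageous and disadvantageous sides of the separating plane.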

#### Initial Population

In this first case we consider the dynamics when the initial population is dominated by the non-optimal wild type , with , , , . So, we are here interested in the effects of recombination on the dynamics of favourable mutations as a function of the fitness landscape and in the background of an initial population dominated by a non-optimal wild type. We fix and study the variation in as a function of and , remembering the restrictions and . The most notable feature of Figure ? is that negative values of are most associated with additive or negatively epistatic landscapes. Note that earlier in the evolution, , the benefits of recombination are clear, even for quite positively epistatic interactions, with only deceptive landscapes showing a disadvantage. This, however, is due to this region being still in the *search* regime, as the initial frequency of optimal genotypes was zero. Gradually, the population moves away from the *search* regime and enters the *modular* regime, where we see that it is only for landscapes that are either weakly positively epistatic, additive or negatively epistatic that recombination is beneficial. Note that the relative benefit of recombination is not fixed but evolves, thus showing the dependence on the relative frequencies of the different genotypes. In terms of BBs, becomes positive when so, as the frequency of the optimal type increases, eventually recombination becomes unfavourable relative to selection only, with the point at which it becomes unfavourable, , being dependent on the fitness landscape, as well as the initial population.

Turning now to the graphs of the change in average fitness of the population; at , in the *search* regime, we see that recombination leads to an increase in average population fitness, over and above that of selection only, for basically all landscapes. This is due to the addition of optimal genotypes in an initial population dominated by the non-optimal wild type. Gradually, however, the effect of recombination diminishes as one enters the *modular* regime so that for positively epistatic landscapes the difference between selection only and recombinative dynamics is minimal. However, we note that there is still a strongly pronounced effect for either weakly positively epistatic, additive or weakly negatively epistatic landscapes.

So, how do we interpret these results in terms of BBs? Both in the *search* and *modular* regimes the advantage of recombination is associated with the fact that BBs of the optimal genotype, and , are recombined to form the type . As the graphs show, this recombination of BBs is, in fact, a more efficient process in generating optimal types and increasing overall population fitness than selection alone for weakly epistatic landscapes. In fact, the benefit in the *search* regime is actually relatively independent of the degree of epistasis of the landscape. Later on though, in the *modular* regime, the generation of optimal genotypes by recombining optimal BBs competes against the generation that evolved through pure selection effects. For positively epistatic landscapes, once there are enough optimal types, selection can produce new ones as efficiently as, or more efficiently than, recombination. For *modular* landscapes, however, recombination retains its advantage. Indeed, this is what characterizes the *modular* regime, i.e., that weakly epistatic BBs or modules are juxtaposed by recombination into even fitter genotypes, leading to a faster evolution and a faster increase in average population fitness. The fact that recombination is even more beneficial in the presence of additive negative epistasis is due to the fact that the destruction of the optimal type produces two single mutants that have fitness very similar to that of the optimal type. This is the advantage of genetic redundancy.

#### Initial Population

We now turn to the case where the initial population is dominated by the optimal genotype as the wild type with the presence of genotypes with a single deleterious mutation and a small proportion of deleterious double mutant genotypes. Specifically, , , and . The question now is: What are the dynamics of the deleterious mutations in the population as a function of the landscape parameters? Once again, we fix and study the variation in as a function of and .

In Figure ? the first thing to notice is that, in contrast to the case where the initial population is dominated by the non-optimal genotype, here there is no distinct behavior associated with the *search* regime, as the optimal genotype is already dominant in the population. Thus, for positively epistatic landscapes the difference due to recombination is small. However, for additive or negatively epistatic landscapes we see that recombination is advantageous, with the advantage being more significant in the presence of negative epistasis. This is due to the fact that in such landscapes the elimination of the suboptimal double mutant is more efficient.

Considering now the average population fitness, we see clearly in Figure ? how the advantage of recombination manifests itself in the *modular* regime where epistasis is weak. Interestingly, we see how negatively epistatic landscapes are, in the early part of the evolution, associated with . This is due to the fact that for negative epistasis the overall contribution to the population fitness of a deleterious double mutant and an optimal genotype is less double mutant, selection can eliminate the mutations thereby purifying the population more efficiently than selection alone. The more modular the landscape the more efficient this process becomes.

#### Initial Population , ,

We now consider a scenario similar to that of sub-section ?, where the initial proportion of optimal genotypes is zero; but now, however, the frequency of the BBs, and , represented by the beneficial mutants and , relative to the less fit wild type is much higher. Concretely, the initial population is: , , and so that the BBs and each form about a quarter of the population.

We see in Figure ? that the graphs are qualitatively similar to those of Figure ?. The chief difference now is that recombination is even more disadvantageous in the *search* regime for deceptive landscapes than before and more advantageous for modular landscapes - weak or zero positive epistasis or negative epistasis. This is due to the wider availability of the BBs, which obstructs or facilitates the construction of the optimal type according to whether the landscape is deceptive or modular. As evolution progresses, as before, we see a passage from the *search* regime to the *modular* regime, where the relative benefit of recombination is restricted to weakly positively epistatic, additive or negatively epistatic landscapes.

Similarly, in Figure ? we see a similarity with the corresponding graphs of Figure ?, the average population fitness showing a strong increase, relative to the selection only case, due to the efficient formation of the optimal type, which in its turn is due to the large number of BBs in the population. Even for strongly epistatic landscapes there is a strong benefit to recombination in this regime. At later times, in the *modular* regime, we see that the advantage of recombination is again associated with additive, weakly positively epistatic or negatively epistatic landscapes, i.e., modular landscapes.

So, we see that the principal effect of increasing the BB frequency in the initial population is to accelerate the rate of evolution so that the frequency of the optimal genotype and the average population fitness increase more rapidly.

#### Initial Population ,

We now look at an even more extreme case, where the initial population is completely dominated by the single mutants and with the initial population being , , and . Qualitatively the results are as in sub-sections ? and ?; the strong presence of the BBs and leading to a very efficient production of the optimal genotype . This is, in fact, another good illustration of Muller’s ratchet. Although recombination leads to the generation of optimal genotypes it also leads to the production of the sub-optimal double mutants . The latter, however, as the graphs clearly show, are flushed out by selection. In fact, as Figure ? shows, they are produced and then flushed out most efficiently in the presence of recombination for modular landscapes when compared to selection only.

#### Initial Homogeneous Population

The final initial population type we will consider is that of a uniform initial population where all genotypes have the same initial frequency, . Here we see behaviour that is qualitatively similar to that found for other populations. The chief difference here is that given the ample presence of the optimal genotype in the initial population there is no *search* regime and so the dynamics begins and remains in the *modular* regime. With no population bias we can see the role played by the multiplicative limit with at being positive for landscapes with positive multiplicative epistasis and, particularly, deceptive landscapes. It is negative for weakly positively epistatic, additive and negatively epistatic landscapes. As evolution progresses we can see that the relative advantage diminishes such that at the advantage of recombination is only noticeable for larger negative epistasis.
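Why the multiplicative limit separates the two signs for a uniform population can be seen from a short calculation. The sketch below assumes that the SWLD coefficient is the linkage disequilibrium of the post-selection frequencies; the symbols ($f_{ij}$ for genotype fitness, $P'_{ij}$ for post-selection frequency, $\bar f$ for mean fitness, $\Delta$ for the SWLD coefficient) are notation chosen here for illustration and may differ from the notation of the earlier sections.

```latex
\begin{align}
P_{ij}(0) = \tfrac{1}{4}
  \;&\Rightarrow\;
  P'_{ij} = \frac{f_{ij}}{4\bar f},
  \qquad \bar f = \tfrac{1}{4}\sum_{i,j} f_{ij}, \\
\Delta = P'_{11}P'_{00} - P'_{10}P'_{01}
  \;&=\; \frac{f_{11}f_{00} - f_{10}f_{01}}{16\,\bar f^{\,2}} .
\end{align}
```

Under these assumptions the sign of $\Delta$ in the first generation is the sign of $f_{11}f_{00} - f_{10}f_{01}$, which vanishes exactly on multiplicative landscapes and is positive for positive multiplicative epistasis.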

In terms of average population fitness in Figure ? we see an analogous story: at average population fitness is increased only for landscapes with negative multiplicative epistasis, up to the additive limit, but is, in fact, negative for negative additive epistasis. However, as evolution progresses, once again, we see the dominant role played by modular landscapes - i.e., weakly positively epistatic, additive and negatively epistatic landscapes.

### 5.2 Recombination as a function of population

Having explored the effect of recombination on the space of fitness landscapes, by varying continuously the landscape parameters and for a variety of distinct initial populations, we now consider the complementary viewpoint of considering how the effect of recombination changes by varying continuously the initial population for a variety of fixed fitness landscapes. Due to the conservation of probability, the population vector is characterized by only three frequencies. For simplicity of visualization we will consider initial populations such that and consider the population dynamics as a function of and .

A general observation on almost all the graphs in this section is that, since there is generic convergence to the optimal genotype for non-deceptive landscapes, all the surfaces clearly have in the corner.

#### Additive landscape .

The first landscape we will consider is an additive landscape (). For this landscape (Figure ?) the tendency is clear, that the more BBs and the fewer optimal types there are, the more recombination helps. This is again a manifestation of the *search* regime. In this landscape, as can be seen at , recombination in terms of is only unfavorable when the proportion of optimal types is appreciably larger than the frequencies of the BBs, as then selection can act more efficiently to increase the frequency of the optimal type than can recombination of the single mutants. However, we see that this effect is temporary. By basically any initial population is associated with . We can see that the SWLD increases in time, approaching zero asymptotically, this regime being associated with the approach to a population completely dominated by the optimal genotype. These dynamics, in fact, show an important universality associated with recombination, that demonstrates the role of Muller’s ratchet: that the action of recombination is to drive the system to particular frequencies for the optimal type and its BBs that correspond to quite special initial conditions at . To understand this, note that at and the proportion of optimal types is high. If we imagine the value of , for example, that is a consequence of evolution in the presence of recombination, then we can map those values such as to imagine them as initial conditions, say at , for further evolution. However, we can observe at that values of close to 1 correspond to positive values of except in a very narrow wedge where the values of are as high as possible. This wedge is associated precisely with a lower relative frequency of the suboptimal genotype. The conclusion is that recombination is removing the suboptimal genotype more efficiently than selection only.

Finally, the presence of a trough associated with quite negative values of for and is a consequence of the fact that the search regime is more extensive when the frequency of both optimal genotype and BBs is low.

#### Neutral landscape: , ()

For a neutral landscape, where the effects of selection are null, as with the additive landscape, the “the more BBs the better recombination is” rule is valid, but we see a different behavior as a function of initial population. For neutral evolution, the SWLD, , and the standard linkage disequilibrium coefficient, , are the same. So, Figure ? shows the approach to the Geiringer or Robbins manifold, defined by . The approach to this manifold is from the negative or positive side depending on whether the initial population is dominated by the BBs and , or by the optimal genotype . The Geiringer limit has been amply studied in the literature [?]. Thus, recombination is beneficial when there is an ample supply of BBs and few optimal types, and deleterious when there are no BBs. The minimal value of is for and the maximal for , .
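The approach to linkage equilibrium on a flat landscape can be illustrated with a short calculation. The sketch below assumes the standard two-locus recombination update with an effective recombination rate `r`, for which the linkage disequilibrium shrinks by a factor `1 - r` per generation and keeps its sign; the genotype ordering, function names and frequencies are illustrative assumptions.

```python
import numpy as np

# Assumed genotype order for this sketch: [00, 01, 10, 11].
SIGNS = np.array([-1.0, 1.0, 1.0, -1.0])

def linkage_disequilibrium(p):
    """Standard coefficient D = p_11 p_00 - p_10 p_01."""
    return p[3] * p[0] - p[2] * p[1]

def neutral_step(p, r):
    """One generation of recombination with all fitnesses equal,
    so that selection leaves the frequencies unchanged."""
    return p + r * linkage_disequilibrium(p) * SIGNS

def d_trajectory(p0, r, generations):
    """D at generations 0..generations under neutral evolution."""
    p = np.asarray(p0, float)
    values = [linkage_disequilibrium(p)]
    for _ in range(generations):
        p = neutral_step(p, r)
        values.append(linkage_disequilibrium(p))
    return np.array(values)

# Hypothetical populations: one dominated by the single mutants (D < 0),
# one dominated by the 11 genotype with no single mutants (D > 0).
bb_rich = np.array([0.10, 0.45, 0.45, 0.00])
optimum_rich = np.array([0.20, 0.00, 0.00, 0.80])
print(d_trajectory(bb_rich, 0.5, 5))
print(d_trajectory(optimum_rich, 0.5, 5))
```

In this sketch a population rich in single mutants approaches the manifold from negative values of D and a population with no single mutants approaches it from positive values, in line with the two directions of approach described above.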

#### Multiplicative landscape , ,

This landscape satisfies the multiplicative constraint that . Here we see that recombination is favorable in the *search* regime where the BB frequency is high and the frequency of the optimal genotype is low. However, for other than very small we can see that recombination is somewhat unfavorable when the BB frequency is relatively low but, in the main, it is generally neutral in its effects. This is consistent with known results for multiplicative landscapes. In fact, viewing the time evolution, even if one starts in the *search* regime we see that very quickly the system approaches linkage equilibrium.

#### Needle-In-A-Haystack, , , ()

We now turn to the case of a landscape with maximally positive epistasis - NIAH, which, as mentioned, has been used extensively in models of molecular evolution and, especially, in considerations of selection-mutation balance and the existence of error thresholds. Here, it corresponds to a Boolean “AND” function on the two loci. As a function of the initial population we can clearly see that in the *search* regime, where there is an ample supply of BBs and only a zero or small proportion of the optimal genotype, that recombination is favorable, both in terms of leading to a more efficient production of the optimal genotype when compared to selection only () as well as a more fit population (, Figure ?). On the other hand, away from the *search* regime it is clear that the effects of recombination are unfavorable. Note that the advantage or disadvantage of recombination decreases in time as the system gets closer to linkage equilibrium, this equilibrium being associated with a population dominated by the optimal genotype.

#### Landscape with Genetic Redundancy, , ,

For a landscape with maximal negative epistasis, corresponding to an “OR” Boolean function on the two loci we see in Figure ? that very rapidly recombination becomes beneficial in terms of for any initial population.

#### Deceptive Landscape, , ,

Finally, a deceptive landscape (Figure ?) offers a complete contrast to that of a redundant one, with recombination being disadvantageous in terms of for any initial population.
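The comparison between the needle-in-a-haystack, redundant and deceptive landscapes can be explored with a small scan over initial populations. This is a minimal sketch, assuming the standard haploid selection-then-recombination update and taking the SWLD coefficient to be the linkage disequilibrium of the post-selection frequencies; the fitness values, the grid of initial populations and all function names are hypothetical and are not the parameters used for the figures.

```python
import numpy as np

# Assumed genotype order for this sketch: [00, 01, 10, 11].
SIGNS = np.array([-1.0, 1.0, 1.0, -1.0])

def select(p, f):
    """Proportional selection."""
    w = p * f
    return w / w.sum()

def swld(p, f):
    """p'_11 p'_00 - p'_10 p'_01 on post-selection frequencies."""
    ps = select(p, f)
    return ps[3] * ps[0] - ps[2] * ps[1]

def step(p, f, r):
    """Selection followed by recombination at effective rate r."""
    return select(p, f) + r * swld(p, f) * SIGNS

# Hypothetical fitness values for the three landscape types.
LANDSCAPES = {
    "needle (AND)": np.array([1.0, 1.0, 1.0, 2.0]),
    "redundant (OR)": np.array([1.0, 2.0, 2.0, 2.0]),
    "deceptive": np.array([1.5, 1.0, 1.0, 2.0]),  # single mutants below 00
}

def fraction_beneficial(f, r=0.5, generations=5, n=15):
    """Fraction of a grid of initial populations (equal single-mutant
    frequencies) for which the SWLD is negative after `generations` steps."""
    count = total = 0
    for x11 in np.linspace(0.0, 0.9, n):
        for xbb in np.linspace(0.02, 0.45, n):
            x00 = 1.0 - x11 - 2.0 * xbb
            if x00 < 0.0:
                continue  # outside the simplex
            p = np.array([x00, xbb, xbb, x11])
            for _ in range(generations):
                p = step(p, f, r)
            total += 1
            count += int(swld(p, f) < 0.0)
    return count / total

for name, f in LANDSCAPES.items():
    print(name, fraction_beneficial(f))
```

The printed fractions depend on the hypothetical fitness values and on the grid, so they should be read as a qualitative check on the contrast between redundant and deceptive landscapes rather than as a reproduction of the figures.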

## 6 Conclusion

As discussed in the introduction, genetic recombination remains a puzzle as far as having a full, intuitive understanding of why it is so prevalent, with no generally accepted explanation of its benefits. Many theoretical analyses have been performed. The vast majority of these have been in the context of variations on a theme of standard population genetics models - haploid, diploid, with modifier genes, without modifier genes, with finite population, with infinite population, with mutation, without mutation, with few loci, with many loci, with different fitness landscapes, with different population states etc. Of course, to understand the benefits of recombination in the context of a mathematical model, the model itself must contain a description of the mechanisms that explain why it is useful in the first place. The question is then: do the benefits lie outside the context of the models that have been studied, or are they hidden within the results of these models? If the former is true, then one must formulate a new model, with new features, which will then make manifest its utility. On the other hand, if the latter is the case, then it is important to have a model that can be studied exhaustively, in that there is no region of the parameter space of the model that remains unexplored. Additionally, the model should be such that the effective degrees of freedom of the underlying system are manifest.

Previous work [?], both analytical and numerical, has hinted at the fact that recombination seems to be especially useful in the context of quasi-additive landscapes, while other work has shown a role for weak negative multiplicative epistasis. However, these analyses did not cover the full parameter space of the considered models, and so there is always doubt that the landscapes or initial populations considered were not representative and therefore any identified benefits of recombination were not “universal” but, rather, tied to the specific scenario considered. To counter these arguments, in this paper, we have taken the route of fixing a simple model - a two locus, two allele system of haploid sequences with non-overlapping generations evolving in the presence of selection and homologous recombination - but have analyzed the full parameter space of the model. This corresponds to three population variables and three landscape parameters. Having fixed the model, we can begin to look for the regions of parameter space, if any, in which recombination is beneficial. Of course, we first have to define what we mean by “beneficial”. In this paper we fixed two metrics: one was the SWLD coefficient for the optimal genotype that measures the excess production of such types over and above that which is produced by selection only; and the other is the increase in average population fitness over and above that which would be produced by selection only. With these two metrics we measure the benefits of recombination in terms of its capacity to lead to higher proportions of fitter genotypes and fitter populations relative to selection only.

So, what does our analysis of the parameter space of this model tell us? The analyses we have carried out are consistent with the previous results of [?] where it was shown that there are two important, but distinct, regimes in which recombination is beneficial in terms of both the metrics that we have used to characterize its benefits. The first of these is the *search* regime, which is associated with conditions where the fittest genotype is either not present or only at low frequency. In this regime the benefit from recombination is relatively independent of the fitness landscape. However, exactly how beneficial it is does depend on both the landscape and the actual population. The second regime we have termed the *modular* regime and is associated with weakly additively epistatic landscapes, i.e., quasi-additive landscapes. However, the fact that we have analyzed the set of possible landscapes and populations, allows us to go beyond this restricted analysis and observe and characterize several important universal properties of recombination.

Firstly, in terms of there is a clear association between the sign of the epistasis and the sign of . Production of the optimal genotype () is more favorable in the presence of negative additive epistasis () than for positive additive epistasis () for beneficial mutations. It is also disfavored when single mutants ( and ) are less fit () than the suboptimal genotype . What is more, by following the dynamics across multiple generations, we see that recombinative evolution itself is directed towards favoring landscapes that are more and more modular, more and more negatively epistatic. This is a universal feature that is independent of the initial population.

In terms of the increase in average population fitness relative to selection only dynamics we see a profoundly interesting dynamic. For the different initial populations considered when investigating evolution as a function of landscape, we see that there is an initial regime (t=1) wherein there is a perceived benefit from recombination for a wide array of landscapes with, in fact, under some circumstances, a relative advantage for landscapes with positive versus negative additive epistasis. However, as evolution progresses, , the benefit from recombination has become restricted to quasi-additive or negatively additively epistatic landscapes independently of the initial population. This is best understood by viewing Figure ?, where the initial population is homogeneous, which means that it begins on the Geiringer manifold. There, we see that recombination is disfavored initially () for any positively *multiplicatively* epistatic landscape - including deceptive landscapes - and for any *additively* negatively epistatic landscape. However, very quickly the universal tendency towards favoring quasi-additive and negatively additively epistatic landscapes sets in. There are many works [?], some recent [?], in which the role of negative epistasis between mutations in evolution is discussed. It must be noted that in these references negative epistasis means sub-multiplicative epistasis, that is, epistasis is quantified with a parameter whose magnitude measures deviations of the logarithm of fitness from linearity as a function of the number of mutations. In our results we included both supra and sub-additive (concerning the sign of ) and supra and sub-multiplicative (concerning the sign of ) epistatic regimes and, importantly it is the existence of negative additive epistasis that seems to be important for recombination.

As a function of the initial population, we see a complementary but completely consistent point of view relative to that of landscape. At we can see the effect of any initial linkage disequilibrium with the sign of being strongly affected by the sign of : more/less BBs relative to or being associated with /. The effect of deception is to disfavor recombination for basically any population, while for a genetically redundant landscape it is to favor it for any initial population.

We believe that the results of this paper unite various important threads of modern evolutionary thought - the ubiquity of genetic recombination, the ubiquity of modularity and, relatedly, the ubiquity of genetic redundancy, and thereby offer a quite universal explanation of why recombination is so widespread. This paper is not the appropriate forum in which to discuss the reasons why modularity and redundancy themselves are so important. There are many papers on the subject. However, it is amazing that the benefits of recombination seem to be so intimately tied to these phenomena, at least in the framework of the fitness landscape paradigm as discussed here. In the space of all possible landscapes we have shown that the benefits of recombination are manifest only for quasi-additive or negatively additively epistatic landscapes, a quite restricted subset of landscape space. However, it is precisely such landscapes that seem to be so common in biology. In other words, our conclusion is that recombination is so widespread because it leads to important evolutionary benefits only for systems that are modular and/or redundant, and it is precisely such landscapes that seem to be the norm. This leads, indeed, to another evolutionary “chicken and egg” puzzle. Did recombination evolve to take advantage of the existence of modularity and redundancy or vice versa? We would posit that there has been a strong co-evolutionary link between them since the beginnings of life, with recombination distributions and fitness landscapes co-evolving to maximize the benefits of one with the other.

So, what are the weak points of our model and analysis? Well, first of all one could criticize the simplicity of the model, although the model shares many features with previous analyses. The fact that only two loci are considered is the price we pay for being able to consider the full parameter space. However, it is worth mentioning again that these “loci” could represent different levels of description from, in principle, nucleotides up to entire sets of genes. Our other restriction is that we can describe each locus in terms of two possible states. We are quite sure that no qualitative effect that we have observed here depends on the existence of only two alleles. The question is: are the effects we see and the conclusions we make from the two-locus model generalizable to multi-locus models? Unfortunately, we cannot analyze exhaustively the full parameter space of such a model. For loci there are, in principle, population parameters and landscape parameters to contend with.

However, there are some related analyses with multiple loci [?], investigating numerically the dynamics for certain specific landscapes and initial populations. The results seen there are completely consistent with what we observe in full generality in this paper, i.e., that the benefits of recombination when not in the *search* regime are manifest in modular landscapes while, on the contrary, it is detrimental in the presence of high positive epistasis. In this paper we have also neglected the effects of mutation, whereas much previous work has been associated with studying how recombination interacts with mutation by positing Muller’s ratchet type regimes where the dynamics of beneficial or detrimental single mutations are considered in the presence of recombination. It is an important question to understand the relative benefits of mutation versus recombination in the context of the metrics that we have considered here. We will, indeed, return to that in a separate paper. However, it is first important to understand what benefits there are that are intrinsic to recombination without a comparison with mutation.

Finally, we have also restricted attention here to fixed-length sequences. We believe that the relation between recombination and modularity extends beyond this restriction, applying also to variable-length sequences and recombination-like genetic operators other than homologous recombination, for instance unequal crossing over or gene duplication.

## 7 Acknowledgements

This work was partially supported by DGAPA grant IN120509 and by a special Conacyt grant to the Centro de Ciencias de la Complejidad. DAR is grateful to the IIMAS, UNAM for use of their facilities. We are grateful to León Martínez and Michael Gaunt for discussions.

### Footnotes

- We will restrict attention here to a generational model with no overlap.
- By survival fitness, in the absence of factors such as fertility, differences in mating success etc., we mean viability, the probability to reach reproductive age, in distinction to absolute fitness which measures the overall reproductive success of a type.
- Equation (1) with the substitution of equation (3) has a long history, starting with the seminal work of Hilda Geiringer [?], who derived a version of the equation for a diploid population without selection. Versions of the equation were then rederived and discussed in [?], who used it together with Price’s theorem to discuss the performance of recombinative Genetic Algorithms, showing that schemata are a natural consequence of recombination; and in [?], where the Building Block Hypothesis was examined, along with the circumstances under which recombination leads to an increase in the effective fitness of a given genotype. The latter work also emphasized and discussed the relation to the concept of coarse graining.
- This landscape has a single optimal genotype, while all other types share the same, lower fitness. It has been used extensively in molecular evolution in the context of the Eigen model [?], where the dynamics is naturally understood in terms of quasi-species.
- By “recombination-like” we mean any genomic change in which one or more sub-sequences of one or more parental sequences are transferred to an offspring sequence. This is termed “generalized recombination” in [?] and encompasses unequal crossing over, transposition, translocation and related operations, as well as homologous recombination.
- Intuitively, these modules will be formed by contiguous loci, as is natural for an exon or a gene.
- The two masks in which both offspring loci come from a single parent correspond to cloning.
- The latter two landscape categories are known as *deceptive* landscapes of Type I and Type II respectively in the Genetic Algorithm literature [?]. It has been proved [?] that Type I systems always converge to the global optimum whereas Type II systems converge to the optimum or double mutant depending on the population and recombination probability.
- A good, although somewhat dated, review of the different potential mechanisms, and in particular Muller’s ratchet, by which recombination can be beneficial can be found in [?].
- In this case there is an initial linkage disequilibrium.
- Karlin, see for example [?] section vii, has shown that there are no stable polymorphisms in the model type considered in this paper.
- Save for certain non-generic values of the recombination probability, there are no important qualitative changes as a function of that probability.