This contribution is concerned with mathematical models for the dynamics of the genetic composition of populations evolving under recombination. Recombination is the genetic mechanism by which two parent individuals create the mixed type of their offspring during sexual reproduction. The corresponding models are large, nonlinear dynamical systems (for the deterministic treatment that applies in the infinite-population limit), or interacting particle systems (for the stochastic treatment required for finite populations). We review recent progress on these difficult problems. In particular, we present a closed solution of the deterministic continuous-time system, for the important special case of single crossovers; we extract an underlying linearity; we analyse how this carries over to the corresponding stochastic setting; and we provide a solution of the analogous deterministic discrete-time dynamics, in terms of its generalised eigenvalues and a simple recursion for the corresponding coefficients.
Deterministic and stochastic aspects of single-crossover
and stochastic aspects
of single-crossover recombination Baake]Ellen Baake \contact[firstname.lastname@example.org]Faculty of Technology, Bielefeld University, 33594 Bielefeld, Germany \theoremstyledefinition
rimary 92D10, 34L30; Secondary 37N25, 06A07, 60J25.
opulation genetics, recombination dynamics, Möbius linearisation and diagonalisation, correlation functions, Moran model.
Biological evolution is a complex phenomenon driven by various processes, such as muation and recombination of genetic material, reproduction of individuals, and selection of favourable types. The area of population genetics is concerned with how these processes shape and change the genetic structure of populations. Mathematical population genetics was founded in the 1920’s by Ronald Fisher, Sewall Wright, and John Haldane, and thus is among the oldest areas of mathematical biology. The reason for its continuing (and actually increasing) attractiveness for both mathematicians and biologists is at least twofold: Firstly, there is a true need for mathematical models and methods, since the outcome of evolution is impossible to predict (and, thus, today’s genetic data are impossible to analyse) without their help. Second, the processes of genetics lend themselves most naturally to a mathematical formulation and give rise to a wealth of fascinating new problems, concepts, and methods.
This contribution will focus on the phenomenon of recombination, in which two parent individuals are involved in creating the mixed type of their offspring during sexual reproduction. The essence of this process is illustrated in Fig. 1 and may be idealised and summarised as follows.
Genetic information is encoded in terms of sequences of finite length. Eggs and sperm (i.e., female and male germ cells or gametes) each carry a single copy of such a sequence. They go through the following life cycle: At fertilisation, two gametes meet randomly and unite, thus starting the life of a new individual, which is equipped with both the maternal and the paternal sequence. At maturity, this individual will generate its own germ cells. This process includes recombination, that is, the maternal and paternal sequences perform one or more crossovers and are cut and relinked accordingly, so that two ‘mixed’ sequences emerge. These are the new gametes and start the next round of fertilisation (by random mating within a large population).
Models of this process aim at describing the dynamics of the genetic composition of a population that goes through this life cycle repeatedly. These models come in various flavours: in discrete or continous time; with various assumptions about the crossover pattern; and, most importantly, in a deterministic or a stochastic formulation, depending on whether the population is assumed to be so large that stochastic fluctuations may be neglected. In any case, however, the resulting process appears difficult to treat, due to the large number of possible states and the nonlinearity generated by the random mixture of gametes. Nevertheless, a number of solution procedures have been discovered for the deterministic discrete-time setting [8, 10, 14], and the underlying mathematical structures were investigated within the framework of genetic algebras, see [18, 19, 22]. Quite generally, the solution relies on a certain nonlinear transformation (known as Haldane linearisation) from (gamete or type) frequencies to suitable correlation functions, which decouple from each other and decay geometrically. But if sequences of more than three sites are involved, this transformation must be constructed via recursions that involve the parameters of the recombination process, and is not available explicitly except in the trivial case of independent sites. For a review of the area, see [9, Ch. V.4].
In this contribution, we concentrate on a special case that is both biologically and mathematically relevant, namely, the situation in which at most one crossover happens at any given time. That is, only recombination events may occur that partition the sites of a sequence into two parts that correspond to the sites before and after a given crossover point. We analyse the resulting models in continuous time (both deterministic and stochastic), as well as in discrete time. For the deterministic continuous-time system (Section 2), a simple explicit solution can be given. This simplicity is due to some underlying linearity; actually, the system may even be diagonalised (via a nonlinear transformation). In Section 3, we consider the corresponding stochastic process (still in continuous time), namely, the Moran model with recombination. This also takes into account the resampling effect that comes about via random reproduction in a finite population. In particular, we investigate the relationship between the expectation of the Moran model and the solution of the deterministic continuous-time model. We finally tackle deterministic single-crossover dynamics in discrete time (Section 4). This setting implies additional dependencies, which become particularly transparent when the so-called ancestral recombination process is considered. A solution may still be given, but its coefficients must be determined recursively.
Altogether, it will turn out that the corresponding models, and their analysis, have various mathematical facettes that are intertwined with each other, such as differential equations, probability theory, and combinatorics.
2 Deterministic dynamics, continuous time
2.1 The model
We describe populations at the level of their gametes and thus identify gametes with individuals. Their genetic information is encoded in terms of a linear arrangement of sites, indexed by the set . For each site , there is a set of ‘letters’ that may possibly occur at that site. To allow for a convenient notation, we restrict ourselves to the simple but important case of finite sets ; for the full generality of arbitrary locally compact spaces , the reader is referred to  and .
A type is thus defined as a sequence
where is called the type space. By construction, is the -th coordinate of , and we define as the collection of coordinates with indices in , where is a subset of . A population is identified with a non-negative measure on . Namely, denotes the frequency of individuals of type and for ; we abbreviate as . The set of all nonnegative measures on is denoted by . If we define as the point measure on (i.e., for ), we can also write . We may, alternatively, interpret as the basis vector of that corresponds to (where a suitable ordering of types is implied, and is the number of elements in ); is thus identified with a vector in .
At this stage, frequencies need not be normalised; may simply be thought of as the size of the subpopulation of type , measured in units so large that it may be considered a continuous quantity. The corresponding normalised version (where is the total population size) is then a probability distribution on , and may be identified with a probability vector.
Recombination acts on the links between the sites; the links are collected into the set . We shall use Latin indices for the sites and Greek indices for the links, and the implicit rule will always be that is the link between sites and ; see Figure 2.
Let recombination happen in every individual, and at every link , at rate . More precisely, for every , every individual exchanges, at rate , the sites after link with those of a randomly chosen partner. Explicitly, if the ‘active’ and the partner individual are of types and , then the new pair has types and , where is the largest integer below (the smallest above ); see Fig. 3. Since every individual can occur as either the ‘active’ individual or as its randomly chosen partner, we have a total rate of for crossovers at link . For later use, we also define .
In order to formulate the corresponding model, let us introduce the projection operators , , via
i.e., is the canonical projection to the -th coordinate. Likewise, for any index set , one defines a projector
We shall frequently use the abbreviations and , as well as , . The projectors and may be thought of as cut and forget operators because they take the leading or trailing segment of a sequence , and forget about the rest.
Whereas the act on the types, we also need the induced mapping at the level of the population, namely,
where denotes the preimage under . The operation (where the dot is on the line) is the ‘pullback’ of w.r.t. ; so, is the marginal distribution of with respect to the sites in . In particular, is the marginal frequency of sequences prescribed at the sites before , and vice versa for the sites after .
Now, single-crossover recombination (at the level of the population) means the relinking of a randomly chosen leading segment with a randomly chosen trailing segment. We therefore introduce (elementary) recombination operators (or recombinators, for short), for , defined by
Here, the tensor product reflects the independent combination (i.e., the product measure) of the two marginals and . is therefore a cut and relink operator. may be understood as the population that emerges if all individuals of the population disintegrate into their leading and trailing segments and these are relinked randomly. Note that .
The recombination dynamics may thus be compactly written as
where is the identity operator. Note that (4) is a large, nonlinear system of ordinary differential equations (ODEs).
2.2 Solution of the ODE system
The solution of (4) relies on some elementary properties of our recombinators. Most importantly, they are idempotents and commute with each other, i.e.,
These properties are intuitively plausible: if the links before are already independent of those after due to a previous recombination event, then further recombination at that link does not change the situation; and if a product measure is formed with respect to two links and , the result does not depend on the order in which the links are affected. For the proof, we refer to [5, Prop. 2]; let us only mention here that it relies on the elementary fact that, for ,
that is, recombination at or after does not affect the marginal frequencies at sites before , and vice versa.
We now define composite recombinators as
Here, the product is to be read as composition; it is, indeed, a product if the recombinators are written in terms of their multilinear matrix representations, which is available in the case of finite types considered here (see ). By property (6), the order in the composition plays no role. Furthermore, (5) and (6) obviously entail for .
With this in hand, we can now state an explicit solution of our problem, namely,
For the proof, the reader is referred to [5, Thm. 2] or [3, Thm. 3] (the former article contains the original, the latter a shorter and more elegant version of the proof). Let us note that the coefficient functions can be interpreted probabilistically. Given an individual sequence in the population, is the probability that the set of links that have seen at least one crossover event until time is precisely the set (obviously, ). Note that the product structure of the implies independence of links, a decisive feature of the single-crossover dynamics in continuous time, as we shall see later on. Note also that, as , converges to the stationary state
in which all sites are independent.
2.3 Underlying linearity
The simplicity of the solution in Theorem 1 comes as a certain surprise. After all, explicit solutions to large, nonlinear ODE systems are rare – they are usually available for linear systems at best. For this reason, the recombination equation and its solution have already been taken up in the framework of functional analysis, where they have led to an extension of potential theory . We will now show that there is an underlying linear structure that is hidden behind the solution. It can be stated as follows, compare [5, Sec. 3.2] for details.
Let be a family of non-negative functions with , valid for any partition of the set and all , where . Assume further that these functions satisfy for any and . If and , one has the identity
which is then satisfied for all . \qed
Here, the upper index specifies the respective set of links. Clearly, the coefficient functions of Theorem 1 satisfy the conditions of Theorem 2. The result then means that the recombinators act linearly along the solutions (7) of the recombination equation (4). Theorem 2 thus has the consequence that, on , the forward flow of commutes with all recombinators, that is, for all and all .
But let us go one step further here. The conventional approach to solve the recombination dynamics consists in transforming the type frequencies to certain functions (known as principal components) that diagonalise the dynamics, see [8, 10, 18] and references therein for more. We will now show that, in continuous time, they have a particularly simple structure: they are given by certain correlation functions, known as linkage disequilibria (LDE) in biology, which play an important role in applications. They have a counterpart at the level of operators (on ). Namely, let us define LDE operators via
where the underdot indicates the summation variable. Note that maps into , the set of signed measures on . Eq. (9) leads to the inverse by the combinatorial Möbius inversion formula, see [1, Thm. 4.18]. We then have
If is the solution (7), the transformed quantities satisfy
See [5, Sec. 3.3].
Obviously, Eq. (10) is a system of decoupled, linear, homogeneous differential equations with the usual exponential solution. Note that this simple form emerged through the nonlinear transform (9) as applied to the solution of the coupled, nonlinear differential equation (4).
Suitable components of the signed measure may then be identified to work with in practice (see [5, 6] for details); they correspond to correlation functions of all orders and decouple and decay exponentially. These functions turn out to be particularly well-adapted to the problem since they rely on ordered partitions, in contrast to conventional LDE’s used elsewhere in population genetics, which rely on general partitions (see [9, Ch. V.4] for review).
3 Stochastic dynamics, continuous time
3.1 The model
The effect of finite population size in population genetics is, in continuous time, well captured by the Moran model. It describes a population of fixed size and takes into account the stochastic fluctuations due to random reproduction, which manifest themselves via a resampling effect (known as genetic drift in biology). More precisely, the finite-population counterpart of our deterministic model is the Moran model with single-crossover recombination. To simplify matters (and in order to clearly dissect the individual effects of recombination and resampling), we shall use the decoupled (or parallel) version of the model, which assumes that resampling and recombination occur independently of each other, as illustrated in Fig. 4. More precisely, in our finite population of fixed size , every individual experiences, independently of the others,
resampling at rate . The individual reproduces, the offspring inherits the parent’s type and replaces a randomly chosen individual (possibly its own parent).
recombination at (overall) rate at link . Every individual picks a random partner (maybe itself) at rate , and the pair exchanges the sites after link . That is, if the recombining individuals have types and , they are replaced by the two offspring individuals and , as in the deterministic case, and Fig. 3. As before, the per-capita rate of recombination at link is then , because both orderings of the individuals lead to the same type count in the population.
Note that the randomly chosen second individual (for resampling or recombination) may be the active individual itself; then, effectively, nothing happens. One might, for biological reasons, prefer to exclude these events by sampling from the remaining population only; but this means nothing but a change of time scale of order .
To formalise this verbal description of the process, let the state of the population at time be given by the collection (the random vector)
where is the number of individuals of type at time ; clearly, . We also use in the sense of a (random counting) measure, in analogy with (but keep in mind that is integer-valued and counts single individuals, whereas denotes continuous frequencies in an infinite population). The letter will be used to denote realisations of — but note that the symbols , and are not on equal footing ( and will continue to be types). The stochastic process is the continuous-time Markov chain on defined as follows. If the current state is , two types of transitions may occur:
(where is the point measure on , as before). Note that, in (12) and (12), transitions that leave are automatically excluded by the fact that the corresponding rates vanish. On the other hand, ‘empty transitions’ ( or ) are explicitly included (they occur if in resampling or recombination, and if or in recombination).
3.2 Connecting stochastic and deterministic models
Let us now explore the connection between the stochastic process on , its normalised version on , and the solution (Eq. (7)) of the differential equation. It is easy to see (and no surprise) that
with of (4). But this does not, per se, lead to a ‘closed’ differential equation for , because it is not clear whether can be written as a function of alone—after all, is nonlinear. In the absence of resampling, however, we have
See [6, Thm. 1 and Cor. 1]. The result again points to some underlying linearity, which, in the context of the stochastic model, should be connected to some kind of independence. Indeed, the key to the proof of Thm. 4 is a lemma concerning the independence of marginal processes. For , we introduce the ‘stretch’ of as
and look at the projection of the recombination process on non-overlapping stretches. This is the content of
Let be the recombination process without resampling (i.e., ). Let with . Then, and are conditionally (on ) independent Markov chains on and .
See [6, Lemma 1].
Let us now re-include resampling, at rate , and consider the stochastic process defined by both (12) and (12), where we add the upper index here to indicate the dependence on . Now, Lemma 5 and Thm. 4 are no longer valid. The processes and are still individually Markov, but their resampling events are coupled (replacement of by is always tied to replacement of by ). Hence the marginal processes fail to be independent, so that no equivalent of Lemma 5 holds.
Let us, therefore, change focus and consider the normalised version . In line with general folklore in population genetics, in the limit , the relative frequencies cease to fluctuate and are then given by the solution of the corresponding deterministic equation. More precisely, we have
Note that the convergence in (14) applies for any given , but need not carry over to . Indeed, if resampling is present, the population size required to get close to the deterministic solution is expected to grow over all bounds with increasing . This is because, for every finite , the Moran model with resampling and recombination is an absorbing Markov chain, which leads to fixation (i.e., to a homogeneous population of uniform type) in finite time with probability one (for the special case of just two types without recombination, the expected time is known to be of order if the initial frequencies are both [13, p. 93]). In sharp contrast, the deterministic system never loses any type, and the stationary state, the complete product measure (8), is, in a sense, even the most variable state accessible to the system. For increasing , finite populations stay close to the deterministic limit for an increasing length of time.
4 Discrete time
Let us return to the deterministic setting and consider the discrete-time version of our single-crossover dynamics (4), that is,
Here, the coefficients , , are the probabilities for a crossover at link in every generation (as opposed to the rates of the continuous-time setting). Consequently, we must have .
Based on the result for the continuous-time model, the solution is expected to be of the form
with non-negative , , , describing the (still unknown) coefficient functions arising from the dynamics. This representation of the solution was first stated by Geiringer . The coefficient functions will have the same probabilistic interpretation as the corresponding in the continuous-time model, so that is the probability that the links that have been involved in recombination until time are exactly those of the set .
But there is a crucial difference. Recall that, in continuous time, single crossovers imply independence of links, which is expressed in the product structure of the (see Thm. 1). This independence is lost in discrete time, where a crossover event at one link forbids any other cut at other links in the same time step. It is therefore not surprising that a closed solution is not available in this case. It will, however, turn out that a solution can be stated in terms of the (generalised) eigenvalues of the system (which are known explicitly), together with coefficients to be determined via a simple recursion. But it is rewarding to take a closer look at the dynamics first.
Let us introduce the following abbreviations:
and, for each ,
Furthermore, we set . The dynamics (15) is then reflected in the following dynamics of the coefficient functions:
For all and , the coefficient functions evolve according to
with initial condition . \qed
The above iteration is easily understood intuitively: A type resulting from recombination at link is composed of two segments and . These segments themselves may have been pieced together in previous recombination events already, and the iteration explains the possible cuts these segments may carry along. The first term in the product stands for the type delivering the leading segment (which may bring along arbitrary cuts in the trailing segment), the second for the type delivering the trailing one (here any leading segment is allowed). The term covers the case of no recombination.
Let us now have a closer look at the structure of the dependence between links in discrete time. To this end, note first that the set with partitions into , where
Cutting all links in decomposes the original system (of sites and links) into subsystems which are independent of each other from then on. In particular, the links in become independent of those in , for . The probability that none of these subsystems experiences any further recombination is
A most instructive way to detail the effect of dependence is the ancestral recombination process: start from an individual in the present population, let time run backwards and consider how this individual’s type has been pieced together from different fragments in the past. In the four-sites example of Fig. 5, the probability that exactly and have been cut reads
Here, the first (second) term corresponds to the possibility that link () is the first to be cut. Obviously, the two possibilities are not symmetric: If is the first to break, an additional factor of is required to guarantee that, at the time of the second recombination event (at ), the trailing segment (sites 2 and 3) remains intact while the leading segment (sites 0 and 1) is cut.
Despite these complications, the discrete-time dynamics can again be solved, even directly at the level of the , albeit slightly less explicitly than in continuous time. Indeed, it may be shown (and will be detailed in a forthcoming paper) that the coefficient functions have the form
where the upper index has again been added to indicate the dependence on the system. The coefficents () are defined recursively as follows. For ,
Together with the initial value , this may be solved recursively.
5 Concluding remarks and outlook
The results presented here can naturally only represent a narrow segment from a large area with lively recent and current activities. Let us close this contribution by mentioning some important further directions in the context of recombination.
Our restriction to single crossovers provided a starting point with a uniquely transparent structure (mainly due to the independence of links in continuous time). However, arbitrary recombination patterns (which partition the set of links into two arbitrary parts) can also be dealt with, as has been done for the deterministic case in [18, 10]. The underlying mathematical structure will be further investigated in a forthcoming paper, for both the deterministic and the stochastic models.
Above, genetic material was exchanged reciprocally at recombination events, so that the length of each sequence remains constant. But sequences may also shift relative to each other before cutting and relinking (so-called unequal crossover), which entails changes in length, see  and references therein for more.
The most important aspect of modern population genetics is the backward-in-time point of view. This is natural because evolution is mainly a historical science and today’s researchers try to infer the past from samples of individuals taken from present-day populations. We have hinted at this with our version of an ancestral recombination process, but would like to emphasise that this is only a toy version. The full version of this process also takes into account resampling (as in Sec. 3, with ) and aims at the law of genealogies of samples from finite populations. This point of view was introduced by Hudson . The fundamental concept here is the ancestral recombination graph: a branching-coalescing graph, where branching (backwards in time) comes about by recombination (as in Fig. 5), but lines may also coalesce where two individuals go back to a common ancestor (this corresponds to a resampling event forward in time). For recent introductions into this topic, see [11, Chap. 3], [15, Chap. 5], or [23, Chap. 7]; these texts also contain overviews of how recombination may be inferred from genomic datasets.
Last not least, recombination and resampling are but two of the various processes that act on genes in populations. Further inclusion of mutation and/or selection leads to a wealth of challenging problems, whose investigation has stimulated the exploration of new mathematical structures, concepts, and methods; let us only mention , , and  as recent examples. This development is expected to continue and intensify in the years to come – not least because it concerns the processes that have shaped present-day genomes.
-  M. Aigner, Combinatorial Theory, Springer, Berlin, 1979. Reprint 1997.
-  E. Baake, Mutation and recombination with tight linkage, J. Math. Biol. 42 (2001), 455–488.
-  M. Baake, Recombination semigroups on measure spaces, Monatsh. Math. 146 (2005), 267–278 and 150 (2007), 83–84 (Addendum). arXiv:math.CA/0506099.
-  M. Baake, Repeat distributions from unequal crossovers, Banach Center Publ. 80 (2008), 53–70. arXiv:0803.1270.
-  M. Baake and E. Baake, An exactly solved model for mutation, recombination and selection, Canad. J. Math. 55 (2003), 3–41 and 60 (2008), 264–265 (Erratum). arXiv:math.CA/0210422.
-  E. Baake and I. Herms, Single-crossover dynamics: Finite versus infinite populations, Bull. Math. Biol. 70 (2008), 603–624. arXiv:q-bio/0612024.
-  N. H. Barton, A. M. Etheridge, and A. K. Sturm, Coalescence in a random background, Ann. Appl. Prob. 14 (2004), 754-785. arXiv:math/0406174.
-  J. Bennett, On the theory of random mating, Ann. Hum. Genet. 18 (1954), 311–317.
-  R. Bürger, The Mathematical Theory of Selection, Recombination, and Mutation, Wiley, Chichester, 2000.
-  K. Dawson, The evolution of a population under recombination: How to linearize the dynamics, Linear Algebra Appl. 348 (2002), 115–137.
-  R. Durrett, Probability Models for DNA Sequence Evolution, 2nd ed., Springer, New York, 2008.
-  S. N. Ethier and T. G. Kurtz, Markov Processes – Characterization and Convergence, Wiley, New York, 1986. Reprint 2005.
-  W. Ewens, Mathematical Population Genetics, Springer, Berlin, 2nd ed., 2004.
-  H. Geiringer, On the probability theory of linkage in Mendelian heredity, Ann. Math. Statist. 15 (1944), 25–57.
-  J. Hein, M. H. Schierup, and C. Wiuf, Gene Genealogies, Variation and Evolution, Oxford University Press, Oxford, 2005.
-  R. R. Hudson, Properties of a neutral allele model with intragenic recombination, Theor. Pop. Biol. 23 (1983), 183–201.
-  P.A. Jenkins and Y.S. Song, An asymptotic sampling formula for the coalescent with recombination, Ann. Appl. Prob., in press.
-  Y. Lyubich, Mathematical Structures in Population Genetics, Springer, New York, 1992.
-  D. McHale and G. Ringwood, Haldane linearization of baric algebras, J. London Math. Soc. 28 (1983), 17–26.
-  P. Pfaffelhuber, B. Haubold, and A. Wakolbinger, Approximate genealogies under genetic hitchhiking, Genetics 174 (2006), 1995-2008.
-  E. Popa, Some remarks on a nonlinear semigroup acting on positive measures, in: O. Carja and I. I. Vrabie, eds., Applied Analysis and Differential Equations, pp. 308–319, World Scientific, Singapore, 2007.
-  G. Ringwood, Hypergeometric algebras and Mendelian genetics, Nieuw Archief v. Wiskunde 3 (1985), 69–83.
-  J. Wakeley, Coalescent Theory. Roberts, Greenwood Village, CO, 2009.
-  U. von Wangenheim, E. Baake, and M. Baake, Single-crossover dynamics in discrete time, J. Math. Biol., in press. arXiv:0906.1678.