Genealogies of two linked neutral loci after a selective sweep in a large population of stochastically varying size
Abstract.
We study the impact of a hard selective sweep on the genealogy of partially linked neutral loci in the vicinity of the positively selected allele. We consider a sexual population of stochastically varying size and, focusing on two neighboring loci, derive an approximate formula for the neutral genealogy of a sample of individuals taken at the end of the sweep. Individuals are characterized by ecological parameters depending on their genetic type and governing their growth rate and interactions with other individuals (competition). As a consequence, the ”fitness” of an individual depends on the population state and is not an intrinsic characteristic of individuals. We provide a deep insight into the dynamics of the mutant and wild type populations during the different stages of a selective sweep.
Key words and phrases:
Birth and death process; Coalescent; Coupling; Ecoevolution; Genetic hitchhiking; Selective sweep2010 Mathematics Subject Classification:
92D25, 60J80, 60J27, 92D15, 60F15.Introduction
We study the hitchhiking effect of a beneficial mutation in a sexual haploid population of stochastically varying size. We assume that a mutation occurs in one individual of a monomorphic population and that individuals carrying the new allele are better adapted to the current environment and spread in the population. We suppose that the mutant allele eventually replaces the resident one, , and study the influence of this fixation on the neutral gene genealogy of a sample taken at the end of the selective sweep. That is, in each sampled individual we consider the same set of partially linked loci including the locus where the advantageous mutation occurred. We then trace back the ancestral lineages of all loci in the sample until the beginning of the sweep and update the genetic relationships whenever a coalescence or a recombination changes the ancestry of one or several loci. Our main result is the derivation of a sampling formula for the ancestral partition of two neutral loci situated in the vicinity of the selected allele.
The first studies of hitchhiking, initiated by Maynard Smith and Haigh [20], have modeled the mutant population size as the solution of a deterministic logistic equation [16, 13, 22, 21]. Barton [1] was the first to point out the importance of the stochasticity of the mutant population size. Following this paper, a series of works took into account this randomness during the sweep. In [9, 18] Schweinsberg and Durrett based their analysis on a Moran model with selection and recombination, while Etheridge and coauthors [10] worked with the diffusion limit of such discrete population models. Then BrinkSpalink [2], Pfaffelhuber and Studeny [17] and Leocard [14] extended the respective findings of these two approaches for the ancestry of one neutral locus to the twolocus (resp. multiplelocus) case.
However, in all these models, the population size was constant and each individual had a “fitness” only dependent on its type and not on the population state. The fundamental idea of Darwin is that the individual traits have an influence on the interactions between individuals, which in turn generate selection on the different traits. In this paper we aim at modeling precisely these interactions by extending the model introduced in [19] where the author considered only one neutral locus. Such an ecoevolutionary approach has been introduced by Metz and coauthors [15] and has been made rigorous in the seminal paper of Fournier and Méléard [12]. Then it was further developed by Champagnat, Méléard and coauthors (see [3, 5, 4] and references therein) for the haploid asexual case and by Collet, Méléard and Metz [6] and Coron and coauthors [7] for the diploid sexual case.
The population dynamics, described in Section 1, is a multitype birth and death Markov process with competition. We represent the carrying capacity of the underlying environment by a scaling parameter and state results in the limit for large . In [3] it was shown that such kind of invasion processes can be divided into three phases (see Figure 2): an initial phase in which the fraction of individuals does not exceed a fixed value and where the dynamics of the wild type population is nearly undisturbed by the invading type. A second phase where both types account for a nonnegligible percentage of the population and where the dynamics of the population can be well approximated by a deterministic competitive LotkaVolterra system. And finally a third phase where the roles of the types are interchanged and the wild type population is near extinction. The durations of the first and third phases of the selective sweep are of order whereas the second phase only lasts an amount of time of order 1. This three phases decomposition is commonly encountered in population genetics models and dates back to [16].
In Section 3 we precisely describe these three phases and introduce two couplings of the population process, key tools to study the dynamics of the  and populations. Section 4 is devoted to the proofs of the main theorems on the ancestral partition of the two neutral alleles. Sections 5 to 7 are dedicated to the proofs of auxiliary statements. In Section 2 we compare our findings with previous results. Finally, we state technical results needed in the proofs in the Appendix.
1. Model and results
We consider a three locus model: one locus under selection, , with alleles in and two neighboring neutral loci and with alleles in the finite sets and respectively. We denote by the type space. Two geometric alignments are possible: either the two neutral loci are adjacent (geometry ), or they are separated by the selected locus (geometry ). We introduce the model and notations for the adjacent geometry, their analogs for the separated one can be deduced in a straightforward manner.
Whenever a reproduction event takes place, recombinations between and or between and occur independently with probabilities and , respectively. These probabilities depend on the parameter , representing the environment’s carrying capacity, but for the purpose of readability we do not indicate this dependence. We assume a regime of weak recombination:
(1.1) 
This is motivated by Theorem 2 in [19] which states that this is the good scale to observe a signature on the neutral allele distribution. If the recombination probabilities are larger (neutral loci more distant from the selected locus), there are many recombinations and the sweep does not modify the neutral diversity at these sites. Recombinations may lead to a mixing of the parental genetic material in the newborn, and hence, parents with types and in can generate the following offspring:
We will see in the sequel that the probability to witness a birth event with two simultaneous recombinations in the neutral genealogy of a uniformly chosen individual is very small.
As we assume the loci and to be neutral, the ecological parameters of an individual only depend on the allele at the locus under selection. Let us denote by the fertility of an individual with type . In the spirit of [6], such an individual gives birth at rate (female role), and has a probability proportional to to be chosen as the father in a given birth event (male role). Denoting the complementary type of the allele by we get the following result for the birth rate of individuals of type :
(1.2) 
where (resp. ) denotes the current number of individuals (resp. individuals) and is the current state of the population. An individual can die either from a natural death (rate ), or from typedependent competition: the parameter models the impact an individual of type has on an individual of type , where . The strength of the competition also depends on the carrying capacity . This results in the total death rate of individuals carrying the alleles :
(1.3) 
Hence the population process
where denotes the number of individuals at time , is a multitype birth and death process with rates given in (1.2) and (1.3). We will often work with the trait population process , where denotes the number of individuals at time . This is also a birth and death process with birth and death rates given by:
(1.4)  
As a quantity summarizing the advantage or disadvantage a mutant with allele type has in an population at equilibrium, we introduce the socalled invasion fitness through
(1.5) 
where the equilibrium density is defined by
(1.6) 
The role of the invasion fitness and the definition of the equilibrium density follow from the properties of the twodimensional competitive LotkaVolterra system:
(1.7) 
If we assume
(1.8) 
then is the equilibrium size of a monomorphic population and the system (1.7) has a unique stable equilibrium and two unstable steady states and . Thanks to Theorem 2.1 p. 456 in [11] we can prove that if and are of order and is large, the rescaled process is very close to the solution of (1.7) during any finite time interval. The invasion fitness corresponds to the per capita initial growth rate of the mutant when it appears in a monomorphic population of individuals at their equilibrium size . Hence the dynamics of the allele is very dependent on the properties of the system (1.7) and it is proven in [3] that under Condition (1.8) one mutant has a positive probability to fix in the population and replace a wild type . More precisely, if we use the convention
(1.9) 
Equation (39) in [3] states that
(1.10) 
where is called the rescaled invasion fitness, and the extinction time of the population and the event of fixation of the allele are rigorously defined as follows:
(1.11) 
From this point onward, we fix in . We aim at quantifying the effect of the selective sweep on the neutral diversity. Our method consists in tracing back the neutral genealogies of individuals sampled uniformly at the end of the sweep (time ) until time . Two event types (see Definition 4.1) may affect the relationships of the sampled neutral alleles: coalescences correspond to the merging of the neutral genealogies of two individuals at one or two neutral loci, and recombinations redistribute the selected and neutral alleles of one individual into two groups carried by its two parents. We will represent the neutral genealogies by a partition which belongs to the set of marked partitions of with (at most) one block distinguished by the mark , which will correspond to the descendants of the original mutant . In this notation and are the neutral alleles at loci and of the th sampled individual. Let us define rigorously the random partition :
Definition 1.1.
Sample individuals uniformly and without replacement at the end of the sweep (time ). Follow the genealogies of the first and second neutral alleles of the th sampled individual, and for . Then the partition is defined as follows: each block of the partition is composed of all those neutral alleles which originate from the same individual alive at the beginning of the sweep; the block containing the descendants of the mutant (if such a block exists) is distinguished by the mark .
We will show in Theorems 1 and 2 that when is large the partition belongs with a probability close to one to a subset of , which is defined as follows:
Definition 1.2.
is the subset of consisting of those partitions whose unmarked blocks (if there are any) are either singletons or pairs of the form for one .
Example 1.
In the example represented in Figure 1, the marked partition belongs to :
For a partition , we define for some possible ancestral relationships the number of individuals in the sample whose two neutral loci are related in that particular way:
Definition 1.3.
Let and . Then we set:

such that and belong to the marked block

such that belongs to the marked block and is an unmarked block

such that belongs to the marked block and is an unmarked block

such that is an unmarked block

such that and are two distinct unmarked blocks
To express the limit distribution of the partition we need to introduce:
(1.12) 
where the invasion fitnesses have been defined in (1.5). We did not make any assumption on the sign of , but can be written in the form for so that it is well defined and nonnegative. It is easy to check that . The forms of , and are intuitive (see comments of Proposition 1).The form of is more complex to explain and results from a combination of different possible genealogical scenarios during the first phase. We now define five nonnegative numbers which will quantify the law of for large in Theorem 1:
(1.13)  
Note that . Finally, we introduce an assumption which summarizes all the assumptions made in this work:
Assumption 1.
Theorem 1 (Geometry ).
Under Assumption 1, we have for every
Notice that when is large, belongs to with a probability close to one, and that
is a probability on (depending on ). Moreover, this result implies that the sampled individuals have asymptotically independent neutral genealogies. With high probability, the neutral alleles of a given sampled individual either originate from the first mutant and belong to the marked block, or escape the sweep and originate from an individual. In this case they belong to an unmarked block which is of the form , or , according to Definition 1.3. As a consequence, if some neutral alleles of two distinct sampled individuals escape the sweep, they originate from distinct individuals with high probability. However, the genealogies of the two neutral alleles of a given individual are not independent. For example the probability that and escape the sweep is ; the probability that (resp. ) escapes the sweep is (resp. ), and for every such that
This is due to the fact that if (backwards in time) a recombination first occurs between and , the neutral allele at , linked to , also escapes the sweep.
As the term does not tend to when goes to infinity under Condition (1.1), the only possibility to have an equality in the limit is the case where
or in other words when the probability to see a recombination between and is negligible.
Let us now consider the separated geometry, :
Theorem 2 (Geometry ).
Under Assumption 1, we have for every
Again the neutral genealogies of the sampled individuals are asymptotically independent. Furthermore, we have independence between the neutral loci. Indeed Theorem 2 means that a neutral allele at locus escapes the sweep with probability independently of all other neutral alleles, including the allele at the other neutral locus of the same individual. This is due to the fact that in the separated geometry a recombination between and one neutral locus has no impact on the genetic background of the allele at the other neutral locus. Note in particular that there is no block of the form in the limit partition, as the two neutral alleles have a very small probability to recombine at the same time.
2. Comparison with previous work
In [18] the authors gave an approximate sampling formula for the genealogy of one neutral locus
during a selective sweep. The population evolved as a twolocus modified Moran model
with recombination, selection, and in particular constant population size.
They introduced the fitness of the mutant as follows: when one of the iid exponential clocks
of the living individuals rings, one picks two individuals uniformly at random (with replacement),
one dies, and the other one gives birth. A replacement of an individual by an
individual is rejected with probability . In this case, nothing happens.
In [19], the author studied the one neutral locus version of the here presented model.
It was shown that the ancestral relationships in a sample taken at the end of the sweep correspond to
the ones derived in [18]
when we equal the fitness of [18] and the rescaled invasion fitness
and when we have the equality (in this case the first and third phases have the same duration,
).
In [2], the author generalized the model introduced in [18] towards two neutral loci and used
similar methods to derive a corresponding statement for the genealogy of a sample taken at the end of the sweep.
If we however make the analogous comparison and try to
match our result for the adjacent geometry with the statement from [2],
we observe an interesting phenomenon:
the probabilities of the different types of ancestry only coincide if the birth rates of  and individuals are the same,
that is, if .
In biology, the fitness describes the ability to both survive and reproduce, and can be defined by the average contribution of an individual with a given genotype to the gene pool of
the next generation. Hence a mutation which affects the fitness of an individual
in a given environment can either act on the fertility ( in our model), or on the death rate,
intrinsic () or by competition (), or on both.
Our result is comparable to that of [18] if the mutation only affects the death rate
(and still if ).
In [17], instead of a birth and death process, the authors modeled the population with a structured coalescent. It is shown that this process can be approximated by a marked Yule tree where the different marks are realized by Poisson processes and indicate a recombination of one or two loci into the wild type background. The impact of the third phase is taken into account by a certain refinement prior to the beginning of the coalescent which leads to the same effect of splitting of the two neutral loci as it is seen here. We again find similarities with our results when . In contrast, the techniques and precision used in [17] yield that coalescent events with individuals cannot be ignored, that is, there are neutral loci of different individuals from the sample which have the same typeancestor. The structure of the sample is therefore different from our results here. Notice that it is also the case of the second approximate sampling formula stated in [18], which is more precise than the first one.
3. Dynamics of the sweep and couplings
3.1. Description of the three phases
We only need to focus on the trajectories of the population process where the mutant allele goes to fixation and replaces the resident allele . Champagnat has described these trajectories in [3] and in particular divided the sweep into three phases with distinct  and population dynamics (see Figure 2). In the sequel, will be a positive real number independent of , as small as needed for the different approximations to hold. Moreover, from this point onward we will write (resp. ) instead of (resp. ) and instead of for the sake of readability.
First phase
The resident population size stays close to its equilibrium value as long as the mutant population size has not hit : if we introduce the finite subset of
(3.1) 
and the stopping times and , which denote respectively the hitting time of by the mutant population size and the exit time of by the resident population size,
(3.2) 
then we can deduce from [3] (see Equations (A.5) and (A.6) in [19] for the details of the derivation) that the events , and are very close:
(3.3) 
for a finite and small enough, where we recall convention (1.9). In this context, is the symmetric difference: for two sets and , . From this point onwards, ”first phase” will denote the time interval when the population size is smaller than .
Second phase
When and are of order , the rescaled population process is well approximated by the LotkaVolterra system (1.7). Moreover, under Condition (1.8) the system (1.7) has a unique attracting equilibrium for initial conditions satisfying , where has been defined in (1.6). In particular, if we introduce for the notation,
(3.4) 
then Theorem 3 (b) in [3] implies:
(3.5) 
for every , where
(3.6)  
(3.7)  
In the sequel, ”second phase” will denote the time interval when the population process is close to the solution of the system (1.7).
Third phase
Equation (3.5) also implies that
(3.8) 
where
(3.9) 
The ”third phase”, which corresponds to the time interval , can be seen as the symmetric counterpart of the first phase, where the roles of and are interchanged: during the extinction of the population, the population size stays close to its equilibrium value .
Let us introduce the positive real number and the finite subset of
(3.10) 
The times and are two stopping times for the process restarted after the second phase and denote respectively the hitting times of by the population for , and the exit time of by the population during the third phase,
(3.11) 
If we define the event
(3.12) 
we get from the proof of Lemma 3 in [3] that for a finite and small enough,
(3.13) 
To summarize, the fixation event is very close to the following succession of events:

The population size hits before the population size has escaped the vicinity of its equilibrium (first phase)

The rescaled population process is close to the deterministic competitive LotkaVolterra system during the second phase

The population size gets extinct before hitting and before the population size has escaped the vicinity of its equilibrium (third phase)
3.2. Couplings for the first and third phases
We are interested in the law of the neutral genealogies on the event . Equations (3.3) and (3.13) imply that it is enough to concentrate our attention on the event , but the dynamics of the population process conditionally on this event is complex to study. Indeed it boils down to studying the dynamics of a process conditioned on a future event ( for the first phase and for the third one). Hence the idea is to couple the population process with two processes, and , whose laws are easier to study. These processes will satisfy:
(3.14) 
and
(3.15) 
Let be in and be in . Denote the component of the population state:
(3.16) 
where is the canonical basis of . We are now able to introduce a process needed to describe the couplings:
Definition 3.1.
We denote by Moran process of type with recombination a process with values in , initial state , and the following dynamics:

After an exponential time with parameter we pick uniformly and with replacement three individuals and draw a Bernoulli variable with parameter

The first individual dies, the second one gives birth to an individual carrying its alleles at loci and , the third one is the potential second parent

If , there is no recombination and the allele at locus of the newborn is also inherited from the second individual; if there is a recombination between and and the newborn inherits its second neutral allele from the third individual

We again draw an exponential variable with parameter and restart the procedure
Coupling with : and are equal up to time ; after this time the individuals in the population process follow a Moran process with recombination independent of the individuals. Let be in . We let the population evolve as if the individuals were interacting with individuals with genotype :
(3.17) 
where has been defined in Definition 3.1 and are independent Poisson Point processes with density , also independent of . The reason for the construction of such a coupling is that we need to control the population size and the number of births of individuals during the first phase in Section 5. With the process such control is achieved easier.
Coupling with : we assume that from (3.12) holds; and are equal up to time . Then the individuals in the population process follow a Moran process with recombination independent of the individuals, and each population evolves as a birth and death process with individual birth and death rates and , independent of the individuals and the populations with :
(3.18) 
where has been defined in Definition 3.1 and is independent of the sequence of
independent Poisson
measures , with intensity .
The population size and the number of
births of individuals will be easy to control for the process during the third phase, and again we will need such control in
Section 5.
Inequality (3.14) follows from (3.3). Moreover, from the proof of Lemma 3 in [3] we know that
for a finite and small enough. Adding (3.13) we get that (3.15) is also satisfied. Hence we will study the processes and and deduce properties of the dynamics of the process during the first and third phases.
4. Proofs of the main results
As the proof of Theorem 2 is simpler than this of Theorem 1, and follows essentially the same ideas, we only prove Theorem 1.
4.1. Events impacting the genealogies in each phase
Let us now summarize the results on the genealogies for the three successive phases of the sweep that we will derive in
Sections 6 and 7.
First phase: As explained in the previous section, we work with the process to study the first phase. Let us introduce the jump times of :
(4.1) 
The number of jumps during the first phase is denoted by :
(4.2) 
Coalescence and recombination events are defined as follows (see Figure 3):
Definition 4.1.
Sample two distinct individuals at time and denote and their type.
We say that and coalesce at time if they are carried by two distinct individuals at time and by the same individual at time . Seen forwards in time it corresponds to a birth and hence a copy of the neutral allele. Seen backwards in time it corresponds to the fusion of two neutral alleles into one, carried by one parent of the newborn. We define in the same way coalescent events at locus (resp. loci and ) for alleles and (resp. allele pairings and ).
We say that (and/or ) recombines at time from the  to the population if the individual carrying the allele (and/or ) at time is a newborn, carries the allele inherited from it first parent, and has inherited its allele (and/or ) from a different individual carrying allele .
We are only interested in recombinations which entail new associations of alleles. In particular we will not consider the simultaneous recombinations of a pair within the population.
Let us now describe the genealogical scenarios which modify the ancestral relationships between the neutral alleles of one individual and occur with positive probability when is large. We first focus on the first phase and pick uniformly an individual from the population at time . We introduce: