A large offspring number diploid biparental multilocus population model of Moran type is our object of study. At each timestep, a pair of diploid individuals drawn uniformly at random contribute offspring to the population. The number of offspring can be large relative to the total population size. Similar ‘heavily skewed’ reproduction mechanisms have been considered by various authors recently, cf. e.g. Eldon and Wakeley (2006, 2008), and reviewed by Hedgecock and Pudovkin (2011). Each diploid parental individual contributes exactly one chromosome to each diploid offspring, and hence ancestral lineages can only coalesce when in distinct individuals. A separation of timescales phenomenon is thus observed. A result of Möhle (1998) is extended to obtain convergence of the ancestral process to an ancestral recombination graph necessarily admitting simultaneous multiple mergers of ancestral lineages. The usual ancestral recombination graph is obtained as a special case of our model when the parents contribute only one offspring to the population each time.
Due to diploidy and large offspring numbers, novel effects appear. For example, the marginal genealogy at each locus admits simultaneous multiple mergers in up to four groups, and different loci remain substantially correlated even as the recombination rate grows large. Thus, genealogies for loci far apart on the same chromosome remain correlated. Correlation in coalescence times for two loci is derived and shown to be a function of the coalescence parameters of our model. Extending the observations by Eldon and Wakeley (2008), predictions of linkage disequilibrium are shown to be functions of the reproduction parameters of our model, in addition to the recombination rate. Simulations suggest that, in large offspring number populations, correlations in ratios of coalescence times between loci can be high even when the recombination rate is high and the sample size is large, which hints at how to distinguish between different population models.
An ancestral recombination graph for diploid populations with skewed offspring distribution
Matthias Birkner, Jochen Blath, Bjarki Eldon
Institut für Mathematik, Johannes-Gutenberg-Universität Mainz, 55099 Mainz, Germany
Institut für Mathematik, Technische Universität Berlin, 10623 Berlin, Germany
An ancestral recombination graph admitting simultaneous multiple mergers
ancestral recombination graph, diploidy, skewed offspring distribution, simultaneous multiple merger coalescent processes, correlation in coalescence times, linkage disequilibrium, ratios of coalescence times.
Institut für Mathematik, Technische Universität Berlin, Strasse des 17. Juni 136, 10623 Berlin, Germany
office: +49 303 142 5762
Diploidy, in which each offspring receives two sets of chromosomes, one from each of two distinct diploid parents, is fairly common among natural populations. Mathematical models in population genetics tend to assume, however, that all individuals in a population are haploid, simplifying the mathematics. Mendel’s Laws describe the mechanism of inheritance as composed of two main steps, equal segregation (First Law), and independent assortment (Second Law). The First Law proclaims gametes are haploid, i.e. carry only one of each pair of homologous chromosomes. Most models in population genetics are thus models of chromosomes, or gene copies. Mendel’s Second Law proclaims independent assortment of alleles at different genes, or loci, into gametes. Linkage of alleles on chromosomes, resulting in non-random association of alleles at different loci into gametes, is of course an important exception to the Second Law.
Coalescent processes (Kingman, 1982a, b; Hudson, 1983b; Tajima, 1983) describe the ancestral relations of chromosomes (or gene copies) drawn from a natural population. The coalescent was initially derived from a Cannings (1974) haploid exchangeable population model. Related ancestral processes take into account population structure (Notohara, 1990; Herbots, 1997), selection (Krone and Neuhauser, 1997; Neuhauser and Krone, 1997; Etheridge et al., 2010), and recombination between linked loci (Hudson, 1983a; Griffiths, 1991; Griffiths and Marjoram, 1997). The coalescent has proved to be an important advance in theoretical population genetics, and a valuable tool for inference of evolutionary histories of populations.
Ancestral recombination graphs (ARG) (Hudson, 1983a; Griffiths, 1991; Griffiths and Marjoram, 1997) trace ancestral lineages of gene copies at linked loci, in which linkage is broken up by recombination. An ARG is a branching-coalescing graph, in which recombination leads to branching of ancestral chromosomes, and coalescence to segments rejoining. Coalescence events in an ARG may not lead to coalescence of gene copies at individual loci. An example ARG for two linked loci is given below, labelled as , with notation borrowed from Durrett (2002). The labels and refer to the two alleles (types) at locus 1 and 2, respectively. A single chromosome with two linked alleles is denoted by , while chromosomes carrying ancestral alleles at only one locus are denoted and . When coalescence occurs at either locus, the number of alleles at the corresponding locus is reduced by one. The absorbing state, either or , is reached when alleles at both loci have coalesced.
In , the first transition is a recombination, denoted by , followed by a coalescence , in which the two alleles at locus 1 coalesce. Graph serves to illustrate two important concepts we will be concerned with, namely correlation in coalescence times between alleles at different loci, and the restriction to binary mergers of ancestral lineages.
Correlation in coalescence times between types at different loci follows from linkage. Alleles at different loci can become associated due to a variety of factors, including changes in population size, natural selection, and population structure. Within-generation fecundity variance polymorphism induces correlation between a neutral locus and the locus associated with the fecundity variance (Taylor, 2009). Sweepstake-style reproduction (Hedgecock et al., 1982; Hedgecock, 1994; Beckenbach, 1994; Avise et al., 1988; Palumbi and Wilson, 1990; Árnason, 2004; Hedgecock and Pudovkin, 2011), in which few individuals produce most of the offspring, has also been shown to induce correlation in coalescence times between loci (Eldon and Wakeley, 2008). Understanding genome-wide correlations in coalescence times becomes ever more important as multi-locus genetic data become ubiquitous.
The ARG exemplified by is characterised by admitting only binary mergers of ancestral lineages, i.e. exactly two lineages coalesce in each coalescence event. The restriction to binary mergers follows from bounds on the underlying offspring distribution, in which the probability of large offspring numbers becomes negligible in a large population (Kingman, 1982a, b). Sweepstake-style reproduction, in which few individuals contribute very many offspring to the population, has been suggested to explain the ‘shallow’ gene genealogy observed for many marine organisms (Hedgecock et al., 1982; Hedgecock, 1994; Avise et al., 1988; Palumbi and Wilson, 1990; Beckenbach, 1994; Árnason, 2004; Hedgecock and Pudovkin, 2011). Large offspring number models are models of extremely high variance in individual reproductive output. Namely, individuals can have very many offspring, up to the order of the population size, with non-negligible probability (Schweinsberg, 2003; Eldon and Wakeley, 2006; Sargsyan and Wakeley, 2008; Sagitov, 2003; Birkner and Blath, 2009). Such models do predict shallow gene genealogies, and can be shown to give better fit to genetic data obtained from Atlantic cod (Árnason, 2004) than the Kingman coalescent (Birkner and Blath, 2008; Birkner et al., 2011; Eldon, 2011; Steinrücken et al., 2012). Different large offspring number models will no doubt be appropriate for different populations, and the identification of large offspring number population models for each population is an open problem. For the sake of simplicity and mathematical tractability, the simple large offspring number model considered by Eldon and Wakeley (2006) will be adapted to our situation.
The coalescent processes derived from large offspring number models belong to a large class of multiple merger coalescent processes introduced by Donnelly and Kurtz (1999), Pitman (1999), and Sagitov (1999). Multiple merger coalescent processes (Λ-coalescents), as the name implies, admit multiple mergers of ancestral lineages in each coalescence event, in which any number of active ancestral lineages can coalesce, and at most one such merger occurs each time. In simultaneous multiple merger coalescent processes (Möhle and Sagitov, 2001; Schweinsberg, 2000a), any number of multiple mergers can occur each time, i.e. distinct groups of active ancestral lineages can coalesce each time. The ancestral recombination graph derived from our diploid large offspring number model admits simultaneous multiple mergers of ancestral lineages, as exemplified in . The last transition in is a simultaneous multiple merger, in which the two types at each locus coalesce to separate ancestral chromosomes.
In order to investigate correlations in coalescence times among loci due to skewed offspring distribution, we formally derive an ancestral recombination graph, or a coalescent process for many linked loci, from our diploid large offspring number model. The key to the proof of convergence to an ancestral recombination graph from our diploid model lies in resolving the separation of timescales phenomenon we observe. Following Mendel’s Laws, the two chromosomes of an offspring come from distinct diploid parents. Chromosomes can therefore only coalesce when in distinct individuals. The ancestral process will consist of two phases, a dispersion phase occurring on a ‘fast’ timescale, and a coalescence and recombination phase occurring on a ‘slow’ timescale. In the dispersion phase, chromosomes paired together in diploid individuals disperse into distinct individuals. Coalescence and recombination will only occur on the slow timescale. Similar separation of timescales issues arise in models of populations structured into infinitely many subpopulations (demes) (Taylor and Véber, 2009). When viewing the diploid individuals in our model as ‘demes’, our scenario departs from those describing structured populations by allowing only active ancestral lineages residing in separate ‘demes’ to coalesce. A simple extension of a result of Möhle (1998) yields convergence in our case.
The limiting process we formally obtain is an ancestral recombination graph for many loci admitting simultaneous multiple mergers of ancestral chromosomes (lineages). In simultaneous multiple merger coalescent processes, so-called Ξ-coalescents, different groups of active ancestral lineages can coalesce to different ancestors at the same time. Such coalescent processes were first studied as more abstract mathematical objects by Schweinsberg (2000a), and derived from general single-locus population models by several authors (Möhle and Sagitov, 2001; Sagitov, 2003; Sargsyan and Wakeley, 2008; Birkner et al., 2009). A Ξ-coalescent with necessarily up to quadruple simultaneous multiple mergers arises at each marginal locus (i.e. considering each locus separately) in our model, since four parental chromosomes are involved in each reproduction event. This structure is an intrinsic consequence of our diploidy assumptions.
Formulas for the correlation in coalescence times between two alleles at two loci are obtained using our ancestral recombination graph (ARG). As predicted by J.E. Taylor (personal communication), these correlations will not necessarily be small even for loci separated by high recombination rate. This is a novel effect not visible in classical models. The correlation structure will of course depend on the underlying coalescent parameters introduced by the large offspring number model we adopt. An approximation of the expected value of the statistics , commonly used to quantify linkage disequilibrium, is also investigated using our ARG. In addition, we employ our ARG to investigate correlations in ratios of coalescence times between loci for samples larger than two at each locus, using simulations.
A diploid population model with multilocus recombination and skewed offspring distribution
The forward population model
Consider a population consisting of diploid individuals, meaning that each individual contains two chromosomes. Each chromosome is structured into loci. We assume Moran-type dynamics: At each timestep (‘generation’), either a small or a large reproduction event occurs. In a small reproduction event, a single individual chosen uniformly at random from the population dies, and two other distinct individuals are chosen as parents. A diploid offspring is then formed by choosing one chromosome from each parent (see Figure 1). The parents always persist. A small reproduction event occurs with probability , in which depends on . In a large reproduction event, a fraction of the population perishes, meaning that individuals die ( for denotes the largest integer smaller than ). Two distinct individuals are then chosen uniformly from the remaining individuals to act as parents of offspring, and each offspring is formed independently by choosing one (potentially recombined) chromosome from each parent (see Figure 1). The population size always stays constant at diploid individuals. Individuals that neither reproduce nor die simply persist.
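The forward dynamics just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `psi` (fraction of the population replaced in a large event) and `large_prob` (probability that a timestep is a large event) are hypothetical stand-ins for symbols the text does not spell out, recombination is ignored, and the population is assumed to hold at least three individuals.

```python
import random


def reproduction_step(population, psi, large_prob, rng):
    """One timestep of a Moran-type diploid model, without recombination.

    population : list of individuals, each a pair (chrom_a, chrom_b)
    psi        : fraction replaced in a large event (assumed symbol)
    large_prob : probability that the event is large (assumed name)
    rng        : a random.Random instance
    """
    n = len(population)
    if rng.random() < large_prob:
        n_dead = int(psi * n)      # floor(psi * N) individuals perish
    else:
        n_dead = 1                 # small event: a single individual dies
    n_dead = min(n_dead, n - 2)    # two parents must remain alive

    dead = set(rng.sample(range(n), n_dead))
    survivors = [i for i in range(n) if i not in dead]
    parent_1, parent_2 = rng.sample(survivors, 2)   # two distinct parents

    new_population = list(population)
    for i in dead:
        # Each offspring independently receives one chromosome per parent.
        new_population[i] = (rng.choice(population[parent_1]),
                             rng.choice(population[parent_2]))
    return new_population


if __name__ == "__main__":
    rng = random.Random(1)
    pop = [(2 * i, 2 * i + 1) for i in range(10)]   # labelled chromosomes
    print(reproduction_step(pop, psi=0.5, large_prob=1.0, rng=rng))
```

Individuals that neither reproduce nor die are simply copied, so the population size stays constant, as in the model.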
Given the two parents, genetic types of the offspring individuals will then be obtained as follows. Each parent generates a large number of potential offspring chromosomes, of which a fraction are exact copies of the original parental chromosomes, and a fraction are recombinants. Each chromosome is structured into loci. Recombination occurs only between loci, and never within. If recombination between a pair of chromosomes in a parent occurs between loci and (where we say that is the crossover point), the two chromosomes exchange types at all loci from to . Only one crossover point is allowed in each recombination event. Let denote the probability of recombination between loci and (i.e., the probability that the potential crossover point equals ). An offspring chromosome is a recombinant with probability . Given that recombination happens, we thus have
Each pair of recombined chromosomes is formed independently of all other pairs. From this large pool of chromosomes, each new offspring is randomly assigned (independently of all other offspring in the case of a large reproduction event), one potentially recombined chromosome generated by each parent. In addition, the reproduction mechanism in different generations is assumed to be independent.
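As a rough illustration of the single-crossover mechanism, the sketch below produces one transmitted chromosome from a parent. The argument `crossover_probs` is a hypothetical name for the per-interval recombination probabilities, loci are indexed from 0, and the choice of which parental chromosome supplies the left-hand segment is assumed to be a fair coin flip.

```python
import random


def make_gamete(chrom_a, chrom_b, crossover_probs, rng):
    """Return one chromosome transmitted by a parent carrying chrom_a, chrom_b.

    crossover_probs[i] is the (assumed) probability that the single crossover
    falls between locus i and locus i + 1 (0-based). The sum of the entries is
    the probability that the transmitted chromosome is a recombinant.
    """
    assert len(chrom_a) == len(chrom_b) == len(crossover_probs) + 1
    # Fair coin (an assumption): which chromosome supplies the left segment.
    if rng.random() < 0.5:
        first, second = chrom_a, chrom_b
    else:
        first, second = chrom_b, chrom_a

    u = rng.random()
    cumulative = 0.0
    for point, prob in enumerate(crossover_probs):
        cumulative += prob
        if u < cumulative:
            # Loci 0..point come from `first`, the remaining loci from `second`.
            return first[:point + 1] + second[point + 1:]
    return first   # no recombination: exact copy of a parental chromosome


if __name__ == "__main__":
    rng = random.Random(0)
    a = ("a1", "a2", "a3", "a4")
    b = ("b1", "b2", "b3", "b4")
    print(make_gamete(a, b, [0.0, 1.0, 0.0], rng))
```

At most one crossover point is drawn per call, matching the restriction to a single crossover per recombination event.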
Ancestral relationships - notation
Now we switch from the forward population model to its ancestral process, running backwards in time. Our sample will consist of chromosomes, each subdivided into loci. Hence, we need to keep track of the ancestry of segments (types/alleles). This implies that the different segments could end up on up to distinct chromosomes in distinct ancestral individuals. The required notation will now be introduced, and our discourse will therefore necessarily become a little bit technical. However, we believe that a precise description of the objects we are working with is essential. The key to understanding our notation is that we are working with enumerated chromosomes, and ordered loci on chromosomes.
At present (that is, time step ), assume that we consider an even number of chromosomes carried by individuals. The chromosomes are enumerated from 1 to , attaching consecutive numbers to chromosomes found in the same individual. Our ancestral process will keep track of the chromosomal ancestral information, that is, which locus is ancestral to which set of sampled chromosomes. That is, in each generation (backward in time), we will record all chromosomes which are active in the sense that they carry at least one locus which is ancestral to the same locus of at least one chromosome in generation 0. Denote the number of active chromosomes in generation by . The number of active chromosomes can both increase, due to recombination, and decrease, due to coalescence, going back in time.
Now we explain our notation for the loci. For each chromosome , denote by locus on chromosome at time . The subsets of contain all the numbers of chromosomes at present (time step ) to which locus on active chromosome number at time step is ancestral. With this convention, and for each and , the collection
which describes the configuration of segments (i.e. which have coalesced and which have not) at locus at time , is a partition of , i.e.
Thus, with our notation we can correctly describe the configuration of segments among chromosomes at any given time. By we denote chromosome number at time . At time ,
For , consider the -th active chromosome at generation , where . The corresponding ancestral information at generation is encoded via an ordered list of subsets of , setting
Chromosomes are carried by diploid individuals. Keeping track of the grouping of active chromosomes into individuals will be important, since by our diploid reproduction mechanism, chromosomal lineages can only coalesce when in distinct individuals (see Example B below). In analogy with our previous nomenclature for our ancestral process, an active individual will carry at least one (and at most two) active chromosome(s). Let denote the number of active individuals at generation where for all . The ordered list of active chromosomes and the number of active individuals (called a ‘configuration’) at time is denoted by
An individual number at generation is denoted by ,
for . An active individual is single-marked, if
carrying one active chromosome, and is double-marked, if
carrying two active chromosomes. Specifying the arrangement of chromosomes in individuals completes our description of the (prelimiting) ancestral process. However, since all active individuals are single-marked in the limiting process, our description of the arrangement of chromosomes in individuals is given in Section 1.1.1 in the Appendix.
That is, each configuration begins with the ordered consecutive chromosomes of the double-marked individuals, followed by the chromosomes contained in single-marked individuals. With this convention, the set of single- and double-marked individuals and the grouping of chromosomes into individuals at generation is uniquely determined by a configuration of form (2).
For notational convenience, the time index will be omitted if there is no ambiguity.
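One possible encoding of this notation, offered only as a sketch: each active chromosome is a tuple with one `frozenset` per locus, holding the labels of the sampled chromosomes to which that locus is ancestral, with the empty set marking a locus that carries no ancestral material. The function name and the data layout are assumptions made for illustration. The check below verifies the partition property stated above at every locus.

```python
def is_valid_configuration(chromosomes, sample_size):
    """Check the partition property of an ancestral configuration.

    chromosomes : list of tuples, one frozenset of sample labels per locus
    sample_size : number of sampled chromosomes, labelled 1..sample_size
    At every locus, the non-empty sets carried by the active chromosomes must
    be pairwise disjoint and cover {1, ..., sample_size}.
    """
    target = set(range(1, sample_size + 1))
    n_loci = len(chromosomes[0])
    for locus in range(n_loci):
        blocks = [c[locus] for c in chromosomes if c[locus]]
        union = set().union(*blocks)
        total = sum(len(block) for block in blocks)
        # Covering plus matching total size implies pairwise disjointness.
        if union != target or total != sample_size:
            return False
    return True


if __name__ == "__main__":
    # Two sampled chromosomes, two loci, at time 0: each chromosome is
    # ancestral only to itself at every locus.
    start = [(frozenset({1}), frozenset({1})),
             (frozenset({2}), frozenset({2}))]
    print(is_valid_configuration(start, 2))
```

Storing the sets per locus keeps the marginal genealogy at each locus readable directly from the configuration.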
For a given sample size , the set of all possible ancestral configurations will be denoted by . The subset of all configurations with , i.e. configurations consisting only of single-marked individuals, will play an important rôle later on. Indeed, all configurations in the limiting model will be confined to the set , and the pairing of chromosomes in individuals will become irrelevant.
The mapping (‘complete dispersion’)
breaks up the pairing of chromosomes into diploid double-marked individuals. More precisely, we define
Configurations in describe configurations in which all active individuals are single-marked, i.e. carry only one active chromosome.
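The effect of complete dispersion can be sketched as follows. The representation is an assumption for illustration only: a configuration is a list of active individuals, each a tuple holding one chromosome (single-marked) or two chromosomes (double-marked), with double-marked individuals listed first.

```python
def complete_dispersion(individuals):
    """Split every double-marked individual into two single-marked ones.

    individuals : list of tuples with one or two chromosomes each
    Returns a list of one-element tuples, preserving chromosome order.
    """
    dispersed = []
    for individual in individuals:
        for chromosome in individual:
            dispersed.append((chromosome,))
    return dispersed


if __name__ == "__main__":
    config = [("c1", "c2"), ("c3",)]     # one double-marked, one single-marked
    print(complete_dispersion(config))   # [('c1',), ('c2',), ('c3',)]
```

Applying the map twice gives the same result as applying it once, which is what one expects of a projection onto the all-single-marked configurations.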
The effects of recombination and coalescence on the ancestral configurations in the case of two typical situations will now be illustrated. Example A will illustrate recombination, and Example B will illustrate coalescence of two chromosomes.
Example A. Suppose the most recent previous event in the history of a given configuration was a small reproduction event (at time ), and suppose that the resulting offspring individual is currently part of our configuration at time , but neither of its parents is, and that the offspring individual is single-marked, i.e. carries one active chromosome. We obtain as follows:
If there is no recombination during the reproduction event, then the configuration in the previous generation remains unchanged, i.e. .
If there is recombination, say at a crossover point , suppose the (single) offspring chromosome is
Necessarily, the two parental chromosomes will be part of the configuration , residing in the same double-marked individual. More precisely, the two parental chromosomes, say and , are determined by (for )
in which denotes loci not carrying any ancestral segments. The offspring chromosome is of course not part of . This transition can be partially trivial (a ‘silent recombination’ event), if the crossover point is not in an ‘active’ area, i.e. if for (or for all ). By way of example, with , if chromosome was a recombinant, and the crossover point occurred between loci 2 and 3, the two parental chromosomes are given by and .
Example B. Suppose the most recent previous event in the history of a given configuration of chromosomes at generation is a small reproduction event at time , leading to a coalescence of lineages. This is the case e.g. if both a single-marked offspring individual with active chromosome is in our configuration , as well as its single marked parent (say with currently active chromosome ), from which it actually obtained its active chromosome. Then, to obtain the configuration , the offspring chromosome is deleted, and the resulting ancestral chromosome is given by the family of the union of the sets and ,
All other chromosomes in are copied from . Again, taking , if chromosomes and coalesce, the resulting ancestral chromosome is given by .
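Examples A and B can be mimicked on a small data structure. In this sketch (an illustration under assumptions, not the paper's formalism), an ancestral chromosome is a tuple of `frozenset`s, one per locus, and `EMPTY` marks a locus without ancestral material. The argument `point` is a hypothetical encoding of the crossover point as the number of loci to its left.

```python
EMPTY = frozenset()


def split_at(chromosome, point):
    """Example A, backwards in time: split a recombinant chromosome.

    The first parental chromosome keeps the ancestral sets of the first
    `point` loci, the second keeps those of the remaining loci. All other
    positions are marked EMPTY (no ancestral material).
    """
    n_loci = len(chromosome)
    left = chromosome[:point] + (EMPTY,) * (n_loci - point)
    right = (EMPTY,) * point + chromosome[point:]
    return left, right


def coalesce(chrom_x, chrom_y):
    """Example B: the ancestral chromosome is the locus-wise union."""
    return tuple(a | b for a, b in zip(chrom_x, chrom_y))


if __name__ == "__main__":
    c = (frozenset({1}), frozenset({1}), frozenset({1}), frozenset({1}))
    left, right = split_at(c, 2)      # crossover between loci 2 and 3
    print(left)
    print(right)
    print(coalesce(left, right) == c)
```

If all ancestral material lies on one side of the crossover, one of the two returned chromosomes consists only of `EMPTY` entries, which corresponds to a silent recombination.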
Scaling and classification of transitions
In order to obtain a non-trivial scaling limit for as , the limit theorem of Möhle and Sagitov (2001) (cf. also the special case considered in Eldon and Wakeley (2006)) suggests one should, for some constant , choose probability for the small reproduction events, for the large reproduction events, i.e., setting
and speed up time by . For the recombination rate to be non-trivial in the limit (i.e. neither 0 nor infinitely large), we require that all recombination values scale in units of , i.e. for each crossover point ,
Thus, even though our timescale is in units of timesteps, recombination is scaled in units of timesteps. On the level of single lineages the probability of recombination is of the order . Indeed, after a small reproduction event, the probability of drawing an offspring is . The probability that the offspring carries a recombined chromosome is of order .
Given the cornucopia of possible transitions from to , it will be important to identify those transitions which are expected to be visible in the limiting process.
All possible transitions fall into the following three regimes:
Those transitions which happen at probability of order per generation, which will be visible in the limit (since time will be scaled by ). They will be called effective transitions and will appear at a finite positive rate in the limit.
Further, there are transitions which happen less frequently, typically with probability of order or smaller per generation, which will thus become negligible as and hence be invisible in the limit. These will be called negligible transitions.
Finally, there are transitions which happen much more frequently (with probability of order or even per generation). At first sight, one might think that their presence might lead to chaotic behaviour in the limit. However, this will not be the case. Instead, these transitions will happen ‘instantaneously’ in the limit, and result in a projection of the states of our process from into the subspace , which will be the limiting state space. This will be proved below. Such transitions will be called projective or instantaneous transitions. The identity transition is a special case of a projective transition.
In the Appendix (section A.1), a full classification of all transitions into the above groups is provided.
Instantaneous and effective transitions
The most important transitions and their effect for the limiting process will now be described in detail. Consider the following most recent events in the history of a set of lineages, i.e. events occurring at time , from the perspective of the ancestral process at time :
Event 1 (silent): A small reproduction event occurs, but the offspring is not active. This is the most likely event, with probability of the order , but it does not affect our ancestral configuration process , i.e. . This event leads to an identity transition (a trivial instantaneous transition).
Event 2 (dispersion): A small reproduction event occurs, the offspring is active in our sample but neither parent is, and recombination does not occur. This is a relatively frequent event which occurs with a probability of the order per generation (since the probability that the offspring is in the sample is ). If the offspring carries only one active chromosome, we again see an identity transition, i.e. . If the offspring carries two active chromosomes, i.e. is a double-marked individual, the two active chromosomes will disperse to two separate individuals, who will then become single-marked individuals. Formally, for with at least one double-marked individual , define the map dispersing the chromosomes paired in individual ,
if and otherwise. Recall that the double-marked individual has chromosomes labelled and . For , if the -th double marked individual is affected, we have the transition .
The dispersion events will happen instantaneously as (recall we are speeding time up by ), and thus will, in the limit, lead to an immediate complete dispersion of all chromosomes paired in double-marked individuals. If in the course of events, a new double-marked individual emerges due to pairing of active chromosomes in the same diploid individual, a dispersion of the chromosomes will occur immediately. Event 2 will hence result in a permanent instantaneous transition, mapping our current state into the subspace by means of the map defined in (3). Our limiting process will thus live, with probability one for each given , in , even if we start with a configuration from at time .
Event 3 (recombination): A small reproduction event occurs, a single-marked offspring but neither parent is in our sample, and recombination affects the active chromosome at a crossover point . This event has probability of the order per generation, and will thus be visible with finite positive rate in the limit. It is an effective transition, which can be described formally as follows. Define the recombination operation acting on chromosome and crossover point for a configuration as
(if one of , equals , we define , giving rise to a silent recombination event).
Event 4 (pairwise coalescence): A small reproduction event occurs, one single-marked parent and a single-marked offspring are in the sample, the active chromosome is inherited from the parent in the sample, and recombination does not occur. This event occurs with probability of order and will therefore be visible in the limit with finite positive rate, hence gives rise to an effective transition. It will lead to a binary coalescence of lineages and can formally be described as follows. The ancestral chromosome formed by the coalescence of chromosomes and is given by
if . Define the binary coalescence operation acting on chromosomes and in a configuration as
if (otherwise, we put ).
Event 5 (multiple merger coalescence): A large reproduction event occurs, neither parent but (possibly several) single marked offspring are in our sample, and recombination does not occur. This is again an event with probability of order per generation and therefore will be visible in the limit with finite positive rate, hence gives rise to an effective transition. The offspring chromosomes will be assigned their parental chromosomes independently and uniformly at random, since due to an immediate ‘complete dispersion’ via Event 2 each offspring individual will carry precisely one active chromosome. Now we formally define the multiple coalescence operation for and pairwise disjoint subsets in which either at least one or at least two of the . This transition is, thus, really different from a transition. Let denote the set of offspring chromosomes derived from parental chromosome . Then
and the four parental chromosomes, at least one of which is involved in a merger, are given by ,
The chromosome(s) appearing in denote the chromosomes in that are not involved in a merger.
All other events: Will either not affect our ancestral process, or have a probability of order smaller than so that they will be absent in the limit after rescaling. A complete classification of these events will be given in the Appendix (section A.1).
The limiting dynamics and state space
The expected dynamics of the limiting continuous time Markov chain , taking values in , as , will now briefly be discussed.
Complete dispersion (Event 2) of the sampled chromosomes is the first event to occur (between times and ). By we denote individual number (see section 1.1.1 in Appendix). At time when we assume all sampled chromosomes are paired in double-marked individuals ( even);
Immediately (at time ), the chromosomes disperse into single-marked individuals,
Throughout the evolution of the process, whenever double marked individuals appear (e.g. from a coalescence of lineages event), Event 2 will immediately change our configuration to the corresponding ‘all dispersed’-configuration, i.e., for each ,
Such ‘flickering’ states will not affect any quantities of interest of our genealogy, so we can assume that they will be removed from the limit by choosing the càdlàg modification of , taking only values in for all (this modification does not affect the finite-dimensional distributions of ).
Recombination (Event 3) appears in the limiting process at total rate , where a particular recombination involving a given crossover point appears with rate on any lineage. Indeed, from our scaling considerations, the probability of not seeing a recombination at in a small resampling event for more than scaled time units for a given single-marked individual satisfies
as (recall (6); the probability for any given individual to be the child in a small reproduction event is ), hence the waiting time for this event to happen is exponential with rate .
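The passage from a per-timestep probability to an exponential waiting time can be checked numerically. The sketch below is generic: `scale` is a hypothetical stand-in for the time-rescaling factor of the model and `rate` for the limiting rate, and the only fact used is that the probability of no event in `t * scale` steps, each with event probability `rate / scale`, approaches `exp(-rate * t)` as `scale` grows.

```python
import math


def no_event_probability(rate, t, scale):
    """P(no event during t * scale timesteps) when each timestep
    independently produces the event with probability rate / scale."""
    return (1.0 - rate / scale) ** (t * scale)


if __name__ == "__main__":
    rate, t = 0.5, 2.0                 # illustrative values only
    for scale in (10 ** 2, 10 ** 4, 10 ** 6):
        print(scale, no_event_probability(rate, t, scale))
    print("limit", math.exp(-rate * t))
```

The printed values approach the exponential limit from below as `scale` increases, which is the discrete-to-continuous step used in the argument above.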
Coalescences appear according to the effective transitions described by Event 4 and Event 5. From the point of view of a given pair of active chromosomes in different individuals, a single pairwise coalescence will occur at rate with from (15) (with , ), where the comes from a pairwise coalescence according to a small reproduction event, and the from a large merger event (the rates can be easily derived from considerations similar to the recombination rate above), recalling that both coalescing chromosomes have to ‘successfully flip a -coin’ in order to take part in the large coalescence event, and then are uniformly distributed into four groups according to the choice of any of the four potential parental chromosomes.
Given large coalescence events (involving at least three individuals, or at least two simultaneous pairwise mergers) happen with overall rate times the corresponding coalescence rate of a -coalescent, obtained from the number of individuals taking part in the merger independently with probability . The participating individuals are then distributed uniformly into four groups according to the chosen parental chromosome. The corresponding rate is given in the third line of (14) (cf. also (15)).
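The grouping step of a large reproduction event, as described above, can be sketched directly: each single-marked lineage takes part independently with some probability, and each participant is assigned uniformly to one of the four parental chromosomes. The symbol `psi` for the participation probability is an assumption of this sketch, and the value returned by `pair_merge_probability` is only the per-event probability implied by that verbal description, not a rate quoted from the paper.

```python
import random
from collections import defaultdict


def large_merger(lineages, psi, rng):
    """Sample the effect of one large reproduction event.

    lineages : labels of the active (single-marked) lineages
    psi      : participation probability of each lineage (assumed symbol)
    Returns (groups, untouched): groups maps a parental chromosome 0..3 to
    the lineages assigned to it. Only groups with two or more lineages
    correspond to actual mergers.
    """
    groups = defaultdict(list)
    untouched = []
    for lineage in lineages:
        if rng.random() < psi:
            groups[rng.randrange(4)].append(lineage)   # uniform over 4 parents
        else:
            untouched.append(lineage)
    return dict(groups), untouched


def pair_merge_probability(psi):
    """Both lineages take part (psi * psi) and pick the same one of the
    four parental chromosomes (probability 1/4)."""
    return psi * psi / 4.0


if __name__ == "__main__":
    rng = random.Random(0)
    print(large_merger(list(range(10)), psi=0.7, rng=rng))
    print(pair_merge_probability(0.5))
```

Because there are only four parental chromosomes, at most four groups can form in one event, in line with the up-to-quadruple simultaneous mergers mentioned earlier.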
The limiting ancestral process
According to the above consideration, it is now plausible to consider the following limiting Markov chain as the ancestral limiting process. This fact will be proved below, with most computations provided in the Appendix. The -th falling factorial is given by , . The operations , and for elements of were defined above in the section on scaling. Now we define the generator of the continuous-time ancestral recombination graph derived from our model.
Definition 1.1 (Limiting multilocus diploid ancestral recombination graph).
The continuous-time Markov chain with values in , initial condition for and transition matrix , with entries for elements is given by ,
(where in the penultimate line we only consider cases where either at least one or at least two of the ), with
For the diagonal elements, one has of course
The rates in (15) are the transition rates of the Ξ-coalescent (a simultaneous multiple merger coalescent) with
when distinct groups of ancestral lineages merge. The number of lineages in each group is given by , given active ancestral lineages. The number gives the number of lineages (ancestral chromosomes) unaffected by the merger (cf. Schweinsberg (2000a), Thm. 2). The particular form of given above follows from the fraction of the population replaced by the offspring of the two parents in a large reproduction event, and our assumption that each parent contributes exactly one chromosome to each offspring. We have the following convergence result.
A proof can be found in the Appendix. If , the classical ancestral recombination graph for a diploid population with recombination in the spirit of Griffiths and Marjoram (1997) results.
General diploid Moran-type models: “random”
One of the aims of the present work is to understand the genome-wide correlations in gene genealogies induced by sweepstake-style reproduction. So far, we have discussed this for a very simple example of a sweepstake mechanism (analogous to the one considered in Eldon and Wakeley (2006)). More precisely, the fraction of the population replaced by the offspring of a single pair of individuals in a large offspring number event has hitherto been assumed to be (approximately) constant. Along the lines of the previous discussion, an ancestral recombination graph with a randomized offspring distribution can be derived. A comprehensive discussion of single-locus haploid Moran models in the domain of attraction of -coalescents can be found in a recent article of Huillet and Möhle (2011). Even though is now considered a random variable, the population size stays constant at diploid individuals. Allowing to be random may be biologically more realistic than taking to be a constant. On the other hand, the problem of identifying suitable classes of probability distributions for , reflecting the specific biology of given natural populations, is still open and an area of active research.
To explain the convergence arguments when is random, let the random variable , taking values in , denote the random number of diploid offspring contributed by the single reproducing pair of parents at each timestep; a new realisation of is drawn before each reproduction event. Again, we consider the effect of such a reproduction mechanism on coalescence events in a sample. The probability that two given chromosomes residing in two single-marked individuals in the sample coalesce in the previous timestep given the value of is
where the first and second terms on the right-hand side describe the case where one parent and one offspring are drawn, the third term covers the case where two offspring are drawn, and the accounts for the requirement that the two chromosomes in question descend from the same parental chromosome. Define
(the factor facilitates comparison with the haploid case). The sequence of laws , , will be assumed to satisfy the following three conditions:
and there exists a probability measure on such that
for all continuity points of .
Condition (20) is necessary for any limit process of the genealogies to be a continuous-time Markov chain, condition (21) ensures that a separation of time scales phenomenon occurs, and (22) fixes the limit dynamics of the large merging events (it is analogous to the necessary condition (13) of Sagitov (1999) in the haploid case). In the proof of convergence to a limit process we will recall equivalent conditions to (22) (see Section A.4 of the Appendix). Condition (20) implies (see Section A.4 of the Appendix)
i.e. the probability that a given individual is an offspring in a given reproduction event becomes small. Hence, (23) and (21) together show that there are two diverging time-scales: the “short” time-scale, on which chromosomes paired in double-marked individuals disperse into single-marked individuals, and the “long” time-scale, over which we observe non-trivial ancestral coalescences.
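The verbal description of the conditional coalescence probability given above (one parent and one offspring drawn, in either order, or two offspring drawn, together with the requirement of a common parental chromosome) can be turned into a small computation. The sketch below is one possible reading of that description and not a transcription of the displayed formula. It assumes that the two sampled individuals are distinct and drawn uniformly without replacement from `n` diploid individuals, that the two parents persist, that `k` offspring replace `k` of the other individuals, and that the common-ancestry factor equals 1/4 because there are four potential parental chromosomes. The names `k` and `n` are ours.

```python
from fractions import Fraction


def pair_coalescence_probability(k, n):
    """Probability that two given chromosomes in two distinct single-marked
    individuals coalesce in one timestep, given k offspring (assumed reading).

    Assumptions (not taken from the displayed formula): n diploid individuals,
    two persisting parents, k offspring with 1 <= k <= n - 2, and a factor 1/4
    for descending from the same one of four parental chromosomes.
    """
    if not 1 <= k <= n - 2:
        raise ValueError("expected 1 <= k <= n - 2")
    parent_then_offspring = Fraction(2, n) * Fraction(k, n - 1)
    offspring_then_parent = Fraction(k, n) * Fraction(2, n - 1)
    two_offspring = Fraction(k, n) * Fraction(k - 1, n - 1)
    same_parental_chromosome = Fraction(1, 4)
    return same_parental_chromosome * (
        parent_then_offspring + offspring_then_parent + two_offspring
    )


if __name__ == "__main__":
    n = 1000
    # Averaging over a law for k gives the unconditional pair coalescence
    # probability per timestep; here k is simply tabulated for a few values.
    for k in (1, 10, 100, 500):
        print(k, float(pair_coalescence_probability(k, n)))
```

Under these assumptions the probability grows roughly quadratically in `k` once `k` is large, which is why rare events with a large number of offspring can dominate the coalescence rate.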
In order to obtain a non-trivial genealogical limit process, we will then speed up time by a factor of , i.e., reproduction events correspond to one coalescent time unit (see Thm. 1.3 below). This time rescaling is chosen in order for two chromosomes to coalesce at rate 1 in the limit. The required scaling relation for the recombination rates is now
with fixed for (where means ). An intuitive explanation for the requirement (24) is that since the probability for a given individual to be an offspring in a given reproduction event is , after speeding up time by , on any lineage recombination events between locus and occur as a Poisson process with rate .
A simple sufficient condition for (21) is the following: For any ,
Indeed, we have, by assuming ,
Dividing by gives
and, since ,
Thus, condition (21) is obtained since we can choose to be as small as we like.
The limiting genealogical process will then be a continuous-time Markov chain on with generator matrix whose off-diagonal elements are given by (for the values on the diagonal we again have (16))
, , and
with from (22). As in the case of constant , the third line in (26) gives the transition rates for a given merger into groups of sizes when active ancestral lineages are present, with lineages unaffected by a given merger of the Ξ-coalescent with
Theorem 1.3. Let be the ancestral process of a sample of chromosomes in a population of size with offspring laws which satisfy (20), (21) and (22), and assume the scaling relation (24) for the recombination rates. Then, starting from , we have that
in the sense of the finite-dimensional distributions on the interval . The process is the Markov chain with generator matrix (26) and initial value given by
The proof is given in Section A.4 of the Appendix.
While by definition, in principle any decay behaviour of that is consistent with , and hence any scaling relation between coalescent time scale and model census population size derived from it, is possible via a suitable choice of the family