Tractable diffusion and coalescent processes for weakly correlated loci

Tractable diffusion and coalescent processes for weakly correlated loci

\fnmsPaul A. \snmJenkinslabel=e1]p.jenkins@warwick.ac.uk\thanksreft1 [ Department of Statistics
University of Warwick
Coventry CV4 7AL
UK
\printeade1
University of Warwick
   \fnmsPaul \snmFearnheadlabel=e2]p.fearnhead@lancaster.ac.uk [ Department of Mathematics and Statistics
Lancaster University
Lancaster LA1 4YF
UK
\printeade2
Lancaster University
   \fnmsYun S. \snmSong\correflabel=e3]yss@stat.berkeley.edu\thanksreft2 [ Department of Statistics and
 Computer Science Division
University of California, Berkeley
Berkeley, CA 94720
USA
\printeade3
University of California, Berkeley
Abstract

Widely used models in genetics include the Wright-Fisher diffusion and its moment dual, Kingman’s coalescent. Each has a multilocus extension but under neither extension is the sampling distribution available in closed-form, and their computation is extremely difficult. In this paper we derive two new multilocus population genetic models, one a diffusion and the other a coalescent process, which are much simpler than the standard models, but which capture their key properties for large recombination rates. The diffusion model is based on a central limit theorem for density dependent population processes, and we show that the sampling distribution is a linear combination of moments of Gaussian distributions and hence available in closed-form. The coalescent process is based on a probabilistic coupling of the ancestral recombination graph to a simpler genealogical process which exposes the leading dynamics of the former. We further demonstrate that when we consider the sampling distribution as an asymptotic expansion in inverse powers of the recombination parameter, the sampling distributions of the new models agree with the standard ones up to the first two orders.

[
\kwd
\startlocaldefs\endlocaldefs\runtitle

Evolution of weakly correlated loci

\thankstext

t1Supported in part by EPSRC Research Grant EP/L018497/1 and an NIH Grant R01-GM094402. and and \thankstextt2Supported in part by an NIH Grant R01-GM094402, and a Packard Fellowship for Science and Engineering.

class=MSC] \kwd[Primary ]92D15 \kwd[; secondary ]65C50,92D10

population genetics \kwdrecombination \kwdsampling distribution \kwddiffusion \kwdcoupling

1 Introduction

The basis of many important problems in genetics is to find an expression for a sampling distribution or likelihood. Valuable tools in this endeavour are stochastic models of allele frequency evolution forwards in time, and their dual genealogical processes backwards in time. In particular, the numerous variants of the Wright-Fisher diffusion and Kingman’s coalescent, respectively, have focused attention on the scaling limit as the population size goes to infinity, leading from a (complicated) finite-population model of reproduction to a (simpler) infinite-population limit. At a single genetic locus, the problem of computing sampling distributions in these models is well studied, with even some closed-form formulas available (Wright, 1949; Ewens, 1972; Jenkins and Song, 2011; Bhaskar, Kamm and Song, 2012). However, with ongoing technological developments in high-throughput DNA sequencing, large genomic datasets are becoming available and it is necessary to consider multilocus models. Inter-locus recombination quickly makes such models intractable; for neither the Wright-Fisher diffusion with recombination nor the coalescent with recombination—or ancestral recombination graph (ARG)—is it possible to obtain a closed-form expression for the sampling distribution. This has remained a notoriously difficult problem, and to make progress using these models it has usually been necessary to resort to computationally-intensive techniques such as importance sampling (Griffiths and Marjoram, 1996; Fearnhead and Donnelly, 2001; Griffiths, Jenkins and Song, 2008; Jenkins and Griffiths, 2011), Markov chain Monte Carlo (Kuhner, Yamato and Felsenstein, 2000; Nielsen, 2000; Wang and Rannala, 2008; Rasmussen et al., 2014), or other numerical approximations (Boitard and Loisel, 2007; Miura, 2011). Denoting the population-scaled recombination parameter by , only in the special cases of or is it possible to make progress analytically, since then we are back to a single locus, or to many independent single loci, respectively.

In another direction, we have considered an analytic approach to the problem, as follows. Denote the observed sample configuration at two loci by and its sampling probability by (to be defined precisely below). Consider the asymptotic expansion in inverse powers of :

(1)

where for convenience we suppress the dependence of these terms on other parameters of the model. Under an infinite-alleles type of mutation, we obtained closed-form formulas for and in terms of the marginal one-locus sampling probabilities, and a decomposition of into a closed-form term plus a second part which is evaluated easily by dynamic programming (Jenkins and Song, 2010). (The result is stated more precisely in Theorem 2.1 below.) This provides the first closed-form extension of Ewens’ Sampling Formula (Ewens, 1972) to handle finite amounts of recombination. It has been extended subsequently to include more general models of mutation (Jenkins and Song, 2009), natural selection (Jenkins and Song, 2012), higher-order terms (Jenkins and Song, 2012), and more than two loci (Bhaskar and Song, 2012), and has had practical implications for genomic inference (Chan, Jenkins and Song, 2012). One particularly appealing conclusion of these works is that both and are universal; that is, their functional form is invariant to our assumptions about mutation and selection acting marginally at each locus. The effects of these marginal processes are entirely subsumed into the relevant one-locus sampling distributions.

The simple and universal forms for and provide strong circumstantial evidence that there exists an underlying stochastic process which is much simpler than the standard models for finite amounts of recombination. In particular, we previously conjectured (Jenkins and Song, 2010) the existence of a process which is both much simpler than the standard models based on the Wright-Fisher diffusion or on the ARG, and is in agreement with the sampling distribution (1) up to . The goal of this paper is to describe such a process. In fact, using different arguments we describe two such processes, obtaining both a limiting diffusion and a coalescent process with these properties. In the diffusion approximation, the key idea is to suppose that the probability of a recombination per individual per generation scales as as the population size , for , rather than the usual choice of . Interest in asymptotically large recombination rates is reasonable because of extensive recombination rate heterogeneity along chromosomes in e.g. humans, strong recombination rates in some species such as Drosophila melanogaster (Chan, Jenkins and Song, 2012), and because of the need to understand the long-range dependencies between well-separated loci. Our diffusion in this scaling is intimately related to the central limit theorem for density dependent population processes (see Ethier and Kurtz, 1986, Theorem 11.2.3), which has been analyzed in genetics—for models of strong mutation rather than strong recombination—by Feller (1951) and Norman (1975a). A closely related scaling in the context of -coalescent processes was also recently explored by Birkner, Blath and Eldon (2013) (in that paper but with timescale ). The coalescent approach, meanwhile, uses a coupling argument. Intuitively, we would like to couple the ARG to the limiting case of two independent coalescent trees (). To account for contributions to the sampling distribution of , we must quantify the “leading order reasons” for such a coupling to fail. When is large but finite, lineages in the ARG ancestral to both loci undergo recombination backwards in time very rapidly, until the first time that no such lineage survives. In this paper we show that, roughly speaking, in order to recover the sampling distribution up to we need consider only the following type of exceptional event: a coalescence occurs more recently than time in the ARG, and the coalescence is between two lineages each of which is ancestral to both of the two loci. This observation enables us to define a simple coalescent process which allows for at most one of these events but is otherwise very similar to the easy limiting process corresponding to .

The paper is organized as follows. In Section 2 we specify our notation and summarize previous research. Novel diffusion and coalescent processes are introduced in Sections 3 and 4, respectively, and we conclude in Section 5 with a brief discussion.

2 Notation and previous results

For , let . The complement of a set is written . Denote the Kronecker delta by which takes the value 1 if and 0 otherwise. Let denote a unit vector whose th entry is , and let denote a matrix with th entry equal to . For a vector we denote by the usual Euclidean norm. Denote the zero matrix by and the identity matrix by . We will replace a subscript with a “” to denote summation over that index. A prime symbol will denote vector or matrix transpose. For and , denotes the th ascending factorial of . Finally, for a matrix of processes we let denote the matrix of corresponding covariation processes.

Consider the usual diffusion limit of an exchangeable model of random mating with constant population size of haplotypes. Our interest will be in a sample from this population at two loci, which we call A and B, with the probability of mutation per haplotype per generation denoted by and respectively. In the diffusion limit we let and , while the population-scaled parameters and remain fixed. In this paper we will suppose a finite-alleles model of mutation such that a mutation to an allele in type space , , takes it to allele with probability , with and , defined analogously. (As we discover below, the mutation model is not important and we could pose something more complicated with little extra effort.) The probability of a recombination between the two loci per haplotype per generation is denoted by , and we assume that is fixed as , for some fixed . Previous work has focused on the case with time measured in units of generations. For consistency with the usual notation we write .

A sample from this model comprises haplotypes observed only at locus A, haplotypes observed only at locus B, and haplotypes observed at both loci. The sample configuration is denoted by where and is the number of haplotypes observed to exhibit allele at locus A; where is the number of haplotypes observed to exhibit allele at locus B; and where is the number of haplotypes with allele at locus A and allele at locus B. Thus,

and we let . We further write and to denote the marginal sample configurations of restricted to locus A and locus B respectively. Finally, we use to denote the probability that when we sample haplotypes in some order from the population at stationarity we obtain the unordered configuration ; by sampling exchangeability this is indeed a function only of the unordered configuration . For convenience we suppress the dependence of this quantity on the model parameters and on . The main result motivating this work is an expansion for for the case of , and later we will show that this expansion holds for all .

Theorem 2.1 (See Jenkins and Song (2009)).

Consider the following asymptotic expansion for under the diffusion limit with :

with , , independent of . Then the zeroth order term is given by

(2)

and the first order term is given by

(3)

where , are the marginal sampling distributions at locus A and locus B, respectively.

Remark 2.1.

Under a neutral, finite-alleles model of mutation, if mutation is parent independent—that is, , , and , , then and are known in closed-form:

These expressions follow, for example, from the moments of the Wright-Fisher diffusion with parent-independent mutation, whose stationary distribution at locus A is (Wright, 1949), and similarly at locus B.

Remark 2.2.

The zeroth-order decomposition is well known (e.g. Ethier, 1979) and also intuitive, since the two loci become independent as .

Theorem 2.1 can be obtained by diffusion (Jenkins and Song, 2012) or by coalescent (Jenkins and Song, 2009, 2010) arguments. In this paper we address both approaches in further detail.

3 Diffusion model

In this section we extend the above results by obtaining a full description of a simple diffusion process such that its sampling distribution is known exactly and has a Taylor expansion about consistent with (2) and (2.1). For simplicity we will obtain our diffusion as the limit of an appropriately rescaled Moran model, although we expect our results to hold for a more general class of discrete models of reproduction within the domain of convergence of the Wright-Fisher diffusion.

3.1 Neutral Moran model

A population of haploid, monoecious individuals evolves as a multitype birth-and-death process in continuous time. Each individual carries a haplotype comprising a pair of alleles , one at locus A and one at locus B. Let denote the number of haplotypes in the population at time , and . The population evolves as follows. At rate a reproduction event occurs, in which an individual is chosen uniformly at random from the population to die. It is replaced by a copy of another individual also chosen uniformly at random (the same individual could be chosen; whether sampling is with or without replacement does not affect the diffusion limit). Independently, each locus of each haplotype undergoes mutation: any locus A mutates at rate and its allele is updated according to the transition matrix ; similarly any locus B mutates at rate and its allele is updated according to . Finally, each haplotype independently undergoes recombination at rate : at such an event, it is replaced by a haplotype formed by sampling two alleles (one for each locus) independently from the population. Putting all this together, the rate at which a haplotype dies and is replaced by a haplotype when is given by

Notice that, as is standard (e.g. Baake and Herms, 2008), we decouple the mutation and recombination mechanisms from reproduction (and from each other). This simplifies the analysis without unduly affecting the diffusion limit.

We will change variables by introducing the collection

where

That is, we describe the state of the Moran model at time by the marginal allele frequencies and the coefficients of linkage disequilibrium (see, e.g. Ewens, 2004, p69, p227). We will write this succinctly by arranging the variables in a linear order:

and thinking of as a vector of length . The process is then Markov on a state space we denote by , which is a rational subset (those points consistent with ) of the -dimensional shifted simplex

To find the diffusion limit we first need the conditional means and covariances of the increments

From these, and under the assumption that , , and are fixed as , it is possible to show that the model converges to a (Wright-Fisher) diffusion limit (Ethier and Kurtz, 1986, Example 10.3.9, p433). Recall however that our interest is when , rather than , is fixed, so below we write these increments in terms of using .

In the following, for convenience we drop the dependence on .

Proposition 3.1.

In the neutral two-locus Moran model with mutation and recombination, the conditional means and covariances of increments of are given by

(4)
(5)
(6)

Higher order moments of order are .

Proof.

These expressions follow directly from the first four moments of , which are easily computed by noting that

For example, choosing we find

and hence we recover (4) via

The remaining terms follow similarly; we omit the straightforward but lengthy algebraic details. ∎

To prepare for our diffusion limit, we must rescale time; from (6) it is clear that to obtain a nontrivial limit we should let . Now introduce the conditional mean vector and conditional covariance matrix on this timescale, defined by

(7)
(8)

with entries determined by Proposition 3.1. Thus, with , equations (4)–(6) show that

where (9)

with determined in a similar fashion:

where

Notice in particular the different leading orders of the two quantities in (7) and (8): the mean increments are of on this timescale while the covariances are of . It is this difference, which is a consequence of our assumption that the recombination probability is for , that leads to a novel diffusion limit. Under the usual choice of it is well known that we see convergence to a diffusion process after a linear rescaling of time. In the special case of a Wright-Fisher model and , the diffusion limit for as was obtained by Ohta and Kimura (1969a, b). Our interest is however in , for which is larger, and the loss of linkage disequilibrium (LD) is subsequently much faster. Intuitively, we should expect such loss to resemble the exponential decay predicted in an infinitely large population, but with small fluctuations about this deterministic behaviour. The diffusion process we define below quantifies these fluctuations precisely.

3.2 Gaussian diffusion limit of fluctuations in linkage disequilibrium

We first provide a heuristic description of the diffusion limit. First, observe from (7) and (8) that, provided as and that , then

(10)

the deterministic exponential decay in LD typical of an infinitely large population. See Baake and Herms (2008) for a formal statement of this law-of-large-numbers type result for the Moran model with recombination. For the corresponding central limit theorem, we seek a diffusion limit for

(11)

for some rescaling . In our application the appropriate choice is

which can be regarded as the one on which both recombination and genetic drift are observable on the fastest timescale (Jenkins and Song, 2012). We will assume this scaling henceforward. To find the limit , write

(12)

where

describes the deviations of from its expected behaviour and is a martingale. It suffices to characterize the limits of each of the three grouped terms on the right of (12). For the first term we assume that it converges to a limit, as . For the second term, from (9) we should expect

(13)

Finally, we obtain a complete description of the limit as by an application of the martingale central limit theorem (Ethier and Kurtz, 1986, Theorem 7.1.4); we find

where , and is a -dimensional Brownian motion. In summary then, we expect to satisfy

(14)

Our main result formalizes this argument, as follows.

Theorem 3.1.

Suppose that as . Then for each , as ,

, where has Gaussian, independent increments with mean zero, and with

(15)

and , satisfying (14).

Proof of Theorem 3.1.

This is an application of a central limit theorem for density dependent population processes; for textbook coverage see Ethier and Kurtz (1986, Chapter 11) and for a recent treatment see Kang, Kurtz and Popovic (2014). We apply Theorem 2.11 of Kang, Kurtz and Popovic (2014). To do so we need to validate each of the assertions that led to (14) above by checking the following sufficient conditions (i)–(iv). (Kang, Kurtz and Popovic (2014, Theorem 2.11) is rather more general than is required here: it permits the state space of to be unbounded, and for to depend on other processes that evolve on faster timescales than that of the diffusion. We omit those conditions which are not needed.)

(i) The Moran process converges to an identifiable, deterministic limit. This is guaranteed by the following: the infinitesimal generator of satisfies

for a generator with domain .

(ii) Fluctuations about the deterministic limit are well behaved. More precisely, is a local martingale and the covariations processes .

(iii) Contributions of to the error can be identified. These would contribute to the limiting drift of , and a sufficient condition to identify them is: there exists a continuous function (recall ) such that

(iv) The martingale central limit theorem applies to . This is guaranteed by the following:

(16)

and there exists a continuous such that for each ,

(17)

We address each of these requirements in turn.

(i) Convergence of to , the generator of [see (10)], is immediate from Proposition 3.1. Convergence is uniform in because the terms in Proposition 3.1 have coefficients that are polynomials in on a compact space.

(ii) Since the state space is bounded, for to be a martingale it suffices that the jump rate is uniformly bounded (Kurtz, 1971, Proposition 2.1), as is the case for the Moran process. The covariations process as a consequence of (17), verified below.

(iii) From (9), , again uniformly in , so here the appropriate choice is . Thus, the only relevant contribution to the limit (13) is from the error rather than from .

(iv) Jumps of any component of are bounded in magnitude by , so

and (16) holds. To identify the asymptotic behaviour of , let

denote the total number of jumps of the Moran process into state by time , where is a collection of independent Poisson processes of unit rate and denotes the rate of transition of the process from current state to . Then

by (8). Thus we may take in (17) [ identifies the moments appearing in (15)]. ∎

Remark 3.1.

One could obtain the same diffusion limit starting from a Wright-Fisher model rather than a Moran model, since the means and covariances of its increments are identical to leading order, up to a rescaling of time. This alternative approach is in some respects less appealing since the Wright-Fisher model, when expressed in continuous time, is non-Markovian. The additional complications raised by this approach have been addressed by Norman (1975a) (see also Ethier and Nagylaki, 1980, 1988), and we have checked that the conditions of his theorems still apply when we introduce recombination to the Wright-Fisher model. The theory of Norman (1975a) has been used to study strong mutation and selection (Norman, 1972, 1975a; Kaplan, Darden and Hudson, 1988; Nagylaki, 1986, 1990; Wakeley and Sargsyan, 2009), and a Gaussian diffusion approximation of a Moran model with strong selection is developed by Feder, Kryazhimskiy and Plotkin (2014), but to the best of our knowledge this is the first time a central limit theorem has been obtained for strong recombination.

Remark 3.2.

The exponential decay of linkage disequilibrium implied by [equation (10)] is a classical result; the above theorem further quantifies the fluctuations about this deterministic behaviour in a fully time-dependent manner. In particular, the definition of [equation (11)] shows that fluctuations are of order