A Appendix

Abstract

In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computation resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.

Decoding coalescent hidden Markov models in linear time

Kelley Harris, Sara Sheehan, John A. Kamm, and Yun S. Song

Department of Mathematics, University of California, Berkeley

Computer Science Division, University of California, Berkeley

Department of Statistics, University of California, Berkeley

Department of Integrative Biology, University of California, Berkeley


kharris@math.berkeley.edu, ssheehan@eecs.berkeley.edu, jkamm@stat.berkeley.edu, yss@eecs.berkeley.edu

To appear in the Proceedings of the 18th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2014). The final publication is available at link.springer.com.

1 Introduction

The hidden Markov model (HMM) is a natural and powerful device for learning functional and evolutionary attributes of DNA sequence data. Given an emitted sequence of base pairs or amino acids, the HMM is well-suited to locating hidden features of interest such as genes and promotor regions [2, 5]. HMMs can also be used to infer hidden attributes of a collection of related DNA sequences. In this case, emitted states are a tuple of A’s, C’s, G’s and T’s, and the diversity of emitted states in a particular region can be used to infer the local evolutionary history of the sequences. When two sequences are identical throughout a long genetic region, they most likely inherited that region identical by descent from a recent common ancestor. Conversely, high genetic divergence indicates that the sequences diverged from a very ancient common ancestor [1, 15].

In recent years, coalescent HMMs such as the Pairwise Sequentially Markov Coalescent (PSMC) [15] have been used to infer the sequence of times to most recent common ancestor (TMRCAs) along a pair of homologous DNA sequences. Two other coalescent HMMs (CoalHMM [12, 4, 16] and diCal [24, 25]) also tackle the problem of inferring genealogical information in samples of more than two haplotypes. These methods are all derived from the coalescent with recombination, a stochastic process that encapsulates the history of a collection of DNA sequences as an ancestral recombination graph (ARG) [13, 29]. The hidden state associated with each genetic locus is a tree with time-weighted edges, and neighboring trees in the sequence are highly correlated with each other. Sequential changes in tree structure reflect the process of genetic recombination that slowly breaks up ancestral haplotypes over time.

The methods mentioned above all infer approximate ARGs for the purpose of demographic inference, either detecting historical changes in effective population size or estimating times of divergence and admixture between different populations or species. PSMC and CoalHMM have been used to infer ancestral population sizes in a variety of non-model organisms for which only a single genome is available [6, 19, 17, 30, 20, 28], as well as for the Neanderthal and Denisovan archaic hominid genomes [18]. Despite this progress, the demographic inference problem is far from solved, even for extremely well-studied species like Homo sapiens and Drosophila melanogaster [23, 7, 15, 27, 9]. Estimates of the population divergence time between European and African humans range from 50 to 120 thousand years ago (kya), while estimates of the speciation time between polar bears and brown bears range from 50 kya to 4 million years ago [10, 19, 3]. One reason that different demographic methods often infer conflicting histories is that they make different trade-offs between the mathematical precision of the model and scalability to larger input datasets. This is even true within the class of coalescent HMMs, which are much more similar to each other than to methods that infer demography from summary statistics [8, 21, 11] or Markov chain Monte Carlo [7].

Exact inference of the posterior distribution of ARGs given data is a very challenging problem, the major reason being that the space of hidden states is infinite, parameterized by continuous coalescence times. In practice, when a coalescent HMM is implemented, time needs to be discretized and confined to a finite range of values. It is a difficult problem to choose an optimal time discretization that balances the information content of a dataset, the complexity of the analysis, and the desire to infer particular periods of history at high resolution. Recent demographic history is often of particular interest, but large sample sizes are needed to distinguish between the population sizes at time points that are very close together or very close to the present.

In a coalescent HMM under a given demographic model, optimal demographic parameters can be inferred using an expectation-maximization (EM) algorithm. The speed of this EM algorithm is a function of at least three variables: the length of the genomic region being analyzed, the number of sampled haplotypes, and the number of states for discretized time. In most cases, the complexity is linear in , but the complexity in can be enormous because the number of distinct -leaved tree topologies grows super-exponentially with . PSMC and CoalHMM avoid this problem by restricting to be very small, analyzing no more than four haplotypes at a time. diCal admits larger values of by using a trunk genealogy approximation (see [22, 24, 25] for details) which is derived from the diffusion process dual to the coalescent process, sacrificing information about the exact structure of local genealogies in order to analyze large samples which are informative about the recent past.

To date, all published coalescent HMMs have had quadratic complexity in . This presents a significant limitation given that small values of lead to biased parameter estimates [16] and limit the power of the method to resolve complex demographic histories. PSMC is typically run with a discretization of size , but diCal and CoalHMM analyses of larger datasets are restricted to coarser discretizations by the cost of increasing the sample size. In this paper, we exploit the natural symmetries of the coalescent process to derive an alternate EM algorithm with linear complexity in . The speedup requires no approximations to the usual forward-backward probabilities; we perform an exact computation of the likelihood in time rather than time using an augmented HMM. We implement the algorithms presented in this paper to speed up our published method diCal, which is equivalent to PSMC when the sample size is two, yielding results of the same quality as earlier work in a fraction of the runtime. We have included the speedup in the most recent version of our program diCal; source code can be downloaded at http://sourceforge.net/projects/dical/.

2 Linear-Time Computation of the Forward and Backward Probabilities

We consider a coalescent HMM with hidden states and observations . For PSMC, is the discretized time interval in which two homologous chromosomes coalesce at locus , while is an indicator for heterozygosity. The method diCal is based on the conditional sampling distribution (CSD) which describes the probability of observing a newly sampled haplotype given a collection of already observed haplotypes. In diCal, the hidden state at locus is , where denotes the haplotype in the “trunk genealogy” (see [22]) with which coalesces at locus and denotes the discretized time interval of coalescence; the observation is the allele of haplotype at locus . For , diCal is equivalent to PSMC. In what follows, we present our algorithm in the context of diCal, but we note that the same underlying idea can be applied to other coalescent HMMs.

2.1 A linear-time forward algorithm

We use to denote the joint forward probability of observing the partial emitted sequence and the hidden state at locus . The probability of transitioning from state at locus to state at locus is denoted by , the stationary probability of state is denoted , and the emission probability of the observed allele given coalescence at onto haplotype with allele at locus is denoted by . When is obvious from the context, we sometimes use for . Explicit expressions for , , and in the context of our program diCal are given in [24].

The forward probabilities are computed using the recursion

(1)

which contains terms. Since there are also possibilities for , it should naively take time to compute the entire forward dynamic programming (DP) table . The key to achieving a speed-up is to factor (1) in a way that reflects the structure of the coalescent, exploiting the fact that many transitions between different hidden states have identical probabilities.

After a sampled lineage recombines at time between loci and , it will “float” backward in time from the recombination breakpoint until eventually coalescing with a trunk lineage chosen uniformly at random (Figure 1). This implies that whenever and , and exploiting this symmetry allows the forward table to be computed in time. This speed-up was already implemented in the algorithm described in Paul et al. [22].

Figure 1: (a). Here, we illustrate a transition from hidden state to hidden state that proceeds via recombination at time . The probability of this transition does not depend on the identity of the haplotype .(b). As a recombined lineage floats through time interval , it can either coalesce with the trunk (event ) or keep floating (event ) and eventually coalesce with the trunk in a more ancient time interval.

Another symmetry of the transition matrix, not exploited previously, can be found by decomposing the transition from locus to locus as a sequence of component events. In particular, let be the event that a recombination occurs during time interval , and let be the event that no recombination occurs between and . Then we have that

(2)

where if the event is true or otherwise. The factor corresponds to the probability that the sampled lineage coalesces with haplotype in the trunk genealogy.

If a recombination occurs in time interval , the sampled lineage will start to “float” freely back in time until it either coalesces in or floats into the next time interval (Figure 1). Specifically, we let denote the event where the sampled lineage recombines at or before and floats into , and denote the event where the recombined lineage coalesces back in interval . Noting that and are independent of whenever , and that coalescence happens as a Markov process backwards in time, we obtain

(3)

Explicit formulas (specific to the method diCal) for the above terms are provided in the appendix.

By combining (2) with (2.1) and then collecting terms in (1), we can remove the sum over when computing . In particular, we define additional forward probabilities

(4)
(5)
(6)

Then, (1) can be written as

(7)

This can be seen by noting that the first three terms in the sum correspond to the terms for , , and , respectively when putting together (1) and (2). Alternatively, (7) follows from directly considering the probabilistic interpretation of the terms as given by (4), (5), and (6).

The required values of and can be computed recursively using

(8)
(9)

with the base cases

Hence, using the recursions (7),  (8), and (9), it is possible to compute the entire forward DP table exactly in time.

2.2 A linear-time backward algorithm

The backward DP table can be also computed in time. Given the linear-time forward algorithm discussed in the previous section, the easiest way to compute the backward DP table is as follows: Let denote the reversed and let denote the hidden states for the HMM generating . Then, since the coalescent is reversible along the sequence,

3 Linear-Time EM via an Augmented HMM

The primary application of PSMC and diCal is parameter estimation, specifically the estimation of demographic parameters such as changing population sizes. This is done through a maximum likelihood framework with the expectation maximization (EM) algorithm. In this section, we describe how to speed up the EM algorithm to work in linear time.

3.1 The standard EM algorithm with time complexity

Let denote the parameters we wish to estimate, and denote the maximum likelihood estimate:

To find , we pick some initial value , and then iteratively solve for according to

where . The sequence is then guaranteed to converge to a local maximum of the surface .

Since forms an HMM, the joint likelihood can be written as

Letting denote the posterior expected number of loci where event occurs, and denote the total probability of observing , we then have

(10)

Note that we have to compute the term for every pair of states , which makes computing the EM objective function quadratic in the number of discretization time intervals, despite the fact that we computed the forward and backward tables in linear time.

3.2 A linear-time EM algorithm

By augmenting our HMM to condition on whether recombination occurred between loci and , the EM algorithm can be sped up to be linear in . We now describe this augmented HMM. Let denote our original HMM, with states and observations . Between loci and , define

Now let , and for . We let be the HMM with hidden variables , observations , transition probabilities , and emission probabilities . Note that the probability of observing the data is the same under and , i.e.,

and so we may find a local maximum of by applying the EM algorithm to the augmented HMM , instead of to the original HMM .

To compute the EM objective function for , we start by noting that the joint likelihood is

(11)

where we decomposed the joint likelihood into the initial probability, the emission probabilities, the transitions without recombination, and the transitions with recombination. We note that the initial probability can be decomposed as

(12)

and from (2.1), we decompose the product of transition recombination probabilities as

(13)

where . Figure 2 shows a graphical representation for the transitions of .

Figure 2: This diagram illustrates the flow of transition probabilities through the augmented HMM. Lineages may transition between different coalescence times at loci and by recombining and passing through the floating states represented by circles. Each interval contains three distinct floating states to capture the the dependence of recombination and coalescence probabilities on whether any of these events occur during the same time interval.

By plugging (12) and (13) into (11), then taking the posterior expected logarithm of (11), we obtain the EM objective function for :

(14)

where

(15)

The computation time for each of the posterior expectations and does not depend on ; full expressions are listed in the appendix. Hence, the number of operations needed to evaluate (14) is linear in .

We note another attractive property of (14). By decomposing the EM objective function into a sum of terms , we obtain a natural strategy for searching through the parameter space. In particular, one can attempt to find the of (14) by optimizing the one at a time in . In fact, for the problem of estimating changing population sizes, depends on almost entirely through a single parameter (the population size in interval ), and we pursue a strategy of iteratively solving for while holding the other coordinates of fixed, thus reducing a multivariate optimization problem into a sequence of univariate optimization problems.

Although both the linear and quadratic EM procedures are guaranteed to converge to local maxima of , they may have different rates of convergence, and may converge to different local maxima. The search paths of the two EM algorithms may differ for two reasons: first, the intermediate objective functions (10) and (14) are not equal, and secondly, as discussed above, we use different search strategies to find the optima of (10) and (14). We have no proven guarantee that either search should perform better than the other, but our observations indicate that the linear-time EM algorithm typically converges to a value of with a equal or higher value of than the quadratic-time algorithm, in a fraction of the time (see Figure 5 for an example).

4 Results

To confirm the decrease in runtime, we ran the linear-time diCal method on simulated data with  Mb of loci and haplotypes (in which case diCal is equivalent to PSMC), using discretization intervals. To simulate the data, we used ms [14] with a population-scaled recombination rate to generate an ARG, and then added mutations using a population-scaled mutation rate of and a finite-sites mutation matrix described in Sheehan et al. [24]. Figure 3 shows the time required to compute the table of forward probabilities. We also measured the time required for one EM iteration and then extrapolated to 20 iterations to approximate the time required to estimate an effective population size history (Figure 3). In both figures, the linear runtime of our new algorithm is apparent and significantly improves our ability to increase the number of discretization intervals.

To assess the gain in accuracy of population size estimates that is afforded by more discretization intervals, we ran both the linear- and quadratic-time methods on simulated data with haplotypes and  Mb. The conditional sampling distribution was used in a leave-one-out composite likelihood approach [24] in this experiment. To run each method for roughly the same amount of time ( hours), we used for the quadratic method and for the linear method. For both methods, we ran the EM for 20 iterations and inferred size change parameters. As measured by the PSMC error function, which integrates the absolute value of the difference between the true size function and the estimated size function [15], larger values of permit the inference of more accurate histories.

Figure 3: Runtime results on simulated data with  Mb and haplotypes, for varying number of discretization intervals. \subreffig:runtime_forward Runtime results (in minutes) for the forward computation. \subreffig:runtime_EM Runtime results (in hours) for the entire EM inference algorithm (20 iterations) extrapolated from the time for one iteration.
Figure 4: Effective population size change history results. The speedup from the linear method allows us to use a finer discretization () than the quadratic method () for about the same amount of runtime. \subreffig:simulated_results Results on simulated data with Mb and haplotypes. Using the quadratic method with , the error was 0.148. Using the linear method with , the error dropped to 0.079. \subreffig:real_results Results on European haplotypes over a  Mb region of chromosome 1. The out-of-Africa bottleneck is very apparent with , but is not as well characterized for .

We also ran our method on CEU haplotypes (Utah residents of European descent) sequenced during Phase I of the the 1000 Genomes Project [26] (Figure 4). We can see that for the quadratic method with , we are unable to fully characterize the out-of-Africa bottleneck. In the same amount of computational time, we can run the linear method with and easily capture this feature. The disagreement in the ancient past between the two methods is most likely due to diCal’s lack of power in the ancient past when there are not many coalescence events. Using a leave-one-out approach with haplotypes, the coalescence events in the ancient past tend to be few and unevenly spaced, resulting in a less confident inference.

Figure 5: Results on simulated data, using the same discretization for the linear and quadratic methods. Each method was run for 20 iterations. \subreffig:likelihood The log likelihood of the EM algorithm, plotted against time, for both the linear and quadratic methods. \subreffig:results_d16 Population size change history results for the linear and quadratic methods, run with the same discretization using and estimating parameters.

The runtime of the full EM algorithm depends on the convergence of the M-step, which can be variable. Occasionally we observed convergence issues for the quadratic method, which requires a multivariate optimization routine. For the linear method, we used the univariate Brent optimization routine from Apache Math Commons (http://commons.apache.org/proper/commons-math/), which converges quickly and to a large extent avoids local maxima.

To examine the convergence of the two EM algorithms, we ran the linear and quadratic methods on the simulated data with haplotypes and the same number of intervals . We examine the likelihoods in Figure 5. The linear method reaches parameter estimates of higher likelihood, although it is unclear whether the two methods have found different local maxima, or whether the quadratic method is approaching the same maximum more slowly. Figure 5 shows the inferred population sizes for each method, which although similar, are not identical.

We have also looked at the amount of memory required for each method, and although the difference is small, the linear method does require more memory to store the augmented forward and backward tables. A more thorough investigation of memory requirements will be important as the size of the data continues to increase.

5 Discussion

The improvement to diCal described in this paper will enable users to analyze larger datasets and infer more detailed demographic histories. This is especially important given that large datasets are needed to distinguish between histories with subtle or recent differences. By using samples of 10 haplotypes rather than 2, diCal v1.0 [24] was able to distinguish between histories that diverged from each other less than coalescent time units ago, in which period PSMC tends to exhibit runaway behavior and hence cannot produce reliable population size estimates. The faster algorithm described here can handle samples of 30 haplotypes with equivalent computing resources. Our results indicate that this improves the method’s ability to resolve rapid, recent demographic shifts.

In organisms where multiple sequenced genomes are not available, the resources freed up by HMM decoding could be used to avoid grouping sites into 100-locus bins. This binning technique is commonly used to improve the scalability of PSMC, but has the potential to downwardly bias coalescence time estimates in regions that contain more than one SNP per 100 bp.

In general, it is a difficult problem to choose the time discretization that can best achieve the goals of a particular data analysis, achieving high resolution during biologically interesting time periods without overfitting the available data. Sometimes it will be more fruitful to increase the sample size or sequence length than to refine the time discretization; an important avenue for future work will be tuning and to improve inference in humans and other organisms.

Another avenue for future work will be to develop augmented HMMs for coalescent models with population structure. Structure and speciation have been incorporated into several versions of CoalHMM and diCal, and the strategy presented in this paper could be used to speed these up, though a more elaborate network of hidden states will be required. We are hopeful that our new technique will help coalescent HMMs keep pace with the number and diversity of genomes being sequenced and tease apart the demographic patterns that differentiated them.

Acknowledgments. We are grateful to Matthias Steinrücken and other members of the Song group for helpful discussions. This research was supported in part by NSF Graduate Research Fellowships to K.H. and S.S., and by an NIH grant R01-GM094402 and a Packard Fellowship for Science and Engineering to Y.S.S.

Appendix A Appendix

a.1 Explicit Computation of Transition Probabilities

In Equations 2 and 12 from the main text, we decompose the transition probabilities and the stationary probability into the component terms , , , and . Here we give explicit formulae for these transition probabilities in terms of scaling factors that specify the relative effective population sizes within each time interval. These formulae are specific to the method diCal with variable population size but no population structure. Very similar computations could be used for diCal with population structure, as well as for PSMC, CoalHMM, and related methods.

In addition to , these formulae will include the recombination rate , scaled with respect to an implicit population size such that is the effective population size in interval and , where is the recombination rate per site per generation. Time intervals are defined with respect to a fixed sequence of time points , where the th time state is the interval between and . In addition, denotes the average number of lineages that are present at time in an -leaf coalescent tree, and is computed in [24].

We compute the components of the stationary and transition probabilities as follows: