Estimation of cell lineage trees by maximum-likelihood phylogenetics\corr
firstname.lastname@example.orgJF \email@example.comFAM \presentaddSeattle, Washington
CRISPR technology has enabled large-scale cell lineage tracing for complex multicellular organisms by mutating synthetic genomic barcodes during organismal development. However, these sophisticated biological tools currently use ad-hoc and outmoded computational methods to reconstruct the cell lineage tree from the mutated barcodes. Because these methods are agnostic to the biological mechanism, they are unable to take full advantage of the data’s structure. We propose a statistical model for the mutation process and develop a procedure to estimate the tree topology, branch lengths, and mutation parameters by iteratively applying penalized maximum likelihood estimation. In contrast to existing techniques, our method estimates time along each branch, rather than number of mutation events, thus providing a detailed account of tissue-type differentiation. Via simulations, we demonstrate that our method is substantially more accurate than existing approaches. Our reconstructed trees also better recapitulate known aspects of zebrafish development and reproduce similar results across fish replicates.
Recent advancements in genome editing with CRISPR (clustered regularly interspaced short palindromic repeats) have renewed interest in the construction of large-scale cell lineage trees for complex organisms [McKenna et al., 2016, Woodworth et al., 2017, Spanjaard et al., 2018, Schmidt et al., 2017]. These lineage-tracing technologies, such as the GESTALT method [McKenna et al., 2016] that we focus on here111Genome Editing of Synthetic Target Arrays for Lineage Tracing, inject Cas9 and single-guide RNA (sgRNA) into the embryo of a transgenic organism harboring an array of CRISPR/Cas9 targets separated by short linker sequences (barcodes). These barcodes accumulate mutations because Cas9 cuts are imperfectly repaired by non-homologous end joining (NHEJ) during development while Cas9 and sgRNA are available. The resulting mutations are passed from parent cell to daughter cell, which thereby encodes the ontogeny. Mutated barcodes are later sequenced from the organism, and computational phylogenetic methods are then used to estimate the cell lineage tree. Because these barcodes have great diversity, GESTALT provides researchers with rich data with the potential to reveal organism and disease development in high resolution.
The most common phylogenetic methods used to analyze GESTALT data are Camin-Sokal (C-S) parsimony [Camin and Sokal, 1965] and the neighbor-joining distance-based method [Saitou and Nei, 1987]. However these methods are blind to the operation of the GESTALT mutation process, so the accuracy of the estimated trees are poor [Salvador-Martínez et al., 2018]. In addition, existing methods supply branch length estimates in terms of an abstract notion of distance rather than time, limiting their interpretability. Therefore, these estimated trees only provide ordering information between nodes on the same lineage, but not for nodes on parallel lineages. In addition, C-S parsimony is unable to distinguish between equally parsimonious trees, so obtaining a single tree estimate is difficult in practice: We find over ten thousand parsimony-optimal trees for existing datasets. To address these challenges, we set out to develop a statistical model of the mutation process, allowing us to estimate branch lengths that correspond to time as well as the mutation parameters.
No appropriate likelihood model is currently available for GESTALT because CRISPR arrays violate many classical statistical phylogenetic assumptions. First, Cas9 enzymes may cut two targets simultaneously with the entire intervening sequence deleted during NHEJ. In addition, once the nucleotide sequence for a target is modified, Cas9 is no longer able to cut the target. Thus sites are not independent, the mutation process is irreversible, and cuts can introduce long insertion and/or deletions. In contrast, the classical phylogenetic assumptions are that individual nucleotide positions are independent and that the mutation process is reversible and only introduces point mutations [Felsenstein, 2004, Yang, 2014]. Finally, these types of methods assume that there are many independent observations — their estimates are unstable when the effective sample size is much smaller than the number of parameters to estimate [Goolsby, 2016, Adams and Collyer, 2018, Julien et al., 2018].
In this paper, we introduce GAPML (GESTALT analysis using penalized Maximum Likelihood), a statistical model for GESTALT and tree-estimation method (including topology and branch lengths) by an iterative procedure based on maximum likelihood estimation. We model barcode mutations as a two-step process: Targets are cut according to a continuous time Markov chain, immediately followed by random insertions or deletions of nucleotides (indels). Our method does not rely on the aforementioned assumptions. Instead we introduce the following assumptions tailored to the GESTALT setting:
(outermost-cut) an indel is introduced by cuts at the outermost cut sites
(target-rate) the cut rates only depend on which targets are active (i.e. able to be cut)
(indel-probability) the conditional probability that an indel is introduced only depends on which targets were cut.
From these assumptions, we show that the Markov process is “lumpable“ and the aggregated process is compatible with Felsenstein’s pruning algorithm, thereby enabling efficient computation of the likelihood [Kemeny and Laurie Snell, 1976, Felsenstein, 1981]. Since only a small number of barcodes are usually available in practice, we propose a regularization method on the branch length and mutation parameters to stabilize and improve estimates. Our method extends maximum-likelihood phylogenetic methods with branch-length penalties [Kim and Sanderson, 2008] to the setting where the tree topology is unknown.
We validate our method on simulated and empirical data. In simulations, our method is more accurate than current tree-estimation methods. In addition, we reconstruct cell lineage trees of transgenic zebrafish from McKenna et al.  and show that our trees better reflect the known biology of zebrafish development. Based on these results, we conclude that with appropriate statistical techniques it is possible to reconstruct an accurate cell lineage tree with current GESTALT technology, which addresses some concerns raised in Salvador-Martínez et al. . Our simulation engine and estimation method are available on Github (https://github.com/matsengrp/gestaltamania).
2.1 Brief description of our probabilistic GESTALT evolution model
We model the GESTALT barcode (see Figure 1) as a continuous time Markov chain where the state space is the set of all nucleotide sequences. A state transition is an instantaneous event where either (1) an unmodified target is cut then the repair process inserts/deletes nucleotides around the cut site, or (2) two unmodified targets are cut, the intervening sequence is removed, and the repair process inserts/deletes nucleotides around the cut sites. The transition rate between barcode sequences depends on the entire sequence and each target is associated with a separate cut rate. If multiple copies of the barcode are used, we assume the barcodes are on separate chromosomes or are sufficiently far apart that they act in an independent and identically distributed (iid) manner.
We use this Markov model for GESTALT barcodes evolving along a cell lineage tree where the vertices represent cell divisions and the edge lengths represent time between cell divisions. The full cell lineage tree describes the relationships of all cells in the organism. Since we only collect a small sample of all the cells, our goal is to recover the subtree describing the development of the observed sequences.
To estimate this subtree, our method needs to calculate the likelihood of possible trees and model parameters, which requires an enumeration of the possible barcodes at each internal node. However a full enumeration is infeasible. For example, a double cut (transition (2) described above) could remove one or more targets, which could have themselves been modified in an infinite number of possible ways before the double cut erased this history.
We have carefully chosen our assumptions to make this likelihood tractable yet maintain biological realism. Briefly, under the target-rate and indel-probability assumptions, we can group states together if they share the same set of unmodified targets to calculate the likelihood more efficiently, a property known more formally known as “lumpability” (Figure 2). Since the number of targets in a barcode is typically small (e.g. 10 targets per barcode in McKenna et al. , Schmidt et al. ), calculating the likelihood becomes computationally feasible. In addition, the outermost-cut assumption allows us to exclude many groups from the likelihood computation so that the number of enumerated groups at most internal nodes is typically linear in the number of unique indels.
2.2 A maximum-likelihood tree estimation procedure
We follow current best practice for maximum-likelihood phylogenetics by optimizing the tree and mutation parameters of our model using a hill-climbing iterative search over tree space. First, we initialize the tree topology by selecting a random parsimony-optimal tree from C-S parsimony. At each subtree prune and regraft (SPR) iteration, we select a random subtree and regraft where the penalized log likelihood is highest (Figure 2(c)). The method stops when the tree no longer changes. At each iteration, we only consider SPR moves that preserve the parsimony score as we have found that the parsimony-optimal trees tend to have the highest likelihoods (Figure 18). The entire algorithm is presented in Algorithm 1. We discuss some important details of our method below.
We maximize a penalized log likelihood as opposed to the unpenalized version since the latter tends to give unstable and inaccurate estimates when the dataset is generated by a small number of barcodes. In particular, the length of the leaf branches and the variance of the target rates tend to be overestimated in such settings. Thus we use a penalty function that discourages large differences in branch lengths and target cut rates. Penalization introduces a slight complication since certain candidate SPR moves have naturally larger penalties. In order to make the penalty comparable between candidate SPR moves, we randomly select a leaf in the subtree and apply the candidate SPR moves only to that single leaf. When scoring the SPR moves, the penalty is calculated for the shared subtree, i.e. the tree where we ignore the random leaf. Finally, we regraft the entire subtree where the penalized log likelihood is highest.
Our method is able to estimate the tree at a finer resolution than existing methods (Figure 2(a)). The most commonly used method, C-S parsimony, produces estimates at the coarsest resolution: For nodes where the ordering is ambiguous, the method simply groups them under a single parent node. This commonly results in tree estimates with many multifurcating nodes (nodes with 3+ children) that have ten or more children. Our method uses the estimated model parameters to estimate the order and time of ambiguous nodes by projecting the subtree onto the space of caterpillar trees (Figure 2(b)). By producing tree estimates at a finer resolution, our method allows researchers to learn more about the structure of the true cell lineage tree. In addition, taking advantage of the irreversibility property, we efficiently estimate the branch ordering within the caterpillar trees by solving a single optimization problem, rather than considering each possible ordering separately.
2.3 Simulation engine and results
We built a simulation engine of the GESTALT mutation process during embryonic development. Since cell divisions during embryonic development begin in a fast metasynchronous fashion and gradually become more asynchronous [Moody, 1998], the simulation engine generates a cell lineage tree by performing a sequence of synchronous cell divisions followed by a birth-death process where the birth rate decays with time. We mutate the barcode along this cell lineage tree according to our model of the GESTALT mutation process. The simulation engine can generate data that closely resembles the data collected from zebrafish embryos in McKenna et al.  (Figure 3(a)). We can input different barcode designs into the simulation engine to understand how they affect our ability to reconstruct the cell lineage tree.
We used our simulation engine to assess the validity and accuracy of the estimated model parameters and tree. Because our method infers branch lengths, we evaluate the accuracy using two metrics that include branch length information: BHV distance [Billera et al., 2001] and internal node height correlation (see Figure 15). We compare our method to a simpler model-free approach: estimating the tree topology using C-S parsimony [Camin and Sokal, 1965] or neighbor-joining (NJ) [Saitou and Nei, 1987] and then applying semiparametric rate smoothing (chronos in the R package ape) to estimate branch lengths [Sanderson, 2002]. We will refer to these two approaches as “CS+chronos” and “NJ+chronos.” We do not compare against the original tree estimates from C-S parsimony and neighbor-joining since those branch lengths correspond to edit distance and have very poor performance according to our two metrics. Our method consistently outperforms these alternative methods (Figure 3(c)). We note that previous in silico analyses of GESTALT measure accuracy in terms of the Robinson-Foulds (R-F) distance, which only depends on the tree topology [Salvador-Martínez et al., 2018]. However the R-F distance does not recognize that different tree topologies can be very similar depending on their branch lengths, and is therefore too coarse as a performance metric.
We find, based on the simulations, that our likelihood-based method improves in performance as the number of independent barcodes increases (Figures 3(b)). In a simulation with a six-target barcode, the estimated tree from a single barcode has internal node height correlation of 0.5 with the true tree whereas using four barcodes increases the correlation to 0.9. Even though other analyses of GESTALT have recommended increasing the number of targets in a single barcode to improve tree estimation [Salvador-Martínez et al., 2018], it is more effective to increase the number of targets by introducing independent (and identical) barcodes (Figure 16).
2.4 Improved zebrafish lineage reconstruction
To validate our method, we reconstructed cell lineages using our method and other tree-building methods on GESTALT data from zebrafish [McKenna et al., 2016]. As the true cell lineage tree is not known for zebrafish, we employed more indirect measures of validity. For each method, we asked (1) if similar conclusions could be made across different biological replicates and (2) if the tree estimates aligned with the known biology of zebrafish development. The dataset includes two adult zebrafish where cells were sampled from dissected organs. The organs were chosen to represent all germ layers: the brain and both eyes (ectodermal), the intestinal bulb and posterior intestine (endodermal), the heart and blood (mesodermal), and the gills (neural crest, with contributions from other germ layers). The heart was further divided into four samples— a piece of heart tissue, dissociated unsorted cells (DHCs), FACS- sorted GFP+ cardiomyocytes, and non-cardiomyocyte heart cells (NCs). In addition, datasets were collected from embryos before gastrulation (dome stage, 4.3 hours post-fertilization (hpf)), at pharyngula stage (30 hpf), and from early larvae (72 hpf), where the cell type assignments are unknown.
GAPML captures more similar developmental relationships between tissue types across the two adult fish replicates.
For each estimated tree, we calculated the distance between tissues — the average tree distance between a leaf of one tissue to the closest internal node leading to a leaf from the other tissue, weighted by the allele abundance (Figure 5). (All alleles that were found in the blood were removed since blood is found in all dissected organs and can confound the relationship between organs McKenna et al. .) Recall that all of the fitting procedures are completely agnostic to any tissue source or cell abundance information. For a good method, we expect the correlation between tissue distances from the two fish to be close to one. We tested if the correlations were significant by permuting the cell types and abundances in the estimated trees. The correlation was 0.770 () using our method whereas ‘CS+chronos’ and ‘NJ+chronos’ had correlations of 0.306 () and -0.325 (), respectively. One might be concerned that our method is consistent across fish replicates because it returns very similar trees regardless of the data. However, this is not the case: When we re-run our method with randomly permuted cell types and abundances, the average correlation between the tissue distances drops to zero.
2.4.1 GAPML estimates similar mutation parameters across fish replicates.
For each time point, the fish replicates were traced using the same GESTALT barcode and processed using the same experimental protocol (Table 5(a)). We compared the estimated target rates from our method to those estimated using a model-free empirical average approach where the estimated target cut rate is the proportion of times a cut was observed in that target in the set of unique observed indels. The average correlation between the estimated target rates from our method were much higher than that for the alternate approach (Figure 5(a)). In fact, we can also compare target cut rates between fish of different ages that share the same barcode, even if the experimental protocols are slightly different. The 4.3hpf and 3day fish share the same barcode version and we find that the target rate estimates are indeed similar (Figure 5(b)). Again, a possible concern is that our method may have high correlation because it outputs very similar values regardless of the data. However, the estimated target cut rates were different for fish with different barcode versions. More specifically, the 30hpf fish used version 6 of the GESTALT barcode whereas the other fish used version 7. Visually, the target rates look quite different between those in the 30hpf fish and the other fish with the version 7 barcode (Figure 5(b)). Calculating the pairwise correlations between the estimated rates in 30hpf versus 3day fish, the average correlation, 0.416, is quite low and the bootstrap 95% confidence interval, (0.046, 0.655), is very wide and nearly covers zero.
GAPML recovers both cell-type and germ-layer restriction.
It is well known that cells are pluripotent initially and specialize during development. To evaluate recovery of specialization by tissue type, we calculated the correlation between the estimated time of internal tree nodes and the number of descendant tissue types; to evaluate recovery of specialization by germ layer, we calculated the correlation between the estimated time of internal nodes and the number of germ layers represented at the leaves. (As before, all the estimation methods do not use the tissue origin and germ layer labels.) Since any tree should generally show a trend where parent nodes tend to have more descendant cell types than their children, we compared our tree estimate to the same tree but with random branch length assignments and randomly permuted tissue types. Our method estimated much higher correlations compared to these random trees (Table 6(a)). We show an example of the node times versus the number of descendant cell types and germ layers in Figure 6(b). The estimated correlations from the other methods tended to be closer to zero compared to those in GAPML in all cases, except when using ‘NJ + chronos‘ to analyze the second adult fish. However upon inspection, the correlation is high for ‘NJ + chronos‘ because it estimates that cells are pluripotent for over 90% of the fish’s life cycle and specialize during a small time slice at the very end.
2.5 Analysis of the zebrafish GESTALT data
In this section, we analyze the fitted trees of the adult zebrafish in more detail. Our primary goals are to (1) check if summaries concord with known zebrafish biology, (2) generate new hypotheses about zebrafish development, and (3) generate new hypotheses on how to improve the experimental procedure. Again, as our method is agnostic to the tissue types, our trees have no prior assumptions or particular biases regarding the relationships between cell types.
Here we focus on the ordering and relative length of events. We ignore absolute estimated times since our fitting procedure for a single barcode heavily penalizes large differences between branch lengths. Though this procedure aids estimation accuracy, it also heavily biases the absolute time estimates. Thus in the figures, we scale time to be between 0 to to draw focus away from the absolute times. (We anticipate the branch length estimates to be more accurate to improve with the GESTALT technology. According to our simulations, the absolute branch length estimates are much more reliable when the several barcodes are inserted into the organism.)
We begin with a coarse summary of the cell lineage tree: We plot the average distance between a leaf node of one tissue type to the most recent ancestor of each different tissue type (Figure 5). This matrix recapitulates some well-established facts about zebrafish development. For example, we estimate that tissue types from the endoderm and mesoderm tended to have shorter shared lineage distances; these tissue types tended to separate from the ectodermal tissues earliest. This signal potentially captures the migration of the mesoderm and endoderm through the blastopore, isolating them from the ectoderm [Solnica-Krezel, 2005]. In addition, previous studies have found that gills are formed when the anterior part of the intestine grows toward and fuses with the body integument [Shadrin and Ozernyuk, 2002]. The distance matrix shows a large proportion of gill cells dividing late from the other endoderm and mesoderm layers.
The distance matrix also shows that the GFP+ cardiomyocytes tend to be farthest away from other tissue types, which could be either a developmental signal or an artifact of the experimental protocol. GFP+ cardiomyocytes were sorted using fluorescence-activated cell (FACS) and this purity could drive their separation from the other more heterogeneous organ populations. An interesting biological speculation would be that the heart is the first organ to form during vertebrate embryo development and, in particular, the myocardial cells are the first to develop, driving this observed signal [Keegan et al., 2004]. These observations show GAPML’s improved lineage distance estimation provide a more refined measure of the developmental process, and as our simulations show, will only improve as experimental approaches becomes more sophisticated.
The full cell lineage tree estimated using GAPML for the first adult zebrafish provides significantly more detail than the Camin-Sokal parsimony tree inferred for the original McKenna et al.  publication (Figure 8). Our tree has estimated branch lengths whereas the branches were all unit-length in McKenna et al. . In addition, the bolded lines in our tree correspond to the caterpillar spines where we have estimated the ordering between children of multifurcating nodes. Since the original maximum parsimony tree estimate in McKenna et al.  contained many multifurcating nodes and our method converts any multifurcating node to a caterpillar tree, our final tree contains many caterpillar trees. The longest caterpillar spine in our estimated tree starts from the root node and connects all the major subtrees that share no indel tracts. As the zebrafish embryo rapidly divides from the single-cell stage, these initial CRISPR editing events establish the founding cell in each subtree. GAPML estimates the target cut rates to order the events along the caterpillar trees, an impossible task in the original Camin-Sokal multifurcating trees. Lastly, we observe that the last three subtrees at the end of this spine (farthest away from the root) are primarily composed of alleles only observed in the intestinal bulb and the posterior intestine. This concords with our understanding of zebrafish development: Of the dissected organs, the digestive tract is the last to fully differentiate at day four [Moody, 1998]. In aggregate, these examples again show the power of a refined lineage tree to establish new interesting biological questions and a refined map in which to answer them.
2.6 Analysis of GESTALT barcode mutation parameters
Finally, our model’s estimated target cut rate parameters (Table 1) provide an interesting resource when considering redesigns of the GESTALT barcode. Here we focus on the GESTALT barcode in the adult fish. The estimated target cut rates were very similar across the two fish replicates.
We estimated very different cut rates across the ten targets. Target 1 and 9 had the highest cut rates; target 3 had the lowest cut rate. The ratio between the highest and lowest cut rates is at least 10 in both fish, i.e. a deletion at target 1 is at least 10 times more likely to occur than at target 3. In terms of the tree estimation, the targets with high cut rates mainly help capture early cell divisions whereas targets with low cut rates tend to capture late cell divisions. Having a broad spectrum of target cut rates is useful for capturing cell divisions throughout the tree, though the specific details depends on the true tree. Our simulation engine may be useful for understanding how variation in the target rates affects estimation accuracy under various conditions.
The double cut rate is similar across the fish. The rate is quite high: For the first adult fish, the double cut rate of 0.076 means that the probability of introducing a single-target indel as opposed to an inter-target indel in an unmodified barcode is 59%. To counter this, we can decrease the number of long inter-target deletions (and the number of masked events) by placing high cut-rate targets closer together in the barcode. One potentially helpful barcode design is to place the highest cut-rate targets in the center and the lowest cut-rate targets on the outside. Alternatively, designers could arrange the targets from highest to lowest cut rate. Table 1 shows that the current barcode design is counter to our suggestion, as the two targets with the highest cut rates in the two adult fish are targets 1 and 9.
The characterization of target efficiencies in a compact multi-target barcode is challenging problem. Our method can help steer the next generation of CRISPR-based lineage recording technologies to have increased recording capacity.
|Adult fish #1||Adult fish #2||3 day #1||30 hpf #5||4.3 hpf #1|
|Double cut rate||0.076||0.084||0.052||0.090||0.065|
|Left trim zero prob||0.015||0.028||0.015||0.258||0.025|
|Left trim length mean||6.330||6.634||5.165||12.000||6.405|
|Left trim length SD||4.956||5.184||4.245||6.633||4.998|
|Right trim zero prob||0.906||0.834||0.890||0.238||0.818|
|Right trim length mean||4.945||3.759||3.716||4.800||3.478|
|Right trim length SD||6.173||5.534||5.529||4.011||5.360|
|Insertion zero prob||0.400||0.411||0.401||0.520||0.419|
|Insertion length mean||5.085||4.540||5.786||5.446||4.589|
|Insertion length SD||5.798||5.533||5.219||7.291||4.708|
In this manuscript, we have proposed a statistical model for the mutation process for GESTALT, a new cell lineage tracing technology that inserts a synthetic barcode composed of CRISPR/Cas9 targets into the embryo. Our method, GAPML, estimates the cell lineage tree and the mutation parameters from the sequenced modified barcode. Unlike existing methods, our method estimates branch lengths and the ordering between children nodes that share the same parent. We demonstrate that our method outperforms existing methods on simulated data, provides more consistent results across biological replicates, and outputs trees that better concord with our understanding of developmental biology. We have answered the question "Is it possible to reconstruct an accurate cell lineage tree using CRISPR barcodes?” in Salvador-Martínez et al.  in the affirmative: The cell lineage tree can be estimated to a high degree of accuracy as long as appropriate methods are used.
Our method provides a number of technical contributions to the phylogenetics literature. The GESTALT mutation process violates many of the classical phylogenetic assumptions that existing methods rely upon for computational tractability. Thus we determined the most appropriate assumptions that are most suitable in this new setting and developed new methods so that the likelihood is computationally tractable and the estimated trees are accurate. We believe these techniques could be useful for other phylogenetic problems where the independent-site assumption does not hold. In addition, our methods may be useful as a jumping off point for analyzing other CRISPR-based cell lineage tracing technologies, such as that using homing CRISPR barcodes [Kalhor et al., 2017, 2018]. There are still many areas of improvements for the current method, such as quantifying the uncertainty of our estimates, estimating meta-properties about the cell lineage tree for organisms of the same species, and utilizing data sampled at multiple time points.
Finally, the biological results were relatively limited since the goal of this manuscript was primarily on methods development and the data in McKenna et al.  only provide tissue source information. In future work, we plan to apply our method to analyze data where each allele is paired with much richer information, such as single-cell gene expression data [Raj et al., 2018].
4 Materials and Methods
The data processed in this paper are all from McKenna et al.  and are available at the Gene Expression Omnibus under GSE81713. We use the aligned data where each allele was described with the observed insertion/deletions (indels) at each target. Each CRISPR target can only be modified once and indels can only be introduced via a double-stranded break at a target cut site, so we further processed the aligned data accordingly: merging indels if there were more than one associated with a given target, and extending the deletion lengths and insertion sequences so that a target cut site was nested within each indel. For this paper, we assume that the processed data is correct and do not attempt to model the effects of alignment error.
4.2 Methodological contributions
Here we highlight the key methodological contributions that we needed to develop in order to analyze GESTALT. We needed to develop new methodology because the GESTALT mutation process violates many classical assumptions in phylogenetics.
Since GESTALT is a new technology, we introduce new mathematical abstractions for the biological process. We then consider all statistical models that satisfy our proposed set of assumptions. We carefully designed these assumptions to balance biologically realism and computationally feasibility. We next show that such models are “lumpable” and use this to efficiently calculate the likelihood. Even though lumpability has been used to reduce computation time for general Markov chains, this idea is rarely used in phylogenetics. We show how lumpability can be combined with Felsenstein’s algorithm if the mutation process is irreversible.
In addition, our method estimates trees at a finer resolution compared to other methods, which leads to better tree estimates. In particular, we resolve the multifurcations as a caterpillar tree and show how to efficiently search over caterpillar tree topologies. As far as we know, this is one of the few methods that tunes the tree topology, which is typically treated as a combinatorial optimization problem, by solving a continuous optimization problem. The closest equivalent appears in the Bayesian phylogenetics literature where local changes to the topology may be introduced via a Subtree-Slide operator [Hohna et al., 2008].
To handle the small number of barcodes, we improve the stability of the method by maximizing the penalized log likelihood. Previous phylogenetic methods that penalize branch lengths assume that the tree topology is known [Kim and Sanderson, 2008, Zhang et al., 2018]. However, the tree topology is unknown in GESTALT. We show how penalization can be combined with tree topology search methods. Combining the two is not trivial since a naïve approach will bias the search towards incorrect topologies.
Finally, we use an automatic differentiation framework to optimize the phylogenetic likelihood. Automatic differentiation software has accelerated deep learning research since they allow researchers to quickly iterate on their models [Baydin et al., 2018]. Likewise, we found that this tool greatly accelerated our progress and, through this experiment, we believe that this tool may greatly accelerate the development of maximum likelihood estimation methods in phylogenetics.
4.3 GESTALT framework and definitions
In this section, we present mathematical definitions for the many components in GESTALT, though begin in this paragraph by giving an overview in words. We begin with defining the barcode and the individual targets within it. A barcode is mutated when nucleotides are inserted and/or deleted, which is referred to as an indel tract. We then define a possibly-mutated barcode or allele as a collection of the observed indel tracts. Finally, we use all these abstractions to define the barcode mutation process, which is framed as stochastic process where the state space corresponds to all possible alleles and the transitions correspond to indel tracts. To aid the reader, Table 2 briefly summarizes the definitions used in the paper.
The unmodified barcode is a nucleotide sequence where disjoint subsequences are designated as targets (Figure 8(a)). The targets are numbered from 1 to from left to right and the positions spanned by target are specified by the set . Each target is associated with a single cut site .
A barcode can be modified by the introduction of an indel tract. An indel tract, denoted by , is a mutation event in which targets and are cut (), positions in the unmodified barcode are deleted, and a nucleotide sequence is inserted. If , only a single target is cut. When , then no positions are deleted. If is of length zero, then no nucleotides are inserted. We only consider indel tracts that modify the sequence, i.e. either or has positive length, and nest at least one target cut site between positions and .
A possibly-modified barcode, also referred to as an allele, is a sequence of disjoint indel tracts associated with a single barcode (Figure 8(b))
where and for . The positions in the indel tracts always refer to the positions in the original unmodified barcode: deletions and insertions do not change the indexing. Let be the set of all possible alleles.
A target is active in allele if none of the nucleotides in are modified. That is, the status of target in allele is
So is 0 if target is active and 1 if it is inactive. For convenience, denote the target status of allele as
We now introduce the rules governing how alleles change through the introduction of indel tracts. First, transitions between possible allele states are constrained by the status of the targets. For a given allele , we can introduce the indel tract if and only if targets and are active. Note that the set of transitions allowed under this rule is a superset of biologically-plausible transitions. For example, even if position is deleted, introducing indel tract is formally allowed in our scheme. However in order to exclude biologically-implausible transitions, our models assign near-zero probability to such transitions.
Let be the resulting allele from introducing indel tract into allele . If indel tract does not overlap any other indel tract in , then is simply the union . If completely masks other indel tracts, i.e. all indel tracts in are either completely within the range of to or are completely outside of the range, then is the resulting allele after removing the masked indel tracts and adding . The last possibility is that is adjacent or overlaps, but does not fully mask, other indel tract(s) in ; then is the resulting allele after properly merging overlapping indel tracts (Figure 10). From a biological perspective, it is impossible to introduce overlapping but non-masking indel tracts. However the likelihood model is much simpler if we allow such events to happen. Since deletion lengths tend to be short in the empirical data (75% quantile = 10, McKenna et al. ), we will estimate that long deletion lengths occur with small probability, which means that these overlapping but non-masking indel tracts have very small probabilities. Thus we believe our model closely approximates the GESTALT mutation process. We will also discuss an assumption that removes these overlapping indel tracts from the likelihood calculation entirely in more detail in the following section.
4.4 GESTALT Model and Assumptions
|Markov process along branch ending with node|
|Markov process along branches ending with nodes in set|
|Observed allele at leaf|
|Observed alleles at leaves in set|
|Positions of target in the unmodified barcode|
|Cut site of target|
|Leaves of node|
|Descendants of node|
|Status of targets in allele . 1 in position indicates target is inactive||(2)|
|Indel tract that cuts targets and , deletes positions to , inserts|
|Target tract: all indel tracts that cut targets and and deactivate targets to||(5)|
|The target tract that indel tract belongs to|
|Wildcard: any indel tract that only deactivates targets with indices to||(8)|
|Singleton-wildcard: union of an indel tract and its inner wildcard||(9)|
|The set of likely ancestral states for node||(7)|
Here we define the GESTALT model more concretely and formalize the model assumptions presented in the Introduction. Recall the three model assumptions. First, as we rarely observe very long deletion lengths in the GESTALT data, we assume that indel tracts are introduced by cuts at the outermost cut sites (assumption A). In addition, based on our biological understanding of the CRISPR/Cas9 mutation process, we assume that mutations occur in two stages: first the targets are cut according to the rates of the active targets (assumption B) and then nucleotides are deleted/inserted according to a process that only depends on which targets were cut (assumption C). Figure 11 presents a flowchart of how the assumptions are used to derive later results.
The barcode mutation process up to time is formulated as a continuous time Markov chain with state space . As mentioned before, the choice of using state space implicitly assumes that indel tracts are introduced instantaneously, i.e. nucleotides are inserted and/or deleted immediately after target(s) are cut.
For a given tree , let be the length of the branch ending with node and let be the Markov process along the branch. In addition, let be the allele observed at that leaf, and let denote all leaves with ancestral node . The set of leaves in the entire tree is denoted . As notational shorthand, we denote the Markov process for the branches with end nodes in the set as . In addition, the observed alleles in the leaf set are denoted .
If there are multiple barcodes, we use the notation to represent the process for the th barcode and to represent the allele observed on the th barcode. In the manuscript, we assume the barcodes are independently and identically distributed. Therefore we will typically discuss the model and assumptions in the context of a single barcode and omit the index of the barcode.
Unfortunately, calculating the likelihood of the tree for a general model where the mutation rate depends on the entire sequence is computationally intractable: We would need to compute the transition rates between an infinite number of barcode states. Instead we introduce a model assumption so that we can aggregate the possible barcode states into lumped states indexed by target statuses. Then we can compute the likelihood using transition matrices of dimension at most . As is typically small ( in McKenna et al. ), the assumption makes the likelihood computationally feasible.
First we formalize the outermost-cuts assumption, which states that the cuts for an indel tract occur at the outermost cut sites. We define this mathematically by requiring that for any indel tract where targets and are cut, the deletions to the left and right cannot extend past the cut site of the neighboring targets and .
Assumption 1 (outermost-cuts).
All indel tracts are of the form where
This assumption limits the possible mutation histories of the alleles. Note that Assumption 1 still allows indel tracts to deactivate targets immediately neighboring the cut site. That is, an indel tract can either have a short deletion to the left so that target is unaffected or a long deletion to the left such that target is deactivated, i.e.
We can have similar short and long deletions to the right.
For the second assumption, let us introduce the concept of a target tract, which is a set of indel tracts that cut the same target(s) and deactivate the same target(s). A target tract, denoted , is the set of all indel tracts that cut targets and and delete nucleotides such that targets through are inactive, i.e.
Note that we always have that . Each indel tract belongs to a single target tract; we denote its associated target tract by .
This second assumption decomposes the mutation process into a two-step process where targets are cut and then indels are introduced; and combines the target-rate and indel-probability assumptions. In particular, the assumption states that the instantaneous rate of introducing an indel tract can be factorized into the rate of introducing any indel tract from a target tract, which depends on the target status of the current allele, and the conditional probability of introducing an indel tract which only depends on the target tract.
Assumption 2 (rate-rate, indel-probability).
Let be any allele, be any indel tract that can be introduced into , and be the target tract . The instantaneous rate of introducing indel tract in allele can be factorized into two terms: first, a function that only depends on , , and time , then second, the conditional probability of introducing given :
Thus the instantaneous rate of introducing only depends the allele through its target status. Using this assumption, we will show that the mutation process is equivalent to a continuous time Markov chain where we lump together possible allele states that share the same target status.
4.5 Likelihood approximation: summing over likely ancestral states
The likelihood of a given tree and mutation parameters is the sum of the probability of the data over all possible mutation histories. There are many possible ancestral states since inter-target deletions can mask previously-introduced indel tracts. In this section, we present an approximation of the likelihood that only sums over the likely ancestral states and ignores those with very small probabilities. We also provide a simple algorithm that efficiently specifies the set of these likely states, which is useful when we actually implement the (approximate) likelihood calculation.
We first address the problem that inter-target indel tracts have too many possible histories for brute-force computation. Not only can inter-target deletions mask previously-introduced indel tracts, but they can also result from merging overlapping multiple indel tracts. To simplify the likelihood calculation, we ignore any mutation history where indel tracts merge. We are motivated to do so by observing that indel tracts rarely merge – merging only occurs when many nucleotides are deleted whereas we mostly observe short deletions in the experimental data. Thus, mutation histories involving merging indel tracts probably contribute a negligible fraction to the total likelihood. The approximation is therefore as follows:
We approximate the likelihood with the probability of the data over all possible mutation histories where no two indel tracts are ever merged:
Note that this approximation strictly lower bounds the full likelihood since we sum over a subset of the possible histories.
Based on Approximation 1, we define a partial ordering between alleles. Given two alleles , we use to indicate that can transition to the allele without merging indel tracts, i.e. there is a sequence of indel tracts such that
where no indel tracts merge.
Now we present an algorithm to concisely express the set of alleles that are summed over at each internal node under Approximation 1. For the tree and observed alleles , define the set of likely ancestral states at each internal node as
We define over leaf nodes in the same way as (7), but we interpret this set as the alleles that likely preceded the observed allele. Henceforth, we use the shorthand notation whenever the context is clear. The approximate likelihood from Approximation 1 is equal to summing over at each internal node .
To construct the sets , we also need sets of indel tracts of the following forms (Figure 11(a)):
wildcard222In software systems, a wildcard is a symbol used to represent one or more characters (e.g. “*”). Similarly, we define wildcard here as all indel tracts that only deactivate targets within a specified range. : the set of all indel tracts that only deactivate targets within the range to
singleton-wildcard : the union of the singleton set and its inner wildcard, if it exists:
The singleton of singleton-wildcard (9) refers to and the inner wildcard of a singleton-wildcard refers to if it exists and otherwise.
Two singleton-wildcards are disjoint if the range of their positions don’t overlap. A wildcard is disjoint from a wildcard/singleton-wildcard if the range of their targets don’t overlap.
Given a set of indel tracts , let the alleles generated by , denoted , be the set of alleles that can be created using subsets of , i.e. is
We are interested in wildcards and singleton-wildcards because for any leaf node with indel tracts for , a superset of is the alleles generated by the union of their corresponding singleton-wildcards, i.e.
In fact, the following lemma states that we can use a recursive algorithm to compute supersets of . The algorithm starts at the leaves and proceeds up towards the root. We first use (10) to compute supersets of for all leaf nodes . Now consider any internal node with two (direct) children nodes and . For mathematical induction, suppose that we have already computed supersets of for and that are the alleles generated by unions of wildcards/singleton-wildcards. We compute the superset of as the alleles generated by the nonempty pairwise intersections of wildcards and/or singleton-wildcards corresponding to and . Since the intersection of wildcards and/or singleton-wildcards is always a wildcard or singleton-wildcard, we can also write this superset as the alleles generated by the union of disjoint wildcards and/or singleton-wildcards. We can repeatedly intersect wildcards/singleton-wildcards for internal nodes with multiple children nodes. The proof for the lemma is straightforward so we omit it here.
Consider any internal node with children nodes , …, . Suppose for each child , we have that
where are pairwise disjoint wildcards and/or singleton-wildcards. Then, we can also write in the form of (11) where are the non-empty intersections of , i.e.
For efficiency reasons, we are not satisfied with computing supersets of ; rather, we would like to concisely express the set of alleles that is exactly equal to . The only case in which the algorithm computes a strict superset of is when the alelles observed at the leaves of imply that the observed indel tracts must be introduced in a particular order. For example, if an allele has indel tracts and , we know that must be introduced before if cuts target and deactivates target (Figure 11(b)). Due to this ordering, we may find that the same indel tract observed in two alleles must have been introduced independently (also known as homoplasy in the phylogenetics literature). To indicate such orderings, we use the notation to denote that “if indel tract is in allele , then indel tract must also be in .“ The set of alleles respecting this ordering constraint is denoted
Per this definition, contains all alleles that do not include .
The following lemma builds on Lemma 1 and computes the sets exactly equal to . The algorithm is similar as before, but the parent nodes also adopt ordering requirements from their children nodes. Note that ordering requirements only ever involve observed indel tracts. Again, the proof for the lemma is straightforward so we omit it here.
For any leaf node , suppose its observed allele is for some , where . Denote its list of ordering requirements as
Now that we’ve shown that can be written in terms of disjoint wildcards and singleton-wildcards, we introduce one more notation that will be useful later. Define to be the singletons from the singleton-wildcards in .
4.6 Likelihood calculation: aggregating states
Here we show how to use the concept of “lumpability“ to calculate the approximate likelihood (6), even when there are an infinite number of ancestral states. Recall that we previously proposed calculating (6), which sums over a subset of all possible ancestral states. Though this has decreased the set of states to sum over, the number of ancestral states under consideration at each tree node is still infinite. Even if we ignore the insertion sequences, the number of possible ancestral states still grows at a rate of where is the number of positions in the unmodified barcode and the likelihood calculation has complexity since we need to construct transition matrices. Since the barcode is 300 nucleotides long in McKenna et al. , we cannot calculate the likelihood by considering all possible states separately.
To handle Markov processes with infinite state spaces, one solution is to partition the states into lumped states and show that the behavior of the original Markov process is equivalent to that of an aggregate Markov process over the lumped states [Kemeny and Laurie Snell, 1976, Hillston, 1995]. This property, called “lumpability,“ is defined as follows.
Let be a continuous time Markov chain with state space . If there exists a partition of and a continuous time Markov chain with state space such that
then is lumpable.
Although lumpability is a well-established technique for Markov chains, the practical difficulty is typically in constructing the appropriate partition [Ganguly et al., 2014].
There is relatively little work on applying these ideas of lumpability in phylogenetics. (The one application in Davydov et al.  calculates the likelihood of a codon model approximately by assuming states are lumpable, even though this is not necessarily true in their model; here we will show that the states are indeed lumpable.) Here we extend lumpability to the phylogenetics setting where we have different partitions of the state space at each tree node. In particular, for some indexing set , define a partition of at every node . Lumpability is only useful for efficient phylogenetic likelihood computation if these partitions are compatible with Felsenstein’s pruning algorithm [Felsenstein, 1981]. For any and node , let be the component of the likelihood for the subtree below for states in partition :
By Felsenstein’s algorithm, we have
For lumpability to be useful, we must be able to show that there exists an easy-to-compute weight function such that
and that for each tree node , is indeed lumpable over the partition . Obviously if the partitions are the same across all tree nodes, then we can just set all weights to one. However we will need to construct a different partition for each tree node for the GESTALT likelihood.
We propose partitioning at tree node based on whether or not the allele is a likely ancestral state (i.e. is it in ) and its target status (Figure 2):
Define the indexing set to be .
For internal tree node , partition the state space into
For leaf node , partition the state space into
To prove that the Markov process over the branch with end node is lumpable with respect to the proposed partition, we show that the instantaneous transition rate from any allele in to the set is the same. Therefore we use to denote the transition rates between the lumped states . The results show that there are two types of transitions between the lumped states, which determines the appropriate formula for calculating . Either the transition corresponds to an observed indel tract and there is only one indel tract that is a valid for transitioning between the lumped states; or the transition corresponds to a masked indel tract, in which case all indel tracts from the possible target tracts are valid transitions between the lumped states.
Suppose Assumption 2 holds. Consider any branch with child node , and target statuses where the sets and are nonempty. For any alleles , we have
If the only transition from an element in to is via the unique indel that deactivates the targets , then
where is defined in Assumption 2. Otherwise, we have
The instantaneous transition rates for an allele to the set is
If is an indel tract that can be introduced to the allele and has target status , then we can introduce the same indel tract to any other allele and will also have the target status . Therefore we have proven that (19) must hold for all .
To calculate the hazard rate between these lumped states, we rewrite the summation by grouping indel tracts with the same target tract:
One of the following two cases must be true:
From the decomposition (13) of , there is only one indel tract in the sets for such that for all . cannot be from a wildcard or the inner wildcard of a singleton-wildcard since this would contradict the fact that is the only indel tract in such that for all . Therefore must be the singleton for some singleton-wildcard . In other words, the only possible transition from to is via the indel tract .
Otherwise, for some target tract , there are at least two indel tracts in in the sets for that deactivate targets such that and for all . In this case, and must be from a wildcard or the inner wildcard of a singleton-wildcard ( and cannot both be from a singleton of a singleton-wildcard since ). Therefore every indel tract in satisfies for all .
Therefore (20) simplifies to
Note that to construct the entire instantaneous transition rate matrix of the aggregated process, we can easily calculate the total transition rate away from a target status and then calculate the transition rate to sink state using the fact that each row sums to zero. The transition rate away from is zero. ∎
We are finally ready to combine lumpability with Felsenstein’s pruning algorithm. The following theorem provides a recursive algorithm for calculating (6), using results from above.
Suppose the above model assumptions hold. Consider any tree node , target status , and nonempty allele group . Denote
If is an internal node, then
where is calculated using the instantaneous transition rates given in Lemma 3.
Since each node may have a different partition of the state space, we compute a separate instantaneous transition rate matrix for each branch. If we have multiple barcodes, we need to compute a separate matrix for each branch and for each barcode. Calculating the likelihood and its gradient can therefore become memory-intensive when there are many branches and/or barcodes. One way to reduce the amount of memory is to sum over subsets of instead. This is often reasonable since it is unlikely for the barcode to have many hidden events. For the analyses of the zebrafish data, we only sum over states that can be reached when at most one masked indel tract occurs along each branch. If there are more than 20 such states at a node, we only sum over the possible states when no masked indel tracts occur along that branch.
4.7 Caterpillar trees
As discussed in the main manuscript, we resolve the multifurcations in the tree as caterpillar trees to estimate the ordering of events. Recall that a caterpillar tree is a tree where all the leaves branch off of a central path, which we call the “caterpillar spine.“ Thus for each multifurcating node with children nodes , we resolve the multifurcation as a tree where the children nodes branch off of the caterpillar spine (Figure 2(b)). We do not consider all possible resolutions of the multifurcations since there are a super-exponential number of them and we likely do not have enough information to choose between all the possible trees (recall that they are parsimony-equivalent).
We need an efficient method to select the best ordering in each caterpillar tree since the number of possible orderings for children nodes is , which is also super-exponential. Since it is computationally intractable to calculate the likelihood for each ordering, we take an alternate approach where we introduce another approximation of the likelihood. This approximate likelihood can be computed using the same mathematical expression regardless of the ordering of the children nodes, which means we can tune over all possible orderings in the caterpillar trees by solving a single continuous optimization problem.
We approximate the likelihood by considering only the mutation histories that have a constant allele along the caterpillar spines:
To construct a mathematical expression for (23) that is independent of the ordering along caterpillar trees, we first re-parameterize the branch lengths for children of multifurcating nodes. For each child of a multifurcating node, let indicate the distance between the child node and the multifurcating node and indicate the proportion of distance that is located on the caterpillar spine (Figure 13). We can capture all possible orderings in a caterpillar tree by varying the values of these two sets of parameters across the children of a multifurcating node.