Fast and Scalable Inference of Multi-Sample Cancer Lineages
Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (ineage nference for ancer eterogeneity and volution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of SSNVs obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open-sourced and available at http://viq854.github.io/lichee.
]\initsVP\fnmVictoria \snmPopic ]\initsRS\fnmRaheleh \snmSalari ]\initsIH\fnmIman \snmHajirasouliha ]\initsDK\fnmDorna \snmKashef-Haghighi ]\initsRW\fnmRobert B \snmWest corref=aff1, email@example.com ]\initsSB\fnmSerafim \snmBatzoglou
Department of Computer Science, Stanford University, Stanford, CA
Department of Pathology, Stanford University School of Medicine, Stanford, CA
Cancer is driven by the accumulation of somatic mutations that confer fitness advantages to the tumor cells. Numerous studies have shown tumors to be highly heterogeneous, consisting of mixtures of cell subpopulations with distinct sets of somatic variants (for example see review papers [1, 2]). With the advent of next-generation sequencing technologies, many large-scale efforts are underway to catalog the somatic mutational events driving the progression of cancer [3, 4] and infer the phylogenetic relationships of tumor subclones. Characterizing the heterogeneity and inferring tumor phylogenies are key steps for developing targeted cancer therapies  and understanding the biology and progression of cancer.
In order to reconstruct tumor phylogenies, studies have utilized variant allele frequency (VAF) data of somatic single nucleotide variants (SSNVs) obtained by whole-genome [6, 7], exome , and targeted deep sequencing [6, 9]. Clustering of SSNVs based on VAF similarity [10, 11, 12] and detection of copy number aberrations, while accounting for variable sample purity [8, 13, 14], have been used to differentiate and order groups of mutational events. While many evolutionary studies of cancer have focused on single-sample intra-tumor heterogeneity , several studies have also compared multiple tumor samples extracted from a single patient either at different points in time during cancer progression [16, 17, 18] or from different regions of the same tumor or its metastases [7, 19, 20, 21, 22, 23]. In multi-sample approaches, the patterns of SSNV sharing (i.e., distinguishing somatic mutations that are omnipresent, partially shared, or private among the samples) can serve as phylogenetic markers from which lineage trees are reconstructed . On the basis of the lineage trees, the evolutionary timing of each mutational event can then be inferred with high confidence [7, 19, 25, 17].
Most existing multi-sample studies with a relatively small number of SSNVs infer the tumor phylogenies manually by analyzing SSNV VAFs and presence patterns across samples [7, 22, 26]. Several other studies used implementations of traditional phylogeny reconstruction methods, such as neighbor joining with Pearson correlation distances , or maximum parsimony  on patterns of somatic mutational sharing across samples. However, in order to scale to datasets comprised of large numbers of samples per patient and extract fine-grained SSNV timing information, as well as handle sample heterogeneity, which traditional tree-building techniques are not designed to do, specialized computational approaches need to be developed for tumor cell lineage reconstruction. Two recent computational methods, SubcloneSeeker  and PhyloSub , have been developed to address this need. The method SubcloneSeeker requires as input clusters of variant cell prevalence (CP) estimates that need to be obtained by processing the data using other existing tools. Given the CP clusters, the method generates all possible subclone structures in each sample separately. The per-sample solutions are then trimmed by checking their compatibilities during the merge step. However, the merge step is designed to check compatibilities of two tumor samples only (e.g. relapse/primary tumor sample pairs that are common in clinical studies) and it cannot merge the subclone structures of more than two samples. Currently it only reports which sample trees are compatible across the given pair of samples. The method PhyloSub performs reasonably on samples with very few mutations that form simple (chain) topologies; however, it produces unsatisfactory results on larger multi-sample datasets, such as  (see Appendix A for details).
In this work we introduce LICHeE (ineage nference for ancer eterogeneity and volution), a novel computational method for the reconstruction of multi-sample tumor phylogenies and tumor subclone decomposition from targeted deep sequencing SSNV datasets. Given SSNV VAFs from multiple samples, LICHeE finds the set of lineage trees that are consistent with the SSNV presence patterns and VAFs within each sample and are valid under the cell division process. Given each such tree, LICHeE provides estimates of the subclonal mixtures of the samples by inferring sample heterogeneity simultaneously with phylogenetic cell lineage tree reconstruction. LICHeE is able to search for lineage trees very efficiently by incorporating the SSNVs into an evolutionary constraint network that embeds all such trees and applying VAF constraints to reduce the search space. LICHeE runs in only a few seconds given hundreds of input SSNVs and does not require data preprocessing.
We demonstrate that LICHeE is highly effective in reconstructing the lineage trees and sample heterogeneity by evaluating it on simulated trees of heterogeneous cancer cell lineage evolution, as well as on three recently published ultra-deep sequencing multi-sample datasets of clear cell renal cell carcinoma (ccRCC) by Gerlinger et. al , high-grade serous ovarian cancer (HGSC) by Bashashati et. al , and breast cancer xenoengraftment in immunodeficient mice by Eirew et. al , for which single-cell validation results are also available. LICHeE found unique trees for each ccRCC patient and for all except one patient (for which multiple valid trees were found) of the HGSC study. For the ccRCC dataset, LICHeE trees were nearly identical to the published trees, which are the result of a multi-step thorough analysis of the data, involving SSNV calling, clustering, and tree-building using maximum parsimony. For the HGSC dataset, LICHeE improved on the results reported by the study, producing trees with better support from the data. LICHeE also revealed additional heterogeneity in the samples of both studies. In particular, LICHeE identified subclones in one more sample of the ccRCC study (in addition to the reported six samples) and three samples of the HGSC study, all supported by the data. Finally, the trees reconstructed by LICHeE on the xenoengraftment dataset can be highly-validated by the single-cell analysis presented in the paper. LICHeE is open-source and freely distributed at , and includes an intuitive graphical user interface (GUI) that may aid users to perform quality control on the output trees as well as interpret the trees biologically.
Results and Discussion
Overview of the Multi-Sample Cancer Phylogeny Inference Method, LICHeE
LICHeE is a method designed to reconstruct cancer cell lineages using SSNVs from multiple related normal and tumor samples of individual cancer patients, allowing for heterogeneity within each sample. Given a set of validated deeply sequenced SSNVs, LICHeE uses the presence patterns of SSNVs across samples and their VAFs as lineage markers by relying on the perfect phylogeny model . This model assumes that mutations do not recur independently in different cells; hence, cells sharing the same mutation must have inherited it from a common ancestral cell. This assumption can be used to derive the following SSNV ordering constraints. Firstly, (1) a mutation present in a given set of samples cannot be a successor of a mutation that is present in a smaller subset of these samples, since it could not have arisen independently by chance in the additional samples. Similarly, (2) a given mutation cannot have a VAF higher than that of its predecessor mutation (except due to CNVs), since all cells containing this mutation will also contain the predecessor. Finally, (3) the sum of the VAFs of mutations disjointly present in distinct subclones cannot exceed the VAF of a common predecessor mutation present in these subclones, since the subclones with the descendent mutations must contain the parent mutations (this constraint is formally defined in the Methods section). These constraints provide key information about the topology of the true cell lineage tree and are leveraged by LICHeE to define the search space of the possible underlying lineage trees and evaluate the validity of the resulting topologies. The final goal of LICHeE is to find phylogenetic trees encoding an evolutionary ordering of the input SSNVs that does not violate any of these three constraints.
At a high level, the LICHeE algorithm can be broken down into the following main steps (Figure 1). First, LICHeE partitions SSNVs into groups based on their occurrence in each sample, such that each group stores all the mutations that were called in the same subset of samples. To separate subclone lineages, the SSNV members of each resulting group are then further clustered based on their VAFs, such that SSNVs with similar VAFs across samples are clustered together. The final lineage tree needs to provide a valid ordering of these resulting SSNV clusters (i.e. an ordering that does not violate the three constrains defined above). In order to find such a tree, we construct an evolutionary constraint network, which encodes whether a given cluster of SSNVs could have preceded another, for each pair of clusters. More specifically, this network is an acyclic directed graph (DAG) that has SSNV clusters as its nodes and whose edges encode possible predecessor relationships among the nodes’ mutations (i.e. an edge denotes that the mutations of a given pair of clusters satisfy ordering constraints (1) and (2)). This network greatly reduces the search space of possible valid trees and allows us to formulate the task of inferring such trees as a search for spanning trees of the network that satisfy constraint (3) (within a given error margin), which ensures that the reconstructed trees are composed of parent-daughter edges that exhibit somatic VAF consistency with the cell lineage expansion. If multiple valid lineage trees are found during the search, the trees are ranked based on how well they support the cluster VAF data, the top-ranking tree minimizing the use of the permitted error margin (see Methods for details). Finally, as shown in Figure 1, the leaves of the resulting trees are the individual samples (added post-search), whose composition can be reconstructed by tracing back their respective subclone cell lineages in the tree. We detail each of these steps in Methods.
Since finding true SSNV groups is a crucial step of the algorithm, it is important to minimize false positive and false negative SSNV calls across samples. However, accurately detecting SSNVs in each sample is a challenging task due to high levels of noise in the data, which can come from various sources, such as sequencing errors, systematic amplification bias, mapping errors, and sample impurities. Multiple techniques have been developed to address this problem to date [33, 34, 35, 36, 37], many employing a Bayesian approach to model the distributions of noise and true genotypes in matched-normal samples. LICHeE can work with variant calls produced for each sample by any specialized existing method. However, it does not require the users to pre-process the data using these tools, providing its own heuristic mechanism to call SSNVs using the multi-sample VAF data. At a high level, it first finds SSNVs that can be called reliably in each sample using two hard thresholds and , above and below which, respectively, the SSNV are considered robustly present or absent. Then, assuming that the presence patterns of such SSNVs capture most of the topology of the true underlying evolutionary tree, it uses this inferred tree to inform the group assignment of the SSNVs whose VAF falls in between the thresholds (the ’greyzone’).
Currently LICHeE does not automatically detect or incorporate the CNVs explicitly into the model, although the method can still find valid phylogenies even in the presence of SSNVs within such regions (for example, each patient in the ccRCC dataset had numerous variants from CNV regions). In order to address this limitation, LICHeE also accepts cell prevalence (CP) values instead of VAFs, which can be computed by several recently-developed tools, such as PyClone , ABSOLUTE , and ASCAT , and account for CNVs, LOH status, and sample purity. The same algorithmic steps can then be directly applied to CP values of each variant. It can be easily seen that the three perfect phylogeny ordering constraints still hold for CP values and can be used to search for the underlying lineage tree. Furthermore, in order to support outputs of specialized clustering approaches (e.g. PyClone ), LICHeE also accepts already computed clusters of mutations (with given CP and VAF-based centroid values) and uses these clusters as nodes of the phylogenetic constraint network.
We evaluated LICHeE on three recently published ultra-deep sequencing multi-sample datasets of clear cell renal cell carcinoma (ccRCC) by Gerlinger et. al , high-grade serous ovarian cancer (HGSC) by Bashashati et. al , and breast cancer xenoengraftment by Eirew et. al , as well as on simulated trees of heterogeneous cancer cell lineage evolution. On the ccRCC dataset, LICHeE constructs near-identical trees to the trees published in the study. We show that the interesting difference in the topology of one tree arises due to potential heterogeneity of a sample that cannot be discovered using the traditional maximum parsimony approach and analyze when this approach can fail to detect existing sample subclones. For each patient LICHeE finds a unique valid tree. On the HGSC datasets we show that the trees generated by LICHeE are better supported by the data and demonstrate why applying neighbor joining with Pearson correlation distance metric, used by the study, might not be suitable for cancer datasets. LICHeE finds a unique valid tree for all patients except Case5. Finally, we show that the trees inferred by LICHeE on the xenoengraftment dataset are consistent with the single-cell analysis done by the study. On each patient dataset LICHeE takes only a few seconds to run.
Lineage Tree Reconstruction on ccRCC Data
The ccRCC study by Gerlinger et. al  validated 602 nonsynonymous nucleotide substitutions and indels from multiple samples of 8 individuals. It used VAF-based clustering of each sample to detect subclones prior to determining the variant presence patterns used in tree reconstruction. The phylogenetic trees were then reconstructed using maximum parsimony, revealing a branched pattern of ccRCC evolution in all tumors. Figure 2 juxtaposes the trees produced by LICHeE with the trees presented by Gerlinger et. al  (for details about the parameters used by LICHeE and the ccRCC dataset see Appendix B). The trees presented in the figure have been redrawn using AI. We can see that the trees generated by LICHeE are topologically identical (consisting of the same branches) to the published trees for patients EV005, EV007, RMH002, RMH008, and RK26. Furthermore, LICHeE identified subclones in all the samples reported to be heterogeneous by the study. In particular, it identified the following regions as a mixture of two subclones: R6 in EV005 (with frequencies of 0.29 and 0.04), R3 and R9 in EV007 (with frequencies of 0.15 and 0.03 and 0.21 and 0.03, respectively), R4 and R6 in RMH008 (with frequencies of 0.19 and 0.14 and 0.21 and 0.15, respectively), and R5 in RK26 (with frequencies of 0.15 and 0.03). These subclones correspond to the dominant (dom) and minor (min) shown in the published trees. Evidence supporting each subclone can be analyzed using the presented trees.
The trees generated for patients EV003 and EV006 also highly match the published results. For EV006, the LICHeE-generated tree does not contain the following two partially shared groups: (R2, LN1a, and LN1b) and (R3, R1, R4, R7, R15). For the first group we find no evidence in the data – no SSNVs are shared in these three samples and absent from the others. We do find one mutation that supports the second group. Because, by default, LICHeE eliminates nodes that have evidence from only one SSNV, this group is not shown in our tree.
The RMH004 dataset contains three partially overlapping groups (R3, VT, R10, R4, R2), (R3, VT, R10, R4, R8), and (R10, R4, R2, R8). These groups represent separate branches where their lowest common ancestor is the group with mutations present across all samples (R3, VT, R10, R4, R2, R8). However, the VAF of this parent group in sample R10 is 0.34, while the average VAF across each of the three groups is 0.32, 0.27, and 0.21, respectively. Therefore, no more than one of these groups can be a descendant of the parent group, without violating the VAF phylogenetic constraint. In order to generate a valid tree, the two least populated conflicting branches among the three are removed from the dataset. Similarly, these two groups are ignored by the maximum parsimony algorithm and are not present in the published tree. The difference between our tree and the published one comes from the mixed lineage we observe in sample R4. Due to the VAF of 0.17 of the group of R4 private mutations, the private group is not a descendant of the group shared by R4 and R2 since the average VAF of that node is 0.06, which is too small to be a parent to 0.17. Therefore, our method suggests two different subclones in R4. Since the group (R2, R4) is small (total of three mutations) and has a low frequency in R4, the evidence of the two subclones cannot be considered very strong; however, it does constitute a signal in the data for the R4 mixed lineage possibility.
While the ccRCC study does perform VAF clustering of each sample to find subclones at the onset, it runs the phylogenetic reconstruction using the presence pattern profiles only. On the other hand, our tree reconstruction with LICHeE applies the VAF constraint to the resulting tree topologies and clusters SSNVs in each group based on their VAFs across all the samples. For patient RMH004, applying the VAF constraint, reveals additional sample heterogeneity, for which we produced a potentially improved tree compared to the tree reported using maximum parsimony. It can also be shown that clustering across all the samples rather than each sample individually, is a more appropriate approach for revealing the heterogeneity in the data. For example, if two subclones occur with high and low VAFs in one sample but are uniform in another sample, single sample clustering will detect them only in the sample where they differ. For example, if the two subclones in samples R3 and R9 of patient EV007 (discussed above) had highly similar VAFs in these samples, clustering would not be able to differentiate the subclones. On the other hand, LICHeE would still be able to detect the mixed lineages in the two samples due to the presence of groups (R1, R2, R3, R5, R6, R7, R9) and (R3, R4, R9) in the data, which must form divergent branches in the lineage tree.
Lineage Tree Reconstruction on HGSC Data
We further evaluated LICHeE on the HGSC dataset from the study by Bashashati et. al. . This study validated 340 somatic mutations from 19 tumor samples of six patients. It used neighbor joining with Pearson correlation distances (computed on the binary sample presence patterns) to infer lineage trees. The trees generated by LICHeE juxtaposed with those presented in the paper are shown in Figure 3 (for details about the parameters used by LICHeE and the HGSC dataset see Appendix B). In four out of six cases (Case2, Case3, Case 4, and Case6) the possible tree topologies are very simple and the trees produced by the two methods are unsurprisingly highly similar. Below we discuss the difference between the two remaining trees.
For Case1 the tree reported in the paper suggests that sample d diverged first, followed by c. This suggests that there should be a group of mutations shared exclusively between samples a and b and another group shared between samples a, b, and c. However, examining the dataset VAF values, as well as the results of the binomial test determining SSNV presence in the samples used by this study, we found no evidence for these two groups. On the other hand, the tree produced by LICHeE suggests the presence of mutations shared by samples a, b, and d only, which we found both in the results of the study’s binomial test and by applying the LICHeE hard threshold caller.
For Case5, the study reports an early divergence of sample c. This suggests that there should be mutations shared between all the samples except c. We have confirmed that no such mutational profile exists in the data. On the other hand, the data shows the presence of mutations that exist in samples a, b, c, e, and f but not in d that cannot be supported by the reported tree. The reason why the neighbor-joining algorithm of the study chose sample c as the first diverging branch of the tree must be because of the large presence of private mutations in sample c, which led to a low Pearson correlation (and hence a greater distance) with the profiles of other samples. This shows that using the Pearson correlation metric is not suitable for this data. Furthermore, directly applying traditional phylogeny reconstruction techniques (e.g. neighbor joining) cannot reveal sample heterogeneity.
The tree produced by LICHeE for Case5, reveals mixed lineages in samples b, e, and f. Interestingly, LICHeE detects two clusters in group (a, b, c, e, f) with mean VAFs of [0.29, 0.34 0.3, 0.19, 0.4] and [0.17, 0.14, 0.13, 0.1, 0.13], respectively. The two subclones in each sample are then produced by the divergence of the higher VAF cluster into two branches: one branch containing the lower VAF cluster and the other branch containing the group of mutations (e, f, b). While the presence of group (e, f, b) is weak (supported by three mutations only), the presence of group (b, f) is substantial (supported by 14 mutations) and provides strong evidence for the mixed lineage of b and f. Furthermore, due to the presence of group (a, b), the group (b, f) cannot be assigned as a child of the low VAF cluster (a, b, c, e, f) without violating the VAF phylogenetic constraint. However, since there are only three mutations in group (a, b), eliminating this group may be a reasonable alternative reconstruction. Therefore, multiple topologies are possible in this scenario, each supported by varying degrees of evidence. LICHeE presents the tree that best fits the data given input parameters of expected noise level and minimum mutational support needed for a node. Using LICHeE with different parameters and interacting with the trees can allow users to explore other evidence existing in the data. It is important to point out that, as opposed to the results of the neighbor-joining algorithm that produces trees with lack of evidential support in some branches, each branch in a tree reported by LICHeE reflects the presence of the corresponding SSNV groups in the data.
Lineage Tree Reconstruction on Xenoengraftment Data
The study by Eirew et. al.  used deep-genome and single-cell sequencing to evaluate the clonal dynamics of xenoengraftment of breast cancer tissue into immunodeficient mice. The study applied PyClone  to infer mutational clusters representing clonal genotypes and then validated these results using single-cell analysis of two cases, SA494 and SA501. The study used Bayesian phylogenetic inference to reconstruct the evolutionary relationships between the single-cell nuclei. We applied LICHeE to the deep-sequencing VAF data of this study. We then compared the results of LICHeE to the single-cell phylogenetic trees and clonal genotypes for both SA494 and SA501.
Single-cell phylogenetic inference of SA501 passages X1, X2, and X4 reveals an ancestral genotype that branched into two sibling clades A and B. Sequential acquisition of additional mutations in clade B then gave rise to genotypes C, D, and E. Samples X1 and X2 were found to be a mixture of clones with genotypes A, B, C, and D; while, sample X4 was found to be dominated by genotype E and did not contain clones with genotype A (see Figure 2 in ). As can be seen in Figure 4a, the tree reconstructed by LICHeE mirrors the phylogeny and sample compositions revealed by the study. In particular, it presents the same two sibling clades derived from the ancestral genotype. Samples X1 and X2 both contain subclones with genotype A (with a higher percentage of this genotype in sample X2, as confirmed by the single-cell analysis), while this genotype is missing from sample X4. Similarly, genotype E is found to be private to and dominates sample X4. The only difference in the LICHeE tree is the collapse of genotypes B and C into one cluster; however, examining the bulk CP values reported by the study, we can see that the CP values of these genotypes are highly similar in the three given samples (X1, X2, and X4) and, as a result, cannot get subdivided into two separate clusters during the clustering step (note: the PyClone analysis of the study was simultaneously performed on three additional samples T, X3, and X5, which showed higher CP value differences).
The single-cell analysis of SA494 samples T and X4 reveals two clades, with mutually exclusive sets of mutations, emerging from an ancestral clone present in both samples (see Extended Data Figure 3 in ). LICHeE reconstructs the exact same topology (Figure 4b), showing two groups of mutations private to T and X4, respectively. In accordance with the results reported by PyClone, LICHeE also finds two clusters of mutations private to X4 (with the descendant cluster present in about 20% of the sample).
We developed a cancer cell lineage simulator to better asses the performance of LICHeE. Our simulator models cancer evolution from normal tissue producing a branching hierarchy of monoclonal cell populations in accordance with the branched-tree cancer evolution model [39, 26, 40]. Starting with the normal cell population, the simulator iteratively expands the cell lineage tree by introducing (with some given probability) new daughter cell populations corresponding to newly acquired SSNV or CNV events. In particular, in every iteration, each cell population present in the tree can give rise to a new population of cells (with a given randomly generated size) representing a new SSNV with probability or a CNV event with probability . Each cell population can also undergo a cell death event with probability . Each simulated SSNV is randomly associated with a genome location (chromosome and position) and haplotype; the CNVs are associated with a chromosome arm and haplotype and correspond to a duplication of this chromosome arm. For the evaluation, we generated 100 lineage trees, each expanded over 50 iterations, with the following parameters: ; , , and ; and . This process results in lineage trees with an arbitrary number of branches and nodes (several hundred to thousands of nodes on average). Figure 5 illustrates one such lineage tree.
Multiple samples are then collected from each lineage tree. Each sample consists of several cell populations (nodes) of the tree, where each such cell population represents a subclone in the sample. We implemented two sampling schemes: sampling and sampling. The randomized sampling process selects a random subset of nodes from the tree for each sample; on the other hand, the localized sampling process is meant to mimic biopsies from spatially distinct sites and selects nodes such that samples mostly contain subclones from distinct branches of the simulated tree. Localized sampling for samples is achieved by selecting disjoint subtrees in the simulated tree using an approach based on breadth-first search, which finds disjoint subtree roots as high on the tree as possible (if disjoint subtrees do not exist in the tree, the maximum number of subtrees is used and some samples are obtained from the same subtree in a round-robin fashion). Figure 5 illustrates the localized sampling procedure for 10 samples.
Given a selected subset of cell populations, the sample is then created by obtaining a fraction of the cells from each population by sampling from a multinomial distribution with probabilities corresponding to the cell population sizes. For randomized sampling, we select up to 5 subclones for each sample. For localized sampling, there can be up to 5 subclones from the same distinct subtree, exactly one subclone from a neighboring subtree, and some fraction of normal cells to represent normal contamination (the fraction is randomly selected to be from 0 up to 20% of the sample). In order to determine how the performance of LICHeE degrades as the number of SSNVs located in CNV regions grows, we ran several experiments with increasing numbers of CNV events. In particular, we set = [0, 0.1, 0.18], which resulted in a total of 0%, 65%, and 80% of SSNVs to be affected by CNVs in the collected samples, respectively. Given the sample cell population counts, we can compute the VAF of each SSNV in this sample. In the absence of CNVs, the VAF of a given SSNV is simply the fraction of the cells containing the SSNV out of all the cells in the sample. When CNVs are present, we count the number of haplotypes containing the SSNV and the reference allele across all cells of the sample. Given an SSNV, we consider the chain of mutations present in each cell population affecting its genome position. Let be the number of haplotypes containing the SSNV in population and be the number of haplotypes containing the reference allele. The VAF of a given SSNV is then:
where is the total number of selected populations and is the number of cells selected from population .
Finally, given the true VAFs of each SSNV, we add sampling and sequencing noise to each value. In particular, in order to simulate reads covering each SSNV position, we sample the VAFs from the Binomial distribution , where is the true VAF of each SSNV and is the total simulated read coverage (100X, 1,000X, and 10,000X); the generated frequencies have a mean and variance . We simulate a Q30 (1 in 1000) base call sequencing accuracy.
Given the simulated VAFs, we first assessed the performance of LICHeE in SSNV calling. Table 1 presents LICHeE’s sensitivity in calling mutations across the samples (i.e. the number of SSNVs with correctly identified sample presence patterns out of all the simulated SSNVs) given true VAF values and for coverages of 100x, 1,000X, and 10,000X. As expected, higher coverage results in higher sensitivity. The method achieves sensitivity across all the experiments.
Next we compared the topology of the reported lineage trees and the simulated trees and measured LICHeE’s accuracy in reconstructing the ancestor-descendant and sibling relationships. For every pair of mutations in the simulated tumor hierarchy, we checked if LICHeE preserved the relationship between them in the generated top tree. More specifically, we checked for the following two types of violations: 1) if an ancestor-descendant relationship was inverted or became a sibling relationship, and 2) if a sibling relationship became an ancestor-descendant relationship. Tables 2 and 3 summarize the accuracy of these metrics. Since LICHeE may remove nodes from the network representing non-robust SSNV groups (as described in Methods) if no valid trees are found during the search, not all of the simulated mutations will be present in the final tree. We report the percentage of the SSNVs and of simulated ancestor-descendant (AD) and sibling (Sib) mutation pairs that are present in the tree. As expected, the number of mutations present in the trees decreases with higher numbers of input samples, lower coverage, and the presence of CNVs (which can significantly alter the VAF values, causing the violation of the VAF ordering constraints). In particular, with a low coverage of 100X and 15 samples, only 81% of SSNVs are preserved in the tree. Therefore, LICHeE is best applied to data with higher coverage when the number of input samples is high. For instance, with a higher coverage of 1,000X, 91-94% of SSNVs and 83-88% of mutation pairs are present in the trees with 15 samples (although this metric drops with the presence of CNVs to 91-92% SSNVs and 66-70% SSNV pairs).
Since LICHeE groups mutations with the same presence patterns across samples and similar VAFs, only the mutations occurring in a different set of samples or with significantly different VAFs will be placed in distinct nodes of the tree. We report the percentage of such ancestor-descendant pairs (AD-Ord) in Tables 2 and 3. We then evaluate how many ordered mutations preserved the correct ancestor-descendant relationship (AD-Corr). Across all the experiments without CNVs, we get 99-100% correctness (with less than 1% of pairs being in reversed order). We see 92-96% correctness in experiments with 80% of SSNVs located in CNV regions and 1000X coverage. AD mutations that were not ordered or grouped in the same node, will be siblings in the reconstructed tree (AD Sib). We find that the vast majority of such mutation pairs involve private mutations, whose placement in the tree is usually under-constrained (i.e. multiple tree nodes can serve as ancestors to such mutation groups). Finally, we can see that the reverse violation of sibling mutations being placed into ancestor-descendant nodes (SibAD) is also very rare (up to 7% across all the experiments). Therefore, we conclude that the trees reconstructed by LICHeE provide highly accurate ordering of the mutations in its nodes.
LICHeE has been designed to automatically infer cell lineages of multiple tumor samples and the sample subclone decomposition. Our analysis shows that LICHeE is highly effective in reconstructing the phylogenies and uncovering the heterogeneity of previously published datasets and in simulations, improving not only upon traditional tree-building methods, but also on recent developments specialized for cancer data. Currently LICHeE works with deep sequencing data that provides VAF estimates with low variance (as well as on CP values that can be obtained from existing tools and can correct for SCNAs, LOH, and sample purity). SSNV data obtained from deep whole-genome sequencing, targeted resequencing of informative SSNVs, or exome sequencing in tumors with a high degree of somatic SSNVs present in exomes, should be appropriate inputs to LICHeE. Several directions for future work are open: extension of this method to lower coverage whole-genome sequencing data and incorporation of aneuploidies and large CNVs directly into the model.
Grouping and Clustering SSNVs
When partitioning SSNVs into groups based on sample occurrence, each SSNV is first associated with a binary sample profile denoting its presence or absence in each sample. Given samples from an individual, the binary profile is defined as a binary sequence of length where the bit is set to 1 if this SSNV is called in the sample, and is 0 otherwise (e.g. given 5 samples, an SSNV with the profile 01011 is called in samples 2, 4, and 5). SSNVs with the same profile are assigned to the same SSNV group (e.g. a group with the profile 01101 will contain all the SSNVs occurring in samples 2, 3, and 5). The group with the profile consisting entirely of 1s will contain SSNVs that occur in all the samples and are, therefore, germline variants (assuming that the sample set includes a normal control sample).
The SSNV binary profiles can be passed as input or computed from SSNV VAFs as follows. First two hard VAF thresholds, and are used to determine if an SSNV is robustly present or absent from a sample. An SSNV profile is classified as robust if its VAF is above or below the two thresholds, respectively, and if at least a minimum number of other robust SSNVs (default set to one) have the same binary profile. All other SSNVs are considered non-robust and are assigned to a group as follows. Given a non-robust SSNV , its VAF across samples can either fall below (marked 0), above (marked 1), or in between (marked *) the thresholds and , resulting in a profile such as 01*11. The candidate groups, to which can be assigned, must have an identical profile in all the samples that are marked 0 or 1 (e.g. for profile 01*11 two valid candidate groups are 01111 and 01011). Since can be assigned to more than one target group, we consider that the group containing a robust SSNV that is most similar in VAF to is the best candidate. The following metric is used to compute the similarity between two SSNVs and :
where is the VAF of in sample . If the maximum similarity is higher than a given threshold, is assigned to the group of . Unassigned non-robust SSNVs will form new profile groups. We minimize the number of such new groups by formulating this task as a set cover problem. In particular, let be the set of all unassigned SNVs. Denote as the set of subsets of , where each subset represents mutations that can be assigned to the same potential target group. The target groups are a list of all possible binary profiles that the SSNVs can be assigned to, obtained by substituting all *s by 0 or 1. We want to find the minimum number of target groups (i.e., smallest subset of ), which cover all mutations in . This problem is known to be NP-complete and searching for the exact solution is not feasible. Instead we apply the standard greedy algorithm by choosing (at each stage) the target group that covers the largest number of non-robust SSNVs. For SSNVs whose targets are not supported by any other mutations, we convert each * to 1 or 0 depending on whether the VAF is closest to or , respectively.
Once the SSNVs are partitioned into groups, the SSNVs of each group are further clustered based on their sample VAFs. Each SSNV group is associated with a matrix of VAFs of size , where is the number of SSNVs in this group and is the number of represented samples (e.g. for a group with the profile 0110001). The EM clustering algorithm for GMMs is run on the resulting VAF matrix using . The result is a set of SSNV clusters with an associated VAF centroid vector, . To handle the high variance in the VAF data due to noise, some of the resulting clusters are eliminated (based on size) or collapsed with neighboring clusters, based on the distance between their centroid vectors.
Evolutionary Constraint Network Construction
Given the clusters of each SSNV group, we construct an evolutionary constraint network to capture valid evolutionary timing relationships between the mutations of each cluster pair. The network is a DAG, where each node corresponds to an SSNV cluster (except the root, which represents the germline) and each edge between two nodes, () denotes that parent node could be an evolutionary predecessor of child node (i.e. that SSNVs in cluster could have ”happened before” SSNVs in cluster ). In particular, an edge () is added only if the nodes satisfy the following two constraints (which guarantee that the network will be acyclic):
where is the VAF noise error margin (note: the error margin is the maximum of the sum of the standard errors for sample of the two clusters and a configurable parameter). In the resulting network, each node will have at least one parent (since all nodes can be connected to the root). We avoid checking all node pairs, by organizing the nodes into levels according to the Hamming weight (i.e. number of 1s) of the binary group profile to which they belong. Nodes that are in the same level and have conflicting binary profiles cannot satisfy the above constraints. Nodes from different levels can only be connected such that the node in the higher level is the parent of the node in the lower level. Finally, for nodes that are in the same level and have the same binary profile, the edge is added in the direction that minimizes , where:
with the indicator function:
Figure 6 illustrates the constraint network produced for the dataset of ccRCC Patient EV007 (described in the Results section).
Phylogenetic Tree Search
By the constraint network construction, a valid lineage tree of the SSNV clusters (i.e. a tree that does not violate the three constraints of the perfect phylogeny model) must be a spanning tree of the network that satisfies the following requirement :
That is, the sum of the VAF centroids of all the children must not exceed the centroid of the parent. We use inequality here since our method does not require all the true lineage branches to have been observed. To tolerate noise in the VAF data, we relax the constraint by allowing the sum of children VAFs to exceed the parent by an error margin . Given such a valid lineage tree, each sample can then be decomposed into subpopulations by enumerating all the paths in the tree starting with the germline root and ending in the last node containing mutations in that sample.
The problem of finding all such trees is equivalent to the problem of finding all spanning trees of the constraint network DAG for which Eqn. (5) holds. We have extended the Gabow and Myers spanning tree search algorithm  to generate all such spanning trees. The original algorithm generates all spanning trees of a directed graph using backtracking and an efficient bridge edge detection method based on DFS. Our extension consists of enforcing Eqn. (5) during the tree search by terminating the expansion of a given tree and backtracking as soon as an edge violating Eqn. (5) is added (see Algorithm 1). Same as the original algorithm, this search runs in time, where is the number of nodes, is the number of edges, and is the number of spanning trees in the network. While the runtime of the program depends on the number of spanning trees in the constraint network, in practice the search is very fast (taking on the order of a few seconds). However, since, in theory, it is possible for the algorithm to take longer on datasets that result in constraint networks with many spanning trees, we provide a bound on the maximum number of lineage trees to generate in order to avoid searches that are too long. Similarly, we also have a high bound on the number of calls to the GROW procedure in Algorithm 1. We expect this scenarios to be very rare in typical validation datasets. To reduce search space we also optionally constrain the placement of private mutation nodes in the constraint network to their closest valid predecessors.
The above search algorithm will find all spanning trees for which Eqn. (5) holds locally at each individual node; however, it is possible that in order to satisfy this constraint, the centroid values are deviated (within ) in a globally inconsistent way. Therefore, in order to enforce consistency, we apply one additional requirement to the trees returned by the search. In particular, we formulate the following quadratic programming (QP) problem to find a set of such that for every sample and node :
If no solution exists, the tree is considered invalid. Since multiple valid lineage trees can be generated, we rank them using the resulting solution, which corresponds to how well each tree fits the VAF data. The top-ranking tree will be the tree with the minimum sum of squared deviations: . Since for certain networks the total number of valid lineage trees can be large, it is impractical to run a QP program on each tree. Therefore, we first rank the trees based on the sum of squared deviations computed locally at each node, and then run QP on the resulting top trees ( is 5 by default).
Multiple lineage trees can support a dataset equally well since the placement of certain nodes in the tree may be ambiguous using perfect phylogeny SSNV ordering constraints only (especially when not all lineage branches have been observed). In these cases, more sophisticated custom evaluation criteria would be required to rank the trees. We do not address defining such criteria in this work. Instead we report all the produced trees to the user ranked by their associated score (the number of trees reported is configurable) and provide a GUI to allow the user to easily explore the differences in the topologies (see the Visualization for details). We expect that under any optimality criteria, the sub-optimal trees can also represent signals in the data of potential biological significance.
It is also possible that no valid lineage trees are found during the search. This can happen if the noise error margin is too narrow or the network contains nodes corresponding to SSNV groups with a misclassified binary profile. In this case the network is adjusted and searched again. Currently the network adjustment procedure will remove one-by-one all the nodes that belong to non-robust SSNV groups (smallest nodes first). Other adjustments, such as increasing the noise error margin, are also possible but not currently implemented.
The constraint network and the phylogenetic trees can be visualized and interacted with in GUI form. The JUNG graph library  is used to generate the resulting graphs. When visualizing lineage trees, each input sample appears as a leaf in the tree and is connected to the nodes that contain SSNVs present in the sample. By clicking on the nodes of a tree, it is possible to obtain additional information about each node. The information displayed about an internal (non-sample) tree node consists of its binary group profile, its cluster centroid and standard deviation vectors, and the list of the SSNVs in its cluster (these SSNVs can be annotated with information from public databases such as COSMIC, TCGA, etc). If a sample leaf node is clicked, the information displayed consists of the lineage of this specific sample obtained by doing a DFS traversal of the tree starting with the germline root. Finally, the user can rearrange the nodes in the tree, as well as remove nodes and collapse nodes (provided they are clusters of the same group). See Figure 7 for several examples. In addition to the GUI, the program reports the number of trees found and the score of the highest-ranking tree. The user can control how many trees to display.
The LICHeE algorithm was implemented in Java. It is open-sourced and freely available online at .
SB is a founding advisor of DNAnexus Inc, and member of the Scientific Advisory Boards of 23andMe and Eve Biomedical.
VP, RS, and SB developed the method. VP implemented the LICHeE program and simulator and drafted the manuscript. RS, SB, DK, IH contributed to the manuscript. VP, RS, SB performed the evaluation and analysis. All authors were involved in discussions.
Authors would like to thank Arend Sidow for valuable discussions and Aaron C. Abajian for contributing to the simulation studies.
VP was supported by the Stanford-KAUST grant. RS and IH were also supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Postdoctoral Fellowships. DK was supported by an STMicroelectronics Stanford Graduate Fellowship. This work was funded by a grant from KAUST to SB.
Appendix A — Evaluation of Phylosub
We ran PhyloSub  on the Gerlinger et al dataset  using default parameters. For each patient in the dataset we compared the top tree structures (by default, three) with the published tree. It is hard to find similarities between the topology of reported trees and the published trees. Therefore, we analyzed the performance of PhyloSub in finding the correct clusters of mutations, which is a preliminary requirement for building the cancer progression pathway. Gerlinger et. al have categorized mutation groups as shared, heterogeneous (partially shared), and private. Generally speaking, we observed that shared mutations were clustered correctly but partially shared mutations were often clustered all together as one group, which violates their presence pattern. As an example, PhyloSub reported a tree for patient EV003 with three mutation groups (see Figure 8 below). The trunk branch (colored blue) includes all shared mutations as well as one private and two partially shared mutations. Second branch (colored green) contains several partially shared mutations and the rest of private mutations with very different presence patterns, and the third branch (colored red) contains three partially shared mutations. Case EV003 has a heterogenous group with 8 mutations shared by R9 and R6 (LONRF2, RHOB, BHLHE40, CMYA5, NOD1, GCC1, SLC5A12, YLPM1), whereas PhyloSub puts two of them in the trunk branch, four in the second, and two in the third brach. Note that, as discussed in the Results section, for EV003 LICHeE was able to find all mutation groups and build the lineage matching the reported tree in Gerlinger et. al dataset. For other cases PhyloSub showed similar performance. We also observed that PhyloSub does not deal with private mutations properly. Often private mutations are either clustered into singleton groups or are distributed into shared or partially shared groups. For example, for six out of eight patients in at least one of the top-trees there are private mutations clustered with shared mutations and located in the trunk branch of the corresponding tree. Note that PhyloSub shows acceptable performance in analyzing samples with very few mutations and simple (chain) topology. However, when tested on more complex multi-sample cancer datasets, the performance of LICHeE is far more superior.
Appendix B — Experimental Details of the ccRCC, HGSC, and Xenoengraftment Comparison
ccRCC. The clear cell renal cell carcinoma (ccRCC) study by Gerlinger et. al  validated 602 nonsynonymous nucleotide substitutions and indels from multiple samples of 8 individuals using ultra-deep amplicon sequencing with an average depth of 400x. The analysis was performed on 587 out of these mutations guaranteeing a minimum coverage higher than 100x. Based on the expected sequencing platform error rate, mutations were called in a sample if their VAF was 0.5% for substitutions and 1% for indels. In accordance with the study, we set the cutoffs to = = 0.005 for calling SSNVs across samples for all patients except RMH004 and required at least two mutations as evidence of a node in the tree. For RMH004 we set = = 0.01 to allow editing, observing that some mutations with the frequencies slightly higher that 0.5% were considered absent in some samples of the study.
HGSC. In the high-grade ovarian cancer (HGSC) dataset from the study by Bashashati et. al. , 19 tumor samples from six patients were used to validate 340 somatic mutations using deep amplicon sequencing (with a median coverage of 5000x). When running LICHeE, we used the hard thresholds of = 0.005 and = 0.01 for all the patients except Case5, where we used = 0.01 and = 0.04 due to a higher level of noise in the data, and required at least three mutations as evidence of a node in the tree. For Case5 we also removed the samples g and h from consideration since they had multiple inconclusive validation results as stated in the manuscript .
Xenoengraftment. The study by Eirew et. al.  used deep-genome and single-cell sequencing to evaluate the clonal dynamics of xenoengraftment of breast cancer tissue into immunodeficient mice. Single-cell analysis was done on passages SA501 (samples X1, X2, and X4) and SA494 (samples T and X4). When running LICHeE on both passages, we used the hard thresholds of = 0.03 and = 0.05 (the value 0.05 was indicated in the study’s supplementary materials); a minimum cluster size of 6 for non-private mutations and 5 for private mutations; and a maximum cluster collapse distance of 0.085.
- Yates, L.R., Campbell, P.J.: Evolution of the cancer genome. Nature Reviews Genetics 13(11), 795–806 (2012)
- Marusyk, A., Almendro, V., Polyak, K.: Intra-tumour heterogeneity: a looking glass for cancer? Nature Reviews Cancer 12(5), 323–334 (2012)
- Bamford, S., Dawson, E., Forbes, S., Clements, J., Pettett, R., Dogan, A., Flanagan, A., Teague, J., Futreal, P.A., Stratton, M., et al.: The cosmic (catalogue of somatic mutations in cancer) database and website. British journal of cancer 91(2), 355–358 (2004)
- Collins, F.S., Barker, A.D.: Mapping the cancer genome. Scientific American 296(3), 50–57 (2007)
- Fisher, R., Pusztai, L., Swanton, C.: Cancer heterogeneity: implications for targeted therapeutics. British journal of cancer 108(3), 479–485 (2013)
- Schuh, A., Becq, J., Humphray, S., Alexa, A., Burns, A., Clifford, R., Feller, S.M., Grocock, R., Henderson, S., Khrebtukova, I., et al.: Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood 120(20), 4191–4196 (2012)
- Newburger, D.E., Kashef-Haghighi, D., Weng, Z., Salari, R., Sweeney, R.T., Brunner, A.L., Zhu, S.X., Guo, X., Varma, S., Troxell, M.L., et al.: Genome evolution during progression to breast cancer. Genome research 23(7), 1097–1108 (2013)
- Carter, S.L., Cibulskis, K., Helman, E., McKenna, A., Shen, H., Zack, T., Laird, P.W., Onofrio, R.C., Winckler, W., Weir, B.A., et al.: Absolute quantification of somatic dna alterations in human cancer. Nature biotechnology 30(5), 413–421 (2012)
- Campbell, P.J., Pleasance, E.D., Stephens, P.J., Dicks, E., Rance, R., Goodhead, I., Follows, G.A., Green, A.R., Futreal, P.A., Stratton, M.R.: Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proceedings of the National Academy of Sciences 105(35), 13081–13086 (2008)
- Nik-Zainal, S., Van Loo, P., Wedge, D.C., Alexandrov, L.B., Greenman, C.D., Lau, K.W., Raine, K., Jones, D., Marshall, J., Ramakrishna, M., et al.: The life history of 21 breast cancers. Cell 149(5), 994–1007 (2012)
- Shah, S.P., Roth, A., Goya, R., Oloumi, A., Ha, G., Zhao, Y., Turashvili, G., Ding, J., Tse, K., Haffari, G., et al.: The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 486(7403), 395–399 (2012)
- Roth, A., Khattra, J., Yap, D., Wan, A., Laks, E., Biele, J., Ha, G., Aparicio, S., Bouchard-Côté, A., Shah, S.P.: Pyclone: statistical inference of clonal population structure in cancer. Nature methods (2014)
- Oesper, L., Mahmoody, A., Raphael, B.J.: Inferring intra-tumor heterogeneity from high-throughput dna sequencing data. In: Research in Computational Molecular Biology, pp. 171–172 (2013). Springer
- Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L.M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., et al.: Titan: inference of copy number architectures in clonal cell populations from tumor whole-genome sequence data. Genome research, 180281 (2014)
- Hajirasouliha, I., Mahmoody, A., Raphael, B.J.: A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioinformatics 30(12), 78–86 (2014)
- Ding, L., Ley, T.J., Larson, D.E., Miller, C.A., Koboldt, D.C., Welch, J.S., Ritchey, J.K., Young, M.A., Lamprecht, T., McLellan, M.D., et al.: Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature 481(7382), 506–510 (2012)
- Landau, D.A., Carter, S.L., Stojanov, P., McKenna, A., Stevenson, K., Lawrence, M.S., Sougnez, C., Stewart, C., Sivachenko, A., Wang, L., et al.: Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152(4), 714–726 (2013)
- McFadden, D.G., Papagiannakopoulos, T., Taylor-Weiner, A., Stewart, C., Carter, S.L., Cibulskis, K., Bhutkar, A., McKenna, A., Dooley, A., Vernon, A., et al.: Genetic and clonal dissection of murine small cell lung carcinoma progression by genome sequencing. Cell 156(6), 1298–1311 (2014)
- Campbell, P.J., Yachida, S., Mudie, L.J., Stephens, P.J., Pleasance, E.D., Stebbings, L.A., Morsberger, L.A., Latimer, C., McLaren, S., Lin, M.-L., et al.: The patterns and dynamics of genomic instability in metastatic pancreatic cancer. Nature 467(7319), 1109–1113 (2010)
- Yachida, S., Jones, S., Bozic, I., Antal, T., Leary, R., Fu, B., Kamiyama, M., Hruban, R.H., Eshleman, J.R., Nowak, M.A., et al.: Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature 467(7319), 1114–1117 (2010)
- Gerlinger, M., Horswell, S., Larkin, J., Rowan, A.J., Salm, M.P., Varela, I., Fisher, R., McGranahan, N., Matthews, N., Santos, C.R., et al.: Genomic architecture and evolution of clear cell renal cell carcinomas defined by multiregion sequencing. Nature genetics 46(3), 225–233 (2014)
- Green, M.R., Gentles, A.J., Nair, R.V., Irish, J.M., Kihira, S., Liu, C.L., Kela, I., Hopmans, E.S., Myklebust, J.H., Ji, H., et al.: Hierarchy in somatic mutations arising during genomic evolution and progression of follicular lymphoma. Blood 121(9), 1604–1611 (2013)
- de Bruin, E.C., McGranahan, N., Mitter, R., Salm, M., Wedge, D.C., Yates, L., Jamal-Hanjani, M., Shafi, S., Murugaesu, N., Rowan, A.J., et al.: Spatial and temporal diversity in genomic instability processes defines lung cancer evolution. Science 346(6206), 251–256 (2014)
- Hajirasouliha, I., Raphael, B.J.: Reconstructing mutational history in multiply sampled tumors using perfect phylogeny mixtures. In: Algorithms in Bioinformatics, pp. 354–367. Springer, ??? (2014)
- Salari, R., Saleh, S.S., Kashef-Haghighi, D., Khavari, D., Newburger, D.E., West, R.B., Sidow, A., Batzoglou, S.: Inference of tumor phylogenies with improved somatic mutation discovery. Journal of Computational Biology 20(11), 933–944 (2013)
- Gerlinger, M., Rowan, A.J., Horswell, S., Larkin, J., Endesfelder, D., Gronroos, E., Martinez, P., Matthews, N., Stewart, A., Tarpey, P., et al.: Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. New England Journal of Medicine 366(10), 883–892 (2012)
- Bashashati, A., Ha, G., Tone, A., Ding, J., Prentice, L.M., Roth, A., Rosner, J., Shumansky, K., Kalloger, S., Senz, J., et al.: Distinct evolutionary trajectories of primary high-grade serous ovarian cancers revealed through spatial mutational profiling. The Journal of pathology 231(1), 21–34 (2013)
- Qiao, Y., Quinlan, A.R., Jazaeri, A.A., Verhaak, R.G., Wheeler, D.A., Marth, G.T.: Subcloneseeker: a computational framework for reconstructing tumor clone structure for cancer variant interpretation and prioritization. Genome Biol 15, 443 (2014)
- Jiao, W., Vembu, S., Deshwar, A.G., Stein, L., Morris, Q.: Inferring clonal evolution of tumors from single nucleotide somatic mutations. BMC bioinformatics 15(1), 35 (2014)
- Eirew, P., Steif, A., Khattra, J., Ha, G., Yap, D., Farahani, H., Gelmon, K., Chia, S., Mar, C., Wan, A., et al.: Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature (2014)
- Gusfield, D.: Efficient algorithms for inferring evolutionary trees. Networks 21(1), 19–28 (1991)
- Gerstung, M., Beisel, C., Rechsteiner, M., Wild, P., Schraml, P., Moch, H., Beerenwinkel, N.: Reliable detection of subclonal single-nucleotide variants in tumour cell populations. Nature communications 3, 811 (2012)
- Yost, S., Alakus, H., Matsui, H., Jepsen, K., Schwab, R., Frazer, K., Harismendy, O.: Mutascope: a high-sensitivity variant caller dedicated to high-throughput pcr amplicons sequencing. In: BMC Proceedings, vol. 6, p. 22 (2012). BioMed Central Ltd
- Larson, D.E., Harris, C.C., Chen, K., Koboldt, D.C., Abbott, T.E., Dooling, D.J., Ley, T.J., Mardis, E.R., Wilson, R.K., Ding, L.: Somaticsniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics 28(3), 311–317 (2012)
- Koboldt, D.C., Zhang, Q., Larson, D.E., Shen, D., McLellan, M.D., Lin, L., Miller, C.A., Mardis, E.R., Ding, L., Wilson, R.K.: Varscan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome research 22(3), 568–576 (2012)
- Saunders, C.T., Wong, W.S., Swamy, S., Becq, J., Murray, L.J., Cheetham, R.K.: Strelka: accurate somatic small-variant calling from sequenced tumor–normal sample pairs. Bioinformatics 28(14), 1811–1817 (2012)
- Van Loo, P., Nordgard, S.H., Lingjærde, O.C., Russnes, H.G., Rye, I.H., Sun, W., Weigman, V.J., Marynen, P., Zetterberg, A., Naume, B., et al.: Allele-specific copy number analysis of tumors. Proceedings of the National Academy of Sciences 107(39), 16910–16915 (2010)
- Greaves, M., Maley, C.C.: Clonal evolution in cancer. Nature 481(7381), 306–313 (2012)
- Yap, T.A., Gerlinger, M., Futreal, P.A., Pusztai, L., Swanton, C.: Intratumor heterogeneity: seeing the wood for the trees. Science translational medicine 4(127), 127–1012710 (2012)
- Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The weka data mining software: an update. ACM SIGKDD explorations newsletter 11(1), 10–18 (2009)
- Gabow, H.N., Myers, E.W.: Finding all spanning trees of directed and undirected graphs. SIAM Journal on Computing 7(3), 280–287 (1978)
- O’Madadhain, J., Fisher, D., Smyth, P., White, S., Boey, Y.-B.: Analysis and visualization of network data using jung. Journal of Statistical Software 10(2), 1–35 (2005)