A cubic-time algorithm for computing the trinet distance between level-1 networks
In evolutionary biology, phylogenetic networks are constructed to represent the evolution of species in which reticulate events are thought to have occurred, such as recombination and hybridization. It is therefore useful to have efficiently computable metrics with which to systematically compare such networks. Through developing an optimal algorithm to enumerate all trinets displayed by a level-1 network (a type of network that is slightly more general than an evolutionary tree), here we propose a cubic-time algorithm to compute the trinet distance between two level-1 networks. Employing simulations, we also present a comparison between the trinet metric and the so-called Robinson-Foulds phylogenetic network metric restricted to level-1 networks. The algorithms described in this paper have been implemented in JAVA and are freely available at (https://www.uea.ac.uk/computing/TriLoNet).
keywords:Phylogenetic tree, Phylogenetic network, Level-1 network, Trinet, Robinson-Foulds metric
Msc:68Q17, 05C05, 05C85, 92B05
Various types of phylogenetic networks have been introduced to explicitly represent the reticulate evolutionary history of organisms such as viruses and bacteria in which processes such as recombination and lateral gene transfer occur (1). Essentially, such networks are binary, directed acyclic graphs with a single root, whose leaves correspond to the organisms or species in question. Here we focus on level-1 networks, a type of phylogenetic network that is slightly more general than an evolutionary tree, and closely related to so-called galled-trees (see, e.g. (2)). Level-1 networks are characterized by the property that any two cycles within them are disjoint (see the next section for a formal definition and Fig. 1 for an example). Due to the availability of practical algorithms for their construction (3); (4), level-1 networks have attracted much attention in recent years (see, e.g. (2); (5); (6); (7)) and they have been used to, for example, represent the evolution of the fungus Fusarium graminearum (1), and that of HBV (4).
A key challenge for phylogenetic networks is to quantify the incongruence between two networks which represent competing evolutionary histories for a given dataset. Such pairs can arise, for example, when different networks are inferred using different methods or construction (see e.g. (8) for an overview of network building methods). In consequence, various metrics have been developed for comparing phylogenetic networks (cf. Chapter 6 in (1) for an overview). Ideally, such a metric should be efficient to compute since it may need to be repeatedly computed (for example, in simulations such as the ones that we present later in this paper). Moreover, it is useful if the diameter can be derived for the metric (i.e. the maximum value for the metric taken over all pairs of all possible networks) so that distances can be normalized.
Here we develop an efficient cubic-time algorithm to compute the trinet distance between two level-1 networks, that is, the number of trinets (i.e., networks on three taxa) displayed by one but not both networks. We also give the diameter of this metric. The trinet metric was introduced in (9) and used in (4) to compare the performance of network inference algorithms. Note that the trinet distance is closely related to the triplet distance, which is the number of 3-leaved trees exhibited by one but not both networks (see, e.g. (10)). However, in contrast to the trinet metric, the triplet metric is not proper in that there exist pairs of distinct level-1 networks whose triplet distance is zero. In addition to the trinet metric, other proper metrics that can be used for comparing level-1 networks include the tripartition metric (11), the path-multiplicity metric (12), the NNI metric (13), and the Robinson-Foulds metric (2). Among these metrics, only the NNI metric was specifically defined for level-1 networks, while the others were introduced for more general classes of networks and can be restricted to level-1 networks to give proper level-1 metrics. However, establishing the diameters for these other metrics on level-1 metrics appears to be a challenging problem, although in this paper we shall derive the diameter for the restricted Robinson-Foulds metric.
In the next section we introduce some basic notation and state the main result: an optimal algorithm to enumerate the trinets displayed by a level-1 network and a cubic-time algorithm to compute the trinet distance between two level-1 networks (Theorem 1). In Section 3 we present some structural results concerning level-1 networks which we then use to prove the main result in Section 4. In Section 5 we present a comparative study between the trinet and the Robinson-Foulds metrics, in which we compute some empirical distributions for randomly generated level-1 networks. We conclude in Section 6 with a discussion of some future directions.
Let be a finite set of taxa with cardinality . A rooted phylogenetic network (or simply a network) on a finite set is a simple, acyclic digraph with a unique root, no degenerate vertices (i.e., vertices with indegree one and outdegree one), whose leaves are bijectively labelled by the taxa in . A network is binary if all non-leaf vertices have indegree and outdegree at most two, and all vertices with indegree two have outdegree one. A vertex is a tree vertex if it has outdegree two, and a reticulation if it has indegree two. A network is level- if the maximum number of reticulations contained in any of its biconnected components is at most . Note that a network is level-1 if it is binary and all of its cycles (in its underlying graph) are disjoint (1) (see Fig. 1 for an example). All networks mentioned in this paper, unless stated otherwise, are level-1.
Given a network, an arc whose removal disconnects the network is a cut arc. If a vertex is on a dipath from the root to a vertex , then we say is below and is above , and write this as (or when holds). The set of all taxa below a vertex is called the cluster of . A common ancestor of a taxon subset is a vertex with . A lowest common ancestor (LCA) of is a common ancestor of that is not above any other common ancestors of . A stable ancestor of is a vertex contained in every dipath from the root to some taxon in . The lowest stable ancestor (LSA) of is the unique vertex such that is below every stable ancestor of . Note that a LCA of is necessary below (c.f. (14)). Finally, the lsa table of is the data structure that maps each pair of district taxa to (see Fig. 2 for an illustration).
A binet is a network on two taxa and a trinet is a network on three taxa. Up to relabelling, there exist two types of binets and eight types of trinets (9), all presented in Fig. 3. In the following, we will use the notation in that figure to refer to specific trinets and binets. Binets and are referred to as a cherry and a reticulate cherry, respectively. Note that a reticulate cherry is not symmetric, that is, is distinct from .
Given a network and a taxon subset of , the network is the network obtained from by deleting all vertices and arcs that are not on a dipath from to some leaf in , and repeatedly suppressing degree 2 vertices and replacing parallel arcs by single arcs until neither operation is applicable. Let and be the set of all binets and trinets displayed by , respectively. It is known that a level-1 network is determined by its set of trinets (9).
The trinet distance between two networks and on the set is the number of trinets contained in the symmetric difference of the sets and . The distance is a metric on the set of level-1 networks (9). Moreover,
holds for any pair of networks with equality holding if, for example, is a tree and is a saturated level-1 network (that is, each non-leaf vertex is contained in a cycle of size three; see (13)). Hence, the diameter of is . We now present our main result, whose proof will be presented in Section 4.
The set of trinets displayed by a level-1 network on can be constructed in time. In addition, the trinet-distance between two level-1 networks and on can be computed in time .
3 Theoretical Results
In this section, we present some structural results concerning level-1 networks. First, note that given a level-1 network on , we have
with equality holding if and only if is saturated. The proof of this fact is similar to that for Lemma 1 in (13), and so we omit it.
Next, we show that in a level-1 network , each taxon subset of has a unique lowest common ancestor, denoted by . Note that this is not true for level-2 networks (see, for example, ((1), p140)).
Each taxon subset of has a unique lowest common ancestor in a level-1 network on . Moreover, either holds or there exists a unique dipath from to , which does not contain any cut arc.
[Proof.]We may assume that since otherwise the proposition clearly holds. Fix a LCA of and let . Without loss of generality, we may also assume that as otherwise and the proposition follows.
We first show that there exists a cycle in containing both and . To this end, fix a dipath from to . It suffices to show that contains no cut arc. If this is not the case, let be a cut arc in . Then and every dipath from the root of to a taxon below must contain . Together with and , this implies is a stable ancestor of , a contradiction to .
It remains to show that is the unique LCA of . If not, let be a LCA of with . Then neither nor holds. Now an argument similar to that in the last paragraph shows that and belong to the same cycle . In addition, is the highest vertex in . Let and be the two interior disjoint dipaths in so that contains and contains . Let be the child of contained in and be the other child. Then is a cut arc. Since is not a common ancestor of , there exists a taxon with . Since is not above and is a cut arc, is not above , a contradiction.
By the last proposition, a pair of distinct taxa in a level-1 network on have a unique LCA, denoted by . Moreover, is precisely the interior vertex for which one child of is above but not while the other child of is above but not .
A splitting ancestor of and is an interior vertex of such that precisely one taxon from is below both of its children (while the other taxon is below only one of its two children). For instance, in the root is the unique splitting ancestor of and while contains none.
Suppose are distinct taxa in a level-1 network on . Then the following three assertions are equivalent:
(1a) is a cherry;
(1c) and do not have a splitting ancestor.
Moreover, the following three assertions are also equivalent:
(2a) is a reticulate cherry;
(2c) is the unique splitting ancestor of and .
[Proof.]Let . It is easy to show “(1a)(1b)”, from which “(2a)(2b)” follows.
“(1b)(1c)”: Assume and that is a splitting ancestor of and . Let be the child of with . Swapping and if necessary, we assume the other child is above but not . Consider a dipath from the root of to and a dipath from to such that neither of them contains . Let be the dipath constructed by combining , , and the arc . Then is a dipath from to that does not contain , in contradiction to being a stable ancestor.
“(2b)(2c)”: By Proposition 2 there exists a unique path from to . Fix an arbitrary vertex in . Then is 2 if , less than 2 if , and equal to 0 if is neither above nor below . Therefore, contains all splitting ancestors of and , from which it follows that is the unique splitting ancestor of and .
Finally, we have “(1c)(1b)” as its contrapositive follows directly from “(2b)(2c)”. Similarly, “(2c)(2b)” holds as its contrapositive follows directly from “(1b)(1c)”.
In this section we present an algorithm for extracting trinets from a network on , from which we can also immediately compute the trinet distance between pairs of networks.
4.1 Extracting Binets
Our first step (see Algorithm 1) is to construct and the lsa table for a level-1 network on in time .
[t] for all in and
Find a topological sort of all tree vertices so that implies and construct for every vertex \For to Let and the clusters displayed by the two children of \For and and \For and \If and return the set and the table .
Fix a taxon subset in . We shall show that is the only binet on that is contained in the set constructed in the algorithm, and that . The first case is that is a cherry. Then by Theorem 3(1), does not contain a splitting common ancestor of and , and hence the pair does not occur in the for loop starting with line 4. In addition, by the comment below Proposition 2 the pair and occurs once in the for loop starting with line 7 (when ), from which it follows is the only binet on contained in , and .
Now consider the second case in which is a reticulate cherry. Swapping the subscripts if necessary, we may assume that . By Theorem 3, is the unique splitting ancestor of and in and hence the pair and will occur in the for loop starting with line 5 (when ), referred to here as the first event. Since is below both children of , it follows that in line 6 the binet is added to and .
Next, since is the unique LCA of and , the pair and will also occur once in the for loop starting with line 7 (when ), referred to as the second event. Since and the vertices of are topologically sorted, the first event always occurs before the second one. So when the second event occurs, has already been assigned to a vertex in , and hence line 9 will be skipped. Therefore, is the only binet on in the set , and .
Finally, Eq. (2) implies that contains vertices and arcs, and hence line 2 can be computed in . Moreover, the analysis in the above three paragraphs implies that a pair of taxa is checked precisely once in line 7 and at most once in line 5. Therefore, the running time of Algorithm 1 is .
4.2 Extracting Trinets
Our next step is to extract trinets from a network on . A key insight that we shall use is that each of the eight trinet types has a unique signature in terms of the binets it contains and the lsa table.
For a network on and a triple in , swapping the indices if necessary, we can assume that , and is either or . Let be the number of cherries among the three binets on . Then can be inferred as follows.
Case : Let be if , and otherwise.
Case : If , let be . Otherwise if , then . Finally, let be if contains , and otherwise.
Case : Let be if , and otherwise.
Clearly, the above process can be completed in constant time, and hence the trinet set displayed by a network on , as well as the trinet distance between two networks on can be computed in , from which Theorem 1 follows.
To obtain some intuition concerning the empirical behaviour of the trinet metric, and how its behaviour compares to that of the Robinson-Foulds distance, we implemented both metrics and performed experiments in which we computed distributions of the metrics for pairs of randomly generated networks. Note that similar experiments have been performed to understand properties of tree metrics (15) and RNA metrics (16).
We begin by recalling the Robinson-Foulds phylogenetic network metric , which can be restricted to give a metric on level-1 networks, and can be regarded as a generalization of a commonly used metric on evolutionary trees with the same name (17). For a pair of level-1 networks on the distance is defined to be the size of the multiset that is the symmetric difference of the two multisets of the clusters induced by and (2). Since the root and leaf vertices of and both induce identical clusters, it follows that
Using Eq. (2), it is straight-forward to check that the last inequality implies that
holds for arbitrary pairs , with equality holding in case, for example, and are both saturated networks that are obtained by replacing each interior vertex in two distinct trees whose only common clusters are and the singletons with a cycle of size three. It follows that the diameter of is . Using Eq. (1) and Eq. (3), for comparison purposes in our experiments we normalized both the trinet and Robinson-Foulds metrics to take on values between and .
To perform our experiments, we generated three sets of random level-1 networks on 50 leaves in two ways as follows. The first two datasets, Lev(1) and Lev(10), were generated under the model detailed in (3), using one and ten seeds, respectively, which can be found on the website mentioned in the abstract. The third dataset, Ran, was generated using the following procedure: starting with a random binet on two taxa, in each step the current network is grown by adding a taxon to a randomly chosen arc employing one of the three operations depicted in Fig. 4 until a level-1 network with the specified leaf-set is obtained.
Distributions for the normalized trinet and Robinson-Foulds metrics for the three datasets are presented in Fig. 5. For our datasets, we see that the trinet metric has a much larger range of values and a larger variance as compared to the Robinson-Foulds metric. This is quite similar to the behaviour of the Robinson-Foulds metric and quartet-distance evolutionary tree metrics described in ((15), Fig.6). Note that there is a gap between the values presented in Fig. (5)-(ii)a, which could be caused by the choices of seeds in the generators and by the way that the metric is normalized.
We also performed similar studies on networks with 15 and 25 leaves and found that as the number of leaves increases, the distribution of range of values got tighter for both metrics. For example, the standard deviations of the normalized and metrics on the Ran datasets with 15, 25 and 50 leaves were 0.034, 0.032, 0.032, and 0.046, 0.038, 0.026, respectively. We also recorded the timings for computing the two metrics. On a MacBook Pro computer with an i7 processor and 16 GB RAM, the average time for computing the trinet metric on Lev(1), Lev(10), and Ran were 140, 145, and 231 minutes for the trinet metric and 16, 21, and 58 minutes for the Robinson-Foulds metric. Thus, as anticipated, the Robinson-Foulds metric appears to be somewhat faster to compute in practice.
We have presented an algorithm which allows us to compute the metric for level-1 networks, and demonstrated that this allows us to compute this metric in reasonable time for networks of up to 50 leaves. We have seen that although the metric is faster to compute, it does not give the range of values that might be necessary to properly distinguish between networks. However, for certain applications could still serve as a rough measure of distance suffices when timings are more critical. Thus, as suggested for tree and RNA metrics in (15); (16), we do not advocate using or over any other metric; the choice of metric will depend very much on the application.
Although our cubic-time algorithm for enumerating the trinets displayed by a level-1 network is optimal, it would be interesting to know if there is a more efficient algorithm to compute the trinet distance between two level-1 networks which does not involve listing the trinets displayed by the networks. Note that the Robinson-Foulds distance between two trees can be computed in linear time using Day’s algorithm (18) without the need to list all clusters displayed by the trees (see also (19) for a sublinear approximation algorithm). It may be worth exploring whether similar ideas could be exploited to more efficiently compute the trinet and the Robinson-Foulds distance between two level-1 networks.
In future work, it could be of interest to determine analytical formulae for the expected values and variances of and as well as other metrics. Such formulae were given for different types of tree metrics in (15). However, as a first step it would be probably be necessary to develop ways to generate level-1 networks with a certain distribution (e.g. uniformly at random), which appears to be a challenging problem.
In addition to the two metrics studied here, as mentioned in the introduction there are other proper metrics on level-1 networks (e.g. the tripartitions (2) and NNI (13) metrics). However, we do not know how to normalize these metrics by finding their diameters. This is important for comparison purposes, for example, in the experiments that we present above. Therefore it would be interesting to find the diameter for other level-1 metrics and so that they can be systematically compared with and .
Finally, in this paper we have only considered level-1 networks, and it could be useful to develop efficient algorithms to compute trinet metrics for level- networks with , especially in the case where the trinets are known to determine the network (20). However, for networks with much higher levels this is likely to be challenging since, as opposed to level-1 networks, they are not necessarily determined by their trinets (21). Therefore it might also be of interest to restrict to special classes of networks (e.g. tree-child networks), where more is known concerning their structure (20).
Acknowledgements The authors thank the anonymous referee for help suggestions on an earlier version of this manuscript.
- D. Huson, R. Rupp, C. Scornavacca, Phylogenetic Networks: Concepts, Algorithms and Applications, Cambridge University Press, 2010.
- G. Cardona, M. Llabres, F. Rossello, G. Valiente, Comparison of galled trees, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 8 (2011) 410–427.
- K. Huber, L. van Iersel, S. Kelk, R. Suchecki, A practical algorithm for reconstructing level-1 phylogenetic networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 8 (2011) 635–649.
- J. Oldman, T. Wu, L. van Iersel, V. Moulton, Trilonet: Piecing together small networks to reconstruct reticulate evolutionary histories, Molecular Biology and Evolution (2016) in press.
- L. Wang, K. Zhang, L. Zhang, Perfect phylogenetic networks with recombination, Journal of Computational Biology 8 (2001) 69–78.
- J. Jansson, N. B. Nguyen, W.-K. Sung, Algorithms for combining rooted triplets into a galled phylogenetic network, SIAM Journal on Computing 35 (2006) 1098–1121.
- D. Gusfield, ReCombinatorics: The Algorithmics of Ancestral Recombination Graphs and Explicit Phylogenetic Networks, MIT Press, 2014.
- D. Huson, C. Scornavacca, A survey of combinatorial methods for phylogenetic networks, Genome biology and evolution 3 (2011) 23–35.
- K. Huber, V. Moulton, Encoding and constructing 1-nested phylogenetic networks with trinets, Algorithmica 616 (2012) 714–738.
- J. Jansson, A. Lingas, Computing the rooted triplet distance between galled trees by counting triangles, Journal of Discrete Algorithms 25 (2014) 66–78.
- B. M. Moret, L. Nakhleh, T. Warnow, C. R. Linder, A. Tholse, A. Padolina, J. Sun, R. Timme, Phylogenetic networks: modeling, reconstructibility, and accuracy, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 1 (2004) 13–23.
- G. Cardona, F. Rossello, G. Valiente, Comparison of tree-child phylogenetic networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 6 (2009) 552–569.
- K. Huber, S. Linz, V. Moulton, T. Wu, Spaces of phylogenetic networks from generalised nearest-neighbor interchange operations, Journal of Mathematical Biology 72 (2016) 699–725.
- J. Fischer, D. H. Huson, New common ancestor problems in trees and directed acyclic graphs, Information processing letters 110 (2010) 331–335.
- M. Steel, D. Penny, Distributions of tree comparison metrics-some new results, Systematic Biology 42 (1993) 126–141.
- V. Moulton, M. Zuker, M. Steel, R. Pointon, D. Penny, Metrics on RNA secondary structures, Journal of Computational Biology 7 (2000) 277–292.
- D. Robinson, L. Foulds, Comparison of phylogenetic trees, Mathematical biosciences 53 (1981) 131–147.
- W. Day, Optimal algorithms for comparing trees with labeled leaves, Journal of classification 2 (1985) 7–28.
- N. Pattengale, E. Gottlieb, B. Moret, Efficiently computing the Robinson-Foulds metric, Journal of Computational Biology 14 (2007) 724–735.
- L. van Iersel, V. Moulton, Trinets encode tree-child and level-2 phylogenetic networks, Journal of Mathematical Biology 68 (2014) 1707–1729.
- K. Huber, L. Van Iersel, V. Moulton, T. Wu, How much information is needed to infer reticulate evolutionary histories?, Systematic biology 64 (2015) 102–111.