A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks

A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks

Katharina T. Huber, Leo van Iersel, Steven Kelk and Radosław Suchecki

Recently much attention has been devoted to the construction of phylogenetic networks which generalize phylogenetic trees in order to accommodate complex evolutionary processes. Here we present an efficient, practical algorithm for reconstructing phylogenetic networks - a type of network slightly more general than a phylogenetic tree - from triplets. Our algorithm has been made publicly available as the program Lev1athan. It combines ideas from several known theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines. Namely, an exponential-time exact and a greedy algorithm both of which are of independent theoretical interest. Most importantly, Lev1athan runs in polynomial time and always constructs a level-1 network. If the data is consistent with a phylogenetic tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent with some network, it will find such a network. The potential of Lev1athan is explored by means of an extensive simulation study and a biological data set. One of our conclusions is that Lev1athan is able to construct networks consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of noise.

Huber and Sucheki are affiliated with the School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, United Kingdom, email: Katharina.Huber@cmp.uea.ac.uk, R.Suchecki@uea.ac.uk. Van Iersel is affiliated with University of Canterbury, Department of Mathematics and Statistics, Private Bag 4800, Christchurch, New Zealand, email: l.j.j.v.iersel@gmail.com. Kelk is affiliated with the Centrum voor Wiskunde en Informatica (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands, email: s.m.kelk@cwi.nl.

1. Introduction

Phylogenetic networks such as the one depicted in Figure 1 provide a natural and powerful extension of the concept of a phylogenetic tree (see Section 2 for precise definitions of these two concepts as well as the other concepts used in this paper) to accommodate complex evolutionary processes such as hybridization, recombination or horizontal gene transfer. Consequently, their attractiveness to evolutionary biology as a model for representing the evolutionary past of a set of taxa (e.g. species represented by gene or genetic marker sequences) whose evolution might have been driven by such processes has grown over the years. This in turn has generated much interest in these structures from mathematicians and computer scientists working in phylogenetics [9, 13, 15, 23].

Figure 1. A phylogenetic network with leaf set in the form of a level-1 network.

The desire of biologists to use ever-longer sequences, combined with the computational complexities involved in dealing with such sequences, has meant that much research in mathematical methodology and algorithm development has focused on developing indirect methods for phylogenetic reconstruction. This has spawned the thriving area of supertree construction [2] which aims to reconstruct a phylogenetic tree by puzzling it together from smaller trees. Inspired by this - and guided by the fundamental observations that (i) a phylogenetic tree is uniquely described by the set of triplets (i.e. rooted binary phylogenetic trees on three leaves) it induces (see e.g. [24]) and (ii) that a phylogenetic network is a generalization of a phylogenetic tree, two main triplet-based approaches for phylogenetic network reconstruction immediately suggest themselves. One is to essentially first employ a method such as TreePuzzle (see e.g. [22]) to generate a set of quartets111The analogue of a triplet within the area of unrooted phylogenetic tree reconstruction. from a sequence alignment, to then derive a set of triplets from that set and then to use the set to reconstruct a phylogenetic network. The other is to essentially first reconstruct phylogenetic trees on subsections of the given alignment (that might reflect different evolutionary scenarios) and then to reconstruct a phylogenetic network from the set of triplets induced by those trees.

For any set of triplets, a phylogenetic network that, in a well-defined sense, is consistent with all triplets in can easily be constructed if the complexity of the network is essentially unbounded [17, 31]. Such networks are however only of marginal biological relevance. Researchers have therefore turned their attention to studying restricted classes of phylogenetic networks. One such class is that of a network or a galled tree i.e. a phylogenetic network in which essentially any two cycles are vertex disjoint (see Figure 1 for an example). Such networks are of practical interest because (i) they are relatively treelike, and (ii) their simple structure suggests the possible existence of fast algorithms to construct them. Indeed, Jansson, Sung and Nguyen showed that it can be decided in polynomial time whether there exists a network with leaf set consistent with all input triplets [16, 17], if the input triplet set is dense, i.e. if a triplet is given for each combination of three taxa in . Their algorithm will also construct such a network, if it exists. However, a network consistent with all input triplets might not exist for several reasons. Firstly, the real evolutionary history might be too complex to be described by a network. Secondly, some of the input triplets might be incorrect (which is likely to be the case in practice).

One response to this problem has been to increase the complexity of the networks that can be modelled. For example, it has been shown that, for each fixed non-negative integer , the problem of constructing a level- network consistent with a dense set of input triplets is polynomial-time solvable [26, 28]. The higher the level, the higher the complexity of evolutionary scenarios that can be represented. However, the running time grows exponentially with and initial experiments with the related heuristic Simplistic [30, 29] show that, since these algorithms insist on full consistency with the input triplets, only a small amount of noise is required in the input data to artificially inflate the level of the produced network. This causes an undesirable increase in both running time and network complexity.

A second strategy is to place a ceiling on the complexity of the networks that can be constructed and to no longer demand full triplet consistency. Implemented in the program Lev1athan [12], this paper adopts this second strategy and presents the first heuristic algorithm for constructing networks from triplets. Given any set of triplets, our heuristic always constructs a network in polynomial time, which is in practice a great advantage in light of the algorithmic results mentioned above. Moreover, it attempts to construct such that it is consistent with as many of the input triplets as possible. If a weight is given for each triplet reflecting for example some kind of confidence level one might have in that triplet, then our heuristic aims to maximize the total weight of the input triplets consistent with . Optimizing such functions is NP-hard [31, 32], even for the case of determining a tree consistent with the maximum number of triplets from an unweighted, dense triplet set. Having said that, an exponential-time exact algorithm was proposed in [31] that always finds an optimal solution, but this algorithm is only practical for small numbers of taxa. In addition, polynomial-time approximation algorithms have been formulated for level-1 network reconstruction. For example it was first shown in [16] how to construct in polynomial time such a network consistent with 5/12 of the input triplets. This was subsequently improved to in [4], which is worst-case optimal. Both algorithms are mathematically interesting, but have the drawback that they produce networks with a highly rigid topology which are biologically unrealistic.

Lev1athan, which we outline in Section 3, combines elements from the above mentioned approaches into an algorithm with a strong recursive element. In addition to its polynomial running time, Lev1athan enjoys the following desirable properties. If a set of input triplets is consistent with a tree, or if at any stage in a recursion such a triplet set occurs, a phylogenetic tree will be generated from . Similarly, whenever the triplet set is dense and fully consistent with some network, such a network will be produced. Both outcomes are a direct consequence of a partitioning strategy that we describe in Section 4. If a network is produced that (when ignoring directions on the arcs) contains cycles of moderate size, then, due to a novel exponential-time exact algorithm which we describe in Section 5, the topology of each of these cycles is locally optimal. This algorithm is complemented by a novel greedy algorithm to construct larger cycles which we also describe in that section. In addition, the output network is guaranteed to be consistent with at least 5/12 of the input triplets, if one uses the Lev1athan option “blocks 3” which has its origin in the above mentioned partitioning strategy. In Section 6 we test Lev1athan on synthetic and real biological data; to facilitate this we also develop a novel network generation method and discuss several measures for network comparison. Taken as a whole the results of the experiments are promising, suggesting that Lev1athan will be genuinely useful in a real-world context, but do also highlight some limitations of triplet-based approaches and level-1 networks as a model of evolution.

We conclude this section with remarking that although Lev1athan can be applied to both weighted and unweighted triplet sets for clarity we will restrict our exposition to unweighted triplet sets.

2. Basic Concepts

We start with some concepts from graph theory concerning directed acyclic graphs. Suppose is a directed acyclic graph, or DAG for short. Then is called connected (also called “weakly connected”) if, when ignoring the directions on the arcs, there is a path between any two vertices of . Moreover, is called biconnected if contains no vertex whose removal disconnects . A biconnected subgraph  of a graph  is said to be a biconnected component if there is no biconnected subgraph  of that contains . A biconnected component is said to be nontrivial if it contains at least three vertices. A vertex of is called an ancestor of a vertex if there is a directed path from to . In this case, is also called a descendant of . Now suppose that and are two distinct vertices of . Then a vertex is a lowest common ancestor of and if is an ancestor of both and and no descendant of is also an ancestor of and .

A (phylogenetic) network on some finite set of taxa is a DAG with a single root (a vertex with indegree 0), whose leaves (vertices with outdegree 0) are bijectively labelled by the elements of . Following common practice, we identify each leaf with its label. Thus, the leaf set of , denoted , is . Vertices with indegree 1 are called split vertices and vertices with indegree at least two are called reticulations. A network is called semi-binary if each reticulation has indegree 2 and binary if in addition each reticulation has outdegree 1 and each split vertex outdegree 2. We will restrict our exposition to binary networks, but remark that Lev1athan uses post-processing to simplify a constructed network by collapsing arcs while making sure that triplet consistency is preserved. This can create vertices with outdegree greater than two.

A semi-binary network is said to be a level- network, if each biconnected component contains at most  reticulations. Clearly, if then this implies that is a phylogenetic tree in the usual sense (see e.g. [24]). It is not difficult to verify that for , a nontrivial biconnected component of a level- network always has the form of two directed paths that both start at some vertex and both end at some vertex but which are otherwise vertex disjoint. Such a biconnected component is often called a reticulation cycle or a gall. We define the size of a gall to be the number of vertices in it i.e. the number of outgoing arcs plus one. The network in Figure 2(a) contains a single gall of size 7, for example. Let be a phylogenetic network. An arc of is said to be a cut-arc if removing disconnects . A cut-arc is called trivial if its head is a leaf. A level- network is said to be a simple level- network, if contains no nontrivial cut-arcs and is not a level- network. Thus, a simple network can be thought of as a single gall with some leaves attached to it, as in Figure 2(a). Simple networks are the basic, recursive building blocks of networks, in the sense that each gall of a network essentially corresponds to a simple network.

A (rooted) triplet is a phylogenetic tree on such that the lowest common ancestor of and is a proper descendant of the lowest common ancestor of and , see Figure 2(b). For convenience, we call a set of triplets a triplet set. A triplet is said to be consistent with a network (interchangeably: is consistent with ) if contains a subdivision of , i.e. if contains distinct vertices and and pairwise internally vertex-disjoint paths , , and . For example, the network in Figure 2(a) is consistent with (among others) the triplets , and  but is not consistent with (among others) the triplets  and . We say that a triplet set is consistent with if all triplets in are consistent with and denote the set of all triplets on consistent with by .

Figure 2. (a) Example of a simple level-1 network on and (b) the triplet .

We note that, whenever a network contains a gall of size 3, can be replaced by a single split node to obtain the network where . In such cases we prefer the latter construction to the gall because it is a more parsimonious response to the input data. We henceforth sharpen the above definition of a level-1 network to exclude galls of size 3. (Galls of size 2 are already excluded because we do not allow multiple arcs).

3. A brief outline of Lev1athan

In this section with present a rough outline of our 4 phase program Lev1athan which is implemented in Java and freely available for download [12]. It takes as input a set of triplets or, more generally, a set of phylogenetic trees (e.g. gene trees) given in Newick format [8]. In the latter case the (weighted) union of the triplet sets induced by the trees in is taken by Lev1athan as triplet set . It outputs a (possibly post-processed) level-1 network in DOT format [11] and/or eNewick format [27]. The goal of Lev1athan is to construct a level-1 network that maximizes the number of triplets in that are consistent with it. In an optional post-processing phase the generated network can then be simplified by contracting all arcs of where neither nor is a reticulation, is not a leaf, such that the contraction would not cause a triplet in that was consistent with to become not consistent with it. (Leaving such arcs uncontracted is tantamount to an unfounded strengthening of our hypothesis about what the “true” evolutionary scenario looked like). To explain the program’s four phases, we require some more definitions and notation. To this end, suppose is a set of triplets and define the leaf set of to be the set . For any subset , we denote by the set of triplets in such that .

0:  : Triplet set .
0:  : Level-1 network heuristically optimizing the number of triplets consistent with .
1:  Partitioning the leaf set: Find a partition  of . This will be detailed in Section 4.
2:  Gall construction: Construct a simple level-1 network  with leaf set . This will be detailed in Section 5.
3:  Recursion: Recursively call Lev1athan to construct, for all , a level-1 network from the triplet set .
4:  Merging: Construct a network  by combining  with  as follows. For , identify leaf  of  with the root of .
Algorithm 1 Basic outline of Lev1athan

Note that the algorithm does not backtrack in the sense that it never revises earlier made decisions. Also note that while the recursion and merging phases are relatively simple, the partition and gall construction phases are not. For these phases we designed new algorithms, which will be described in the following two sections.

4. Partitioning the Leaf Set

Based on their order of priority, we next describe the three steps that make up the partitioning strategy employed by Lev1athan to find a suitable partitioning of the leaf set of a triplet set. To explain this strategy in detail, assume for the rest of this section that is a set of triplets and put .

The first step of Lev1athan’s partitioning strategy is to determine whether an Aho move is possible for . This move is based on an algorithm originally introduced by Aho et. al. in [1]. Following [24] where this algorithm is referred to as Build algorithm, this algorithm relies on the clustering graph induced by on any subset . For the convenience of the reader, we remark that for any subset the vertex set of is and any subset forms an edge in if there exists some such that .

Aho move: Construct the clustering graph . If is not connected then use the vertex sets of the connected components of as the blocks of the partition . Otherwise, say that the Aho move is not successful.

Note that the Aho move is essentially the embedding of the Build algorithm inside Lev1athan. Always starting with such a move means that, if a set of triplets is consistent with a phylogenetic tree, then a phylogenetic tree will be output. More generally it means that Lev1athan will potentially produce level-1 networks with treelike regions. Also note that if the Aho move is successful then the Gall construction phase of Lev1athan is not necessary: the components of the clustering graph will simply correspond to subnetworks that the algorithms “hangs” from a single split vertex, just like the Build algorithm. (This can lead to split vertices with more than two children, which as already mentioned Lev1athan supports via post-processing).

Our motivation for prioritising the Aho move is twofold. Firstly, it adheres to the parsimony principle: if the data can be explained equally well by both a phylogenetic tree and a network, choose the tree. As a non-trivial example, note that the triplet set induced by the phylogenetic tree on depicted in Figure 3(a) is also consistent with the level-1 network on depicted in Figure 3(b) and that is in some sense minimal with this property, that is, no arc of can be deleted such that the triplets in remain consistent with the resulting phylogenetic tree. Secondly it exploits the not entirely trivial observation that, when attempting to construct a level-1 network consistent with the maximum number of triplets in a given set, it can never be suboptimal to make an Aho move, when this is possible [25].

Figure 3. The triplets in are also consistent with and is minimal with this property.

If the Aho move is not successful, the next step is to try a JNS move. To explain this move we require a further concept which was originally introduced by Janson, Nguyen and Sung in [16]. Suppose . Then is called an SN-set (of ) if there exists no triplet in such that and . A SN-set of is called maximal if and there does not exist an SN-set such that . Also, for some partition of into blocks , , we define the induced triplet set of as the set of all triplets on for which there exists a triplet where and and , and are all distinct. Note that is dense if is dense.

JNS move: If (a) the set of maximal SN-sets of is a partition of , and (b) is a dense set of triplets that is consistent with some simple level-1 network , then define to be the sought after partition (and use as in the Gall construction phase). Otherwise and as in the case of an Aho move, say that the move is not successful.

The JNS move is essentially the level-1 analogue of the above described Aho move. The main difference is that although, using for example the Build algorithm, it is always possible to decide in polynomial time if a triplet set is consistent with a phylogenetic tree it is in general NP-hard to determine whether there is a level-1 network that is consistent with the set (even if we restrict to simple level-1 networks). However, the situation changes and becomes decidable in polynomial time if the triplet set in question is dense [16]. Density of the induced triplet set is therefore a necessary requirement for a successful JNS move. Hence, requirement (b). To motivate requirement (a), note that for dense triplet sets the set of maximal SN-sets of always forms a partition of . For general (i.e. non-dense) triplet sets this is not always, but sometimes, true. Requirement (a) is thus included to extend JNS moves to non-dense triplet sets.

We conclude the discussion of the JNS move with remarking that this move enjoys the same optimality properties as an Aho move. More precisely, prioritising the JNS move guarantees that a level-1 network consistent with all the original triplets will be produced if such a network exists and the original triplet set was dense. (And, less obviously, that making a JNS move can never lead to a suboptimal network, assuming subsequent recursive steps give optimal networks). Again analogous to a successful Aho move, a successful JNS move allows the full generality of the Gall construction phase (see below) to be bypassed by utilizing the already computed simple level-1 network .

Due to noise in real biological data (or the inherent complexity of the underlying network) however, it is generally too much to hope for that one of the two moves described so far will always be successful. We therefore have adapted a strategy from [16] to obtain a third move which we call the Heuristic move. The heart of this move is formed by a score function , a partition of the leaf set of , from [16] plus two operations (P1) and (P2) that will allow us to manipulate a partition with the aim of ensuring that the score of the new partition is at least as good as the score of the given partition, i.e. holds. To describe both the score function and these two operations, we require some more concepts. Suppose is a partition of . Then to we associate the following four subsets

Note that clearly forms a partition of . The aforementioned score function is then defined as . Denoting as above a given partition by and the generated partition by then the two aforementioned operations (P1) and (P2) are defined as follows.

  1. If and then .

  2. If with and then .

Armed with these definitions, we are now ready to state the Heuristic move.

Heuristic move: Starting with we exhaustively search for an operation (P1) or (P2) which, when applied to , will create a partition with . If no such exists then the sought after partition is and we are done. Otherwise put and repeat.

Note that the Heuristic move is guaranteed to terminate in polynomial time and that it will never return as the final partition [16]222To avoid finishing with we allow in the first, and only the first, iteration an operation to be applied as long as this does not decrease the score. After this operations are only permitted if they strictly increase the score.. Also note that the Heuristic move can generate partitions with 4 or more blocks (as long as this raises the score). This is in contrast to its analog in [16, page 1118] where the number of blocks in a partition is restricted to 3. Arguably somewhat unassuming looking this restriction to 3 blocks guaranteed that for any triplet set a level-1 network could be constructed which was consistent with a fraction of the triplets in [16]. Interestingly - and although the system of inequalities underpinning the 5/12 result also holds in the case when there is no restriction on the number of blocks, as in this case - a simple level-1 network (the construction of which is the purpose of the second phase of Lev1athan) is in the worst case consistent with no more than of the triplets in a given set. An example in point is a partition where each block is of size one. For such a partition the guaranteed lower bound in the worst-case tends to 1/3. So removing the restriction to 3 blocks can theoretically undermine the 5/12 lower bound of [16]. However, and as suggested by examples, the dropping of the 3-block restriction allows in practice the construction of a wider range of networks (and network topologies) that are consistent with a higher percentage of triplets (see the supplementary data section of [12]). If the user nevertheless requires the 5/12 lower bound to be mathematically guaranteed, then this can be ensured by limiting the number of blocks to at most 9.

5. Constructing Simple Level-1 Networks

In this section, we turn our attention to the second phase of Lev1athan which is concerned with constructing a simple level-1 network from a triplet set such that the number of triplets in consistent with that network is maximized. Since this optimization problem is NP-hard [16] in general, we propose to do this heuristically. (We remark that for small instances Lev1athan will also compare the heuristically computed simple level-1 network with an optimal tree computed using Wu’s exponential-time algorithm [32]. If the tree is consistent with at least as many triplets as the simple level-1 network then Lev1athan will return the tree, parsimony again being the motivation for this. We will not discuss this step further). Once again we emphasise that if Lev1athan chooses for an Aho or JNS move when partitioning the leaf set of a triplet set, then the algorithms described in this section will not be used by Lev1athan.

To be able to describe our heuristic we need to generalize our notion of consistency which we will do next. Suppose that is a triplet set and that is a partition of the leaf set of . Suppose also that are distinct elements in and that is a network with leaf set . Then we say that is consistent with if there exist distinct sets with such that is consistent with . Furthermore, we say that  is inconsistent with  if  is not consistent with  but there do exist distinct sets with , and . Note that, with this slightly generalized definition, it is possible that a triplet in is neither consistent nor inconsistent with i.e. when contains two or more leaves that lie in the same block of .

We are now in the position to briefly outline the second phase of our heuristic. Suppose is a triplet set and is a partition of . Then if the cardinality of is moderate (by default: at most 12) we compute an exact optimal solution in exponential time. This exact algorithm is described in Section 5.1. If the cardinality of  is too big to compute an exact optimal solution, we use the greedy algorithm described in Section 5.2. Note that, although stated in terms of a triplet set and a partition of (i.e. the partition chosen by the previous step of Lev1athan), both algorithms can also be used for general leaf- and triplet sets: for an arbitrary set of triplets the partition of consisting of only singletons can be taken as .

5.1. Exact Algorithm

Van Iersel et al. [31] proposed an exponential-time exact algorithm to find a level-1 network consistent with a maximum number of triplets of a given set in  time where and . This section describes how this running time can be improved to if an algorithm searches only for simple level-1 networks. To describe such an algorithm we require two more concepts that concern phylogenetic trees. A phylogenetic tree is said to be a caterpillar if the parents of the leaves form a directed path. We say that a caterpillar  ends in leaf if the sibling of is also a leaf of (each caterpillar thus ends in exactly two leaves). For example the phylogenetic tree depicted in Figure 3(a) is a caterpillar that ends in leaves and .

Our algorithm, presented in the form of pseudo-code in Algorithm 2, consists of 2 steps and takes as input a triplet set and a partition of the leaf set of . It returns an optimal (in a well defined sense) simple level-1 network on . It works by essentially trying each set as the leaf below the reticulation of the simple level-1 network to be constructed. More precisely and with and as above, the first of its 2 steps is as follows. For each set we do the following. For each subset  with , we first compute an optimal caterpillar  ending in  in the sense that the number of triplets in consistent with the caterpillar is maximal. To achieve this, note that the caterpillar with consists of a root and 2 distinct leaves each of which is the head of an arc starting at the root (lines 2-4). To find the other caterpillars, we loop through all subsets with from small to large, starting at the subsets of size three. For each such subset of , we loop through all elements and create a candidate caterpillar by creating a new root and two arcs leaving the root: one to  and one to the root of  (lines 5-9). Among all candidate caterpillars, we then select a caterpillar that is consistent with a maximum number of triplets in .

Once all the caterpillars have been constructed, an optimal simple level-1 network is found in the second step as follows. We loop through all bipartitions  of . First suppose that  and  are both nonempty. Then we combine the caterpillars  and  into a candidate simple level-1 network  as follows. Let  be the arc entering  in , let  be the arc entering  in  and let  denote the roots of  and  respectively. We add a new root  and a new vertex  and replace arcs  and  by new arcs  and  (lines 10-14). See Figure 4(a) for an example of this construction.

Figure 4. (a) Construction of  from and  if . (b) Construction of  from  if .

Now suppose that . Let again  be the arc entering  in  and let  be the root of . We add a new root  and a new reticulation  and replace arc  by arcs  and . This leads to the candidate network . See Figure 4(b) for an example of the construction in this case. Finally, we select a network consistent with a maximum number of input triplets, over all constructed candidates and return that network.

0:  : Triplet set and a partition of the leaf set of .
0:  : Simple level-1 network with leaf set consistent with a maximum number of triplets in .
1:  for  do
2:     for  do
4:         is the caterpillar consisting of a root, the vertices and , and 2 arcs both starting at the root and one ending in and the other in .
5:     for  do
6:        for each with  and  do
7:           for  do
8:               is the caterpillar consisting of the caterpillar , the leaf , a new vertex (the root of ), and two new arcs both starting at that vertex and one ending in   and the other in the root of 
9:            is the caterpillar that is consistent with a maximum number of input triplets over all 
10:     for all bipartitions  of  do
11:        if  then
12:           combine caterpillars  and  into a candidate network  as in Figure 4(a).
13:        if  then
14:           transform caterpillar  into a candidate network  as in Figure 4(b).
15:  return a network  that is consistent with a maximum number of input triplets, over all constructed candidates 
Algorithm 2 Exact Simple Level-1 Network Construction

A straightforward analysis of the above algorithm implies the following result.

Theorem 1.

Given a triplet set  with and , a simple level-1 network consistent with a maximum number of triplets from  can be constructed in  time.

We now turn our attention to presenting our greedy algorithm which follows the same philosophy as the previous algorithm i.e. try each set in as the leaf below the reticulation of a simple level-1 network to be constructed.

5.2. Greedy Algorithm

With and as above, we next give the details of our greedy algorithm, which we present in the form of pseudo-code in Algorithm 3. In the context of this, it should be noted that a similar strategy was shown to perform particularly well in an algorithm by Snir and Rao for the construction of phylogenetic trees from triplets [25].

For each set , we first create an initial network  with three vertices  and  and three arcs  and  (lines 2-4), where the arc occurs twice. To this network we then add the other elements from  in a greedy fashion and each time renaming the resulting network . To decide which element to insert first, and where to insert it i.e. into which non-trivial arc  of , we use a score function . To present this score function suppose that and are vertices of such that is a non-trivial arc of and that . Let  denote the network obtained from  by removing arc  and adding two vertices  and  and three arcs  and  to . Then the score is equal to the number of triplets with that are consistent with  minus the number of triplets with that are inconsistent with . In other words,

Note that the definition of does not consider triplets for which contains at least one element in that has not yet been added to the network. Also, the definition does not consider triplets that are neither consistent nor inconsistent with . (This is because the role of such triplets in the final network is entirely determined by the choice of and choices made in later recursive steps).

Suppose and are such that is maximized. Then we construct a simple level-1 network  and insert the remaining leaves into this network by the same method (lines 5-8). Finally and by searching through all constructed simple level-1 networks we return that the network that maximizes the number of triplets in it is consistent with (lines 9-10).

The algorithm can thus be summarised as follows.

0:  : Triplet set and a partition of the leaf set of .
0:  : Simple level-1 network on (that heuristically optimises the number of triplets consistent with ).
1:  for  do
5:     while  do
6:        compute  for each  and each nontrivial arc  of 
7:        find  maximising 
9:     let  be the number of triplets from  consistent with 
10:  let  maximise  over all .
11:  return
Algorithm 3 Greedy Simple Level-1 Network Construction

6. Experiments

To better understand the behavior of Lev1athan, we performed a biologically-motivated simulation study using triplet sets consistent with level-1 networks of different size and complexity and various amounts of missing data (experiment 1) and of noise (experiment 2). To ensure not only variability of the input triplet sets for Lev1athan but also consistency with a level-1 network, we used a novel algorithm for random level-1 network generation. After giving a general outline of our simulation study in the next section, we describe this algorithm in Section 6.2. To model missing data and noise, rather then using the triplet set induced by such a network we used a triplet set as input for Lev1athan where is a parameter that governs the amount of missing data/noise in . Details on the precise definition of will be given in the next section. Using a range of measures which we describe in Section 6.3 we present the outcomes of our study in Section 6.3.1 in case of missing data and in Section 6.3.2 in case of noise.

6.1. General outline of our simulation study

Our simulation study consists of two experiments each of which is motivated by a situation one might encounter in a triplet based phylogenetic network analysis. The full results of (and inputs to) both experiments are available in the supplementary data section of [12]. The first experiment (Section 6.3.1 - missing data) deals with the situation that only some of the triplets induced by some unknown level-1 network are given. This phenomena is modeled by setting to be a randomly selected subset of . The second experiment (Section 6.3.2 - noise) addresses the situation that the confidence level one might have in the triplets generated in a phylogenetic analysis might vary. In our experiment this is modeled by adding noise to . Put differently, we essentially construct by randomly selecting triplets from and for each such selected triplet randomly “flipping” it to one of the two other possible triplet topologies on .

For both simulation experiments, the general outline is as follows. We first choose some level-1 network as the “seed” for our random level-1 generator algorithm. For the generated level-1 network , we then constructed the triplet set from and use as input for Lev1athan. The level-1 network generated by Lev1athan we then compared against with regards to topology and also the four measures described in Section 6.3. In each experiment we used a total of 110 randomly generated level-1 networks with leaf set size ranging between and and number of reticulations ranging between and . The running time of Lev1athan on a standard desktop computer ranged from 2-3 seconds for the simplest networks to 30 seconds for the most complex ones.

6.2. Generating random level-1 networks

A survey of the literature suggested the NetGen [19] program to be the only available approach for systematically generating networks. Whilst NetGen addresses the issues of size (i.e. number of vertices) and network complexity one would encounter when manually constructing networks, one of its main drawbacks is the lack of guarantee that the generated network is level-1 (although some happen to be level-1 networks). One way to overcome this problem is to hand pick suitable networks from a (relatively small) subset of generated networks. Alternatively the list of networks generated by NetGen could be post-processed by manually removing a sufficient number of reticulations from each network in to obtain a level-1 network. A further and maybe more important drawback of NetGen is that the structure of a gall in a generated network is rather simple in the sense that its size is 4. We therefore developed our own algorithm for generating level-1 networks. This algorithm is implemented in Java and freely available for download from [12]. We next describe this algorithm.

Our algorithm for generating level-1 networks takes as input a level-1 network on a fixed number of leaves plus a positive integer and outputs a level-1 network on a larger leaf set. A pseudo-code form of the algorithm is presented in Algorithm 4.

0:  : A simple level-1 network and a positive integer .
0:  : Level-1 network
2:   empty graph
3:  for  do
4:      generate instance of
6:  for all  do
7:     relabel the leaves in such that no two networks in have the same leaf set,
8:     randomly choose some integer in
9:     randomly delete vertices from
11:  for  do
12:     randomly choose a leaf of and replace it with the network
13:   number of leaves in
14:  randomly choose some integer in
15:  randomly delete leaves and cherries from where, as usual, a cherry of a network is a set of leaves and of such that the parent of is also the parent of .
16:  return
Algorithm 4 Level-1 network generator

We remark that the size of the gall in clearly influences the variability of networks generated by this algorithm. Also, we remark that for the purpose of the discussed simulation study was the level-1 network depicted in Figure 5(a).

Figure 5. The figure illustrates the generation of a level-1 network via our level-1 network generator algorithm (c. f. Algorithm 4). The initial network is depicted in (a). In the discussed example, for the network pictured in (a). The networks presented in (b) and (c) are obtained by randomly removing vertices from two instances of the network in (a). The network depicted in (d) is the network generated by replacing leaf in the network pictured in (b) with the network shown in (c).

Algorithm 4 starts by generating instances of and storing them in set (lines 3 – 5). Next (lines 6-9), for each we first randomly chose an integer with and then randomly delete vertices , from making sure that no deleted vertex is the root or a leaf of . If is a reticulation then we randomly choose one of the parents of , say , and add a new arc from to the unique child of . For any other deleted vertex we reconnect its unique parent with one of its children. The child to be reconnected is selected randomly, but a child that is not a leaf is always preferred over a child that is a leaf. We initialize our output level-1 network with the network (line 10) and then iterate over , where . At each iteration we randomly select a leaf from network and replace it by (line 12) yielding a new level-1 network . Finally, we randomly remove a random number of leaves and cherries from (lines 13-15) and then return the resulting network which we also call . We conclude the description of Algorithm 4 by making the trivial observation that the size of depends on the number of .

To illustrate the inner workings of the level-1 network generator algorithm suppose is the level-1 network with leaf set depicted in Figure 5(a) with and suppose that . Then we first generate instances of that network. The deletion of vertices from each of that networks as described in line 9 of Algorithm 4 results in the networks depicted in Figures 5(b) and 5(c). Next, leaf in the network depicted in Figure 5(b) is replaced with the network pictured in Figure 5(c). The resulting network is depicted in Figure 5(d). Note that due to the small number of leaves of subNet, the operation of removing random leaves and cherries from that network (lines 13-15) is not executed.

6.3. Measures

Reflecting the fact that our simulation study is aimed at assessing the reconstructive power of Lev1athan by measuring the similarity between a level-1 network and the level-1 network generated by Lev1athan when given , we considered 4 different measures. To be able to describe these measures, we need to fix some notation first. For the remainder of this section and unless stated otherwise, let be a level-1 network and let denote the network that Lev1athan generates when given as input.

The first measure is the network-network triplet consistency measure. For and we define the network-network triplet consistency measure as the quantity

Loosely speaking, is the proportion of “definitely correct” triplets (i.e. ) that are consistent with . Thus is always a real number between and . A variant of this, the triplet-network triplet consistency measure , pays less heed to the origin or accuracy of the input triplets and is defined for an arbitrary triplet set and a phylogenetic network by putting

In other words, is the fraction of triplets in that is consistent with the network . Note that this measure is different from the triplets distance introduced in [6].

The third of our triplet based measures is inspired by the quartet distance for unrooted phylogenetic trees [3] and is called the network-network symmetric difference. For and the network- network symmetric difference between and is defined as the quantity

is thus the number of triplets that appear in but not in , or vice-versa. Note that in this measure and are compared with regards to their induced triplet sets, while was generated in response to the set derived from . As already noted above for the network-network triplet consistency measure, the network-network symmetric difference measure also suggests a natural variant of it, , defined for an arbitrary triplet set and a network by putting .

Regarding the and measures it should be noted that the former might be more powerful than the latter. To see why consider again the example of the triplet set induced by the caterpillar on depicted in Figure 3(a). As already pointed out earlier, the level-1 network on depicted in Figure 3(b) is also consistent with and no arc of can be deleted from so that consistency with is preserved. If we take , then (in the context of network-network triplet consistency) and are equally good level-1 networks for representing since . However, under the network-network symmetric difference measure would be preferable over , because of precisely that proper subset relationship.

The final measure we considered is the -distance which was originally introduced in [7]. To define this measure which was shown in [5, 7] to be a metric for a certain class of networks (i.e. tree-child networks), which includes the class of level-1 networks as a subclass, we require some more notation. Suppose is a phylogenetic network on some set , , and is a vertex of . Then the vector can be associated to where for all the quantity represents the number of different paths in from to leaf . With denoting by the multiset of all vectors , a vertex of , and calling the -representation of , the -distance between any two phylogenetic networks and is defined as

where the symmetric difference operator is defined here over multisets.

Armed with these measures and our algorithm for randomly generating level-1 networks, we are now ready to describe the results of our simulation study. (We note that, because the source networks used in the simulation had no vertices of outdegree 3 or higher, and for technical reasons, the optional post-processing phase used by Lev1athan was switched off during these simulations). Assume then from now on that is a level-1 network generated by our random level-1 network generator described in Algorithm 4 and that the definition of the network is as before. We start with presenting our results for the missing data experiment.

6.3.1. Simulation results - missing data experiment

Central to this experiment is the parameter which indicates the probability that a triplet in will be included in . The values of that we investigated were (i.e. all triplets in ), .

Modulo a well-understood exception (described below) all networks generated via Algorithm 4 were recovered correctly by Lev1athan when . This is a consequence of the fact that Lev1athan prioritises JNS moves and that a level-1 network is completely defined by up to, but not including, galls of size 4 [10]. This is a drawback of any triplet based phylogenetic network approach since such approaches have to make a choice between the three galls on a set {} that are all consistent with . This inability to distinguish between the topologies of size-4 galls does not, of course, affect the triplet-based measures, and for we had indeed in all cases that and . When Lev1athan correctly guessed the topologies of all size-4 galls in a network we additionally had , but this value became non-zero when the guess was incorrect.

For all other values of and all networks , we frequently observed that - although similar when inspected visually - some of the galls from were not reconstructed correctly in the generated level-1 network , in the sense that the size of a gall in was smaller than in the corresponding gall in (e.g. a size 5 gall in became size 4 gall in , see Figure 6). Although this observation is clearly dependent on the specific value used for , it generally meant that one or more vertices had been pushed out of a gall in to its sides, causing what we will call typical gall damage for . In turn this means that subnetworks hanging from galls are sometimes merged by Lev1athan into a single subnetwork. Even in the presence of typical gall damage, however, Lev1athan sometimes (but not always) correctly inferred which leaf of needed to be placed below the reticulation of the damaged gall. Interestingly, we also observed that typical gall damage was rare for galls - even for low vales of - if the subnetworks hanging from contain many leaves. Expressed differently, the likelihood that a gall in suffers typical gall damage for is higher if the gall is closer to the leaves. The reason for this might be that is supported by far fewer triplets then a gall closer to the root.

It should be noted that the exception to the former of the last two observations is the galled caterpillar network [4, 16] which can be thought of as a natural level-1 generalization of a caterpillar. Such networks were correctly reconstructed by Lev1athan for .

Figure 6. The figure illustrates typical gall damage. (a) The level-1 network on with a gall of size 5. (b) The level-1 network on the same set with the erroneous reconstruction of in terms of the gall ’.

In addition, we observed for all networks that when dropped from 1.0 to 0.9, the network-network triplet consistency between and also drops slightly (often from 1 to a value in the range 0.95 - 0.99) but immediately stabilises around that value, even for very low values of . This phenomenon even occurs when there are very few or even no Aho or JNS moves occurring, suggesting that Heuristic moves are (in this context) good for sustaining a very high value of triplet consistency. However, and for all values of other than and all networks , we also observed that even when is very close to 1, is often not recovered correctly by Lev1athan from . An example that illustrates this point is the triplet set of the level-1 network depicted in Figure 6(b). This triplet set together with the triplet is the triplet set induced by the level-1 network depicted in Figure 6(a). The network generated by Lev1athan from is the level-1 network and . Thus, rather than being a suitable tool for assessing Lev1athan’s reconstructive power, the measure might be of limited use for capturing differences between networks.

For all values of and averaged over all networks for , we observed an initial sharp rise for both the network-network symmetric difference and the -distance when drops from to . Again in both cases this initial rise is followed by a a much slower growth rate although this rate seems to be higher for the -distance. Encouragingly, we found instances of networks where for Lev1athan correctly inferred the missing triplet information, that is, correctly reconstructed from .

6.3.2. Simulation results - noise experiment

Central to this experiment is again the parameter which in this case is an error probability and which we state in terms of values between and . Its purpose is to introduce noise into and the range we considered was and . More precisely, we generated from by taking each triplet and with probability decide to corrupt it, that is, replace in with one of the 2 other phylogenetic trees on , chosen uniformly at random.

We observed that even for very low error probability values, the triplet set is unlikely to be consistent with a level-1 network. It is thus not surprising that we often observed additional (erroneous) reticulations in networks reconstructed by Lev1athan from . Nevertheless and based on visual inspection, our experiment seems to indicate that even for slightly higher values of , i. e  , and all networks , much of the structure of is recovered correctly by Lev1athan from . Having said that, and in addition to the above mentioned erroneous reticulations, typical gall damage is common in the generated networks.

For all other than 0.00 and averaged over all networks for , we observed that, as expected, decreased linearly with increasing error probability in the sense that holds. Reassuringly (and by no means obviously) this almost linear relationship is not obeyed by the network-network triplet consistency measure . By averaging for each error probability value over all networks , we summarize our findings for that measure in Figure 7 in terms of plotting the error probability values for against the corresponding network-network triplet consistency values. Intriguingly, there is an initial 2% drop after which network-network triplet consistency remains high until error rates reach values of . It should be noted that this in fact corresponds to a high level of noise in given that if is a phylogenetic tree and then is a completely randomized triplet set and all structural information contained in concerning has been lost. The most likely explanation for the initial 2% drop is probably (again) typical gall damage since, as alluded to above, a given gall represents more triplets then its a gall-damaged counterpart. The fact that after the initial drop network-network triplet consistency remains high, is encouraging, because it shows that Lev1athan holds promise for reconstructing triplets that have not been corrupted by noise, up to quite a high level of noise.

Figure 7. In the noise experiment the measure experiences an initial sharp drop of about two percent, after which it remains relatively stable until the error probability becomes large (bigger than 0.3).

For all values of and averaged over all networks for , we also observed an initial sharp increase when increases form to for both the network-network symmetric difference and the -difference. After this, the values for both measure grow very slowly with the growth rate for the -distance being higher until reaches when the growth rate of the network-network symmetric difference takes over. Intriguingly the network-network symmetric difference seems to reach a peak at and then drops of again for . However and in the light of the fact that, as pointed out above, is already a high error probability the later 2 observations should be treated with caution. A possible reason for the above mentioned initial sharp rise might be that even one corrupted triplet can potentially introduce erroneous (additional) galls in . Combined with the problem of typical gall damage, this can greatly affect the triplet sets induced by and as well as their -representations and thus the network-network symmetric difference and the -distance which rely on these concepts respectively.

Interestingly we identified some networks such that when (and additionally ) we nevertheless had that . Visual inspection revealed that in such cases was equal to . This suggests that if the noise level in an input triplet set is small enough Lev1athan might still be able to correctly reconstruct the level-1 network that induced that triplet set.

6.4. A HIV-1 virus data set

To illustrate the applicability of our approach, we re-analyzed a HIV-1 virus data set that originally appeared in [21, Chapter 14]. All but one of the sequences (KAL153) making up that data set are 50 percent consensus sequences for the HIV-1 M-group subtypes , , and with the KAL153 strain being thought to be a recombinant of subtypes and . Recombinants such as KAL153 can essentially be thought of as having arisen via the transfer and integration of genetic material from one evolutionary lineage into another. The positions in an existing sequence where the foreign genetic material was integrated are generally called breakpoints and many approaches for detecting recombination aim to identify these breakpoints.

For the above data set two breakpoints were identified (positions 2700 and 8926) in [21, Chapter 14]. Furthermore, for the three induced sub-alignments 1-2699, 2700- 8925, and 8926-9953 the Neighbor Joining method [20] (with subtype as outgroup and the K2P model [18]) was used to represent the data in terms of arc-weighted phylogenetic trees (see [21, page 159] for a depiction of those trees). Since the resolution patterns for and in the first tree, and in the second, and and in the third tree was not clear, we recomputed those trees using the above settings. Reassuringly and when arc weights were ignored, this resulted in the same phylogenetic trees for the first and third sub-alignment as in the previous analysis except that the unresolved vertex in each tree was now resolved. For the second sub-alignment, the same tree was obtained. For the convenience of the reader, we depict these phylogenetic trees in Figure 8.

Figure 8. The phylogenetic trees for the three sub-alignments with the outgroup omitted in each case. (a) sub-alignment 1-2699, (b) sub-alignment 2700- 8925, and (c) sub-alignment 8926-9953.

As expected, the position of the KAL153 strain is the same in the first and third topology but different in the second topology. However and somewhat surprisingly the resolution order of subtypes and is in the first and third tree is different as well as the placement of subtype in the tree. Having said this, the branch weights that support these different resolution orders are very small. To therefore not allow this to overly influence our analysis (after all our method as well as the other two methods that we tried out are using phylogenetic trees rather than arc-weighted phylogenetic trees as input and therefore these different levels of support are not taken into account), we only used the first and the second and the third and second tree as respective inputs for our analysis.

Interestingly the second phylogenetic tree postulates the triplet on subtypes , and whereas the first tree hypotheses . Since this conflict also interferes with the conflicting information for subtypes , and KAL153, attempting to combine the triplet set generated from both phylogenetic trees into a general level-1 network is problematic as such a network would postulate 2 recombination events for this data set. To avoid this and at the same time identify the stronger of the two conflicting signals it is therefore more useful to construct (using Algorithm 2) an optimal simple level-1 network, which by definition has only one reticulation vertex. This type of network is depicted for the first and second tree from Figure 8 in Figure 9. As expected, the network correctly identifies the strain as a recombinant of subtypes and .

Figure 9. The optimal simple level-1 network constructed by Lev1athan for the first and second tree in Figure 8. For each taxa the percentage value denotes the percentage of input triplets containing that the network is consistent with. This can be interpreted as how “satisfied” the taxa in question is with its location in the network. The value on each arc is the deletion support value for that arc i.e. the number of input triplets that would cease to be consistent with the generated network if that arc was deleted.

It should be noted that repeating this analysis for the second and third tree in Figure 8. resulted in the same simple level-1 network as the one depicted in Figure 9 except that the order of and was reversed. The same holds when the optimal simple level-1 network is computed for the respective original phylogenetic trees from [21, Chapter 14] when ignoring arc-weights. Interestingly and in case of the second and the third tree, exclusion of subgroup also resulted in a simple level-1 network that correctly identified the KAL153 as recombinant. However this was not the case when this analysis was repeated for the first and second tree.

We conclude this section with remarking that although using different philosophies, the other two known approaches i.e. Cluster networks and Galled networks (both of which are implemented in Dendroscope [14]) also had problems with this data set, postulating between 2 and 4 recombination events (data not shown).

7. Conclusions

In this paper, we have presented a heuristic for constructing level-1 networks from triplet sets which we have also implemented in Java as the freely available program Lev1athan [12]. Guided by the principle of triplet consistency, our heuristic aims to optimize a well-defined objective function on triplet sets without generating pessimistically complex networks for them. By running in polynomial time and always returning a level-1 network, it addresses several of the problems that frequently occur with existing network algorithms from such input sets. Using both a biologically motivated simulation study and a biological data set, we have also explored Lev1athan’s applicability to real biological data.

Based on the outcomes of our simulation study, it appears that our heuristic is able to recover, in terms of the triplet set induced by the generated level-1 network, a high percentage of the input triplets that were also present in the original network (as opposed to triplets that were missing or that had been corrupted). This is nota bene also true when the input triplet set is not dense, which (in light of the NP-hardness of the non-dense case) is an encouraging observation. On the other hand (and probably not surprisingly as our heuristic tries to reconstruct a global structure from local information) it seems that it is more vulnerable to noise in an input triplet set than missing data. The most probable reason for this is that the former type of problem can sometimes be rectified via implicitly inferred triplet information, whereas the latter type of problem has the capacity to actively mislead. Having said that, Lev1athan shows encouraging potential if the amount of missing data is low or there is only very little noise in the data.

In general using more of the triplets induced by a network as input for Lev1athan allows larger regions of that network to be recovered by it. Having said that, when confronted with missing data or noise, even extremely high values of network-network triplet consistency (e.g. above 0.95) do not preclude non-trivial differences between the original network and the network generated by Lev1athan. Additionally, the noisier a input triplet set for our heuristic is, the greater the chance that the network found by it is distorted (e.g. through the emergence of surplus galls in the generated network). To tackle this problem Lev1athan has the option to label each arc of the generated network with its deletion support value, see Figure 9. This allows the user to explore which reticulations are weakly supported, and thus might be superfluous or even artefacts of our heuristic.

Although the four triplet-based measures used in this article appear to be very natural they only seem to be of limited use for capturing network differences in general. However some of them helped to identify cases where a network generated by Lev1athan from coincided with the original network .

Our re-analysis of a biological data set from [21] using Lev1athan indicated that this data set is more complex than it appears at first sight. The resulting conflicting triplet infomation misled Lev1athan (and also the other two network construction approaches that we tried) to postulate a too complex evolutionary scenario when using its default option of generating a level-1 network. However Lev1athan’s simple level-1 network option had no problem with this data set and was able to correctly reconstruct the data set’s widely accepted evolutionary scenario.

To understand better how well Lev1athan performs, it will be necessary to compare it directly with an alternative method for network construction that uses similar input and also produces a level-1 network. Such methods are lacking at the moment. Similarly, it is at the moment difficult to draw conclusions regarding the biological meaning of measures such as the -distance, which we used in our simulation study. Comparison should also be made between Lev1athan and other programs when the input is strictly simpler, or strictly more complex, than level-1 networks. Figures 10 and 11 already provide interesting insights. Figure 10 shows a level-2 network created by the Level2 algorithm of [28] which is consistent with all 1330 triplets in the yeast dataset discussed in that same article. For the same dataset Lev1athan constructs the level-1 network in Figure 11; this is consistent with of the 1330 triplets. Both networks cluster the taxa together similarly; the main difference is in the Lev1athan network taxa 12 and 5 have been pushed further away from the root, whilst taxon 8 has risen closer to the root. However which one of these two networks is biologically more relevant is not immediately clear. In any case it should once again be noted that Lev1athan is in many regards much more flexible than the Level2 algorithm (and related algorithms such as [26, 29, 30]) because Lev1athan always quickly returns a network and it does not require the input triplets to be dense or fully consistent with the output network. (In particular, the authors of [28] had to use time-consuming brute-force techniques to first find a subset of the taxa that induced a triplet set fully consistent with a level-2 network).

It is also necessary to look more deeply at the underlying mathematics of Lev1athan. For example, the partitioning strategy at its heart (i.e. Phase 1) is a modification of a strategy that was originally optimized for worst-case performance, not average-case performance. Yet this strategy also seems to perform surprisingly well in the average case. Understanding why this is, and developing a better appreciation for the theoretical limits and strengths of triplet methods as a mechanism for reconstructing network topologies, remains a fascinating area of research.

Figure 10. A level-2 network on {} constructed by the LEVEL2 algorithm [28] for a blinded yeast dataset also described in that paper. This network is consistent with all 1330 triplets in the dataset.
Figure 11. The level-1 network constructed by Lev1athan when given the same dataset as discussed in Figure 10. This network is consistent with of the 1330 input triplets.

8. Acknowledgements

We thank Vincent Moulton for many very helpful conversations during the writing of this paper. Leo van Iersel was funded by the Allan Wilson Centre for Molecular Ecology and Evolution and Steven Kelk by a Computational Life Sciences grant of The Netherlands Organisation for Scientic Research (NWO).


  • [1] A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.
  • [2] O. Bininda-Emonds, editor. Phylogenetic Supertrees: Combining information to reveal the Tree of Life. Kluwer Academic Publishers, 2004.
  • [3] D. Bryant, J. Tsang, P. E. Kearney, and M. Li. Computing the quartet distance between evolutionary trees. In SODA, pages 285–286, 2000.
  • [4] J. Byrka, P. Gawrychowski, K. T. Huber, and S. Kelk. Worst-case optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks. Journal of Discrete Algorithms, 2009. To appear.
  • [5] G. Cardona, M. Llabrés, F. Rosselló, and G. Valiente. Metrics for phylogenetic networks I: Generalizations of the robinson-foulds metric. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(1):46–61, 2009.
  • [6] G. Cardona, M. Llabrés, F. Rosselló, and G. Valiente. Metrics for phylogenetic networks II: Nodal and triplets metrics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(3):454–469, 2009.
  • [7] G. Cardona, F. Rosselló, and G. Valiente. Comparison of tree-child phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. To appear, http://arxiv.org/abs/0708.3499.
  • [8] J. Felsenstein, J. Archie, W. H. Day, W. Maddison, C. Meacham, F. J. Rohlf, and D. Swofford. The newick tree format, 1986.
  • [9] P. Gambette. Who’s who in phylogenetic networks, 2009. http://www.lirmm.fr/~gambette/PhylogeneticNetworks/.
  • [10] P. Gambette and K. T. Huber. A note on encodings of phylogenetic networks of bounded level. Technical Report arXiv:0906.4324, June 2009.
  • [11] E. Gansner, E. Koutsofios, and S. North. Drawing graphs with dot. Technical report, AT&T Bell Laboratories, 2006.
  • [12] K. T. Huber, L. van Iersel, S. Kelk, and R. Suchecki. LEV1ATHAN: A level-1 heuristic, September 2009. http://homepages.cwi.nl/~kelk/lev1athan/.
  • [13] D. Huson. Reconstructing Evolution - New Mathematical and Computational Advances, chapter Split Networks and reticulate networks. Oxford University Press, 2007.
  • [14] D. Huson, D. C. Richter, C. Rausch, M. Franz, and R. Rupp. Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics, 8(1):460, 2007.
  • [15] D. H. Huson and R. Rupp. Phylogenetic Networks. Cambridge University Press. To appear.
  • [16] J. Jansson, N. B. Nguyen, and W.-K. Sung. Algorithms for combining rooted triplets into a galled hylogenetic network. SIAM Journal on Computing, 35(5):1098–1121, 2006.
  • [17] J. Jansson and W.-K. Sung. Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science, 363(1):60–68, 2006.
  • [18] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111–120, June 1980.
  • [19] M. M. Morin and B. M. E. Moret. Netgen: generating phylogenetic networks with diploid hybrids. Bioinformatics, 22(15):1921–1923, 2006.
  • [20] M. Nei and S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, 2000.
  • [21] M. Salemi and V. A. M. (ed). The phylogenetic handbook. A practical approach to DNA and protein phylogeny. Cambridge University Press, UK, 2003.
  • [22] H. A. Schmidt, K. Strimmer, M. Vingron, and A. von Haeseler. Tree-puzzle: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics, 18(3):502–504, 2002.
  • [23] C. Semple. Reconstructing Evolution - New Mathematical and Computational Advances, chapter Hybridization Networks. Oxford University Press, 2007.
  • [24] C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, 2003.
  • [25] S. Snir and S. Rao. Using max cut to enhance rooted trees consistency. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(4):323–333, 2006.
  • [26] T.-H. To and M. Habib. Level-k phylogenetic networks are constructable from a dense triplet set in polynomial time, 2009. http://arxiv.org/abs/0901.1657.
  • [27] G. Valiente. Extended newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics, 9:532+, December 2008.
  • [28] L. J. J. van Iersel, J. C. M. Keijsper, S. M. Kelk, L. Stougie, F. Hagen, and T. Boekhout. Constructing level-2 phylogenetic networks from triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009. To appear.
  • [29] L. J. J. van Iersel and S. M. Kelk. SIMPLISTIC: Simple network heuristic, 2008. http://homepages.cwi.nl/~kelk/simplistic.html.
  • [30] L. J. J. van Iersel and S. M. Kelk. Constructing the simplest possible phylogenetic network from triplets. Algorithmica, 2009. To appear.
  • [31] L. J. J. van Iersel, S. M. Kelk, and M. Mnich. Uniqueness, intractability and exact algorithms: Reflections on level-k phylogenetic networks. Journal of Bioinformatics and Computational Biology, 7(2):597–623, 2009.
  • [32] B. Y. Wu. Constructing the maximum consensus tree from rooted triples. Journal of Combinatorial Optimization, 8(1):29–39, 2004.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description