A Practical Algorithm for Reconstructing Level1 Phylogenetic Networks
Abstract.
Recently much attention has been devoted to the construction of phylogenetic networks which generalize phylogenetic trees in order to accommodate complex evolutionary processes. Here we present an efficient, practical algorithm for reconstructing phylogenetic networks  a type of network slightly more general than a phylogenetic tree  from triplets. Our algorithm has been made publicly available as the program Lev1athan. It combines ideas from several known theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines. Namely, an exponentialtime exact and a greedy algorithm both of which are of independent theoretical interest. Most importantly, Lev1athan runs in polynomial time and always constructs a level1 network. If the data is consistent with a phylogenetic tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent with some network, it will find such a network. The potential of Lev1athan is explored by means of an extensive simulation study and a biological data set. One of our conclusions is that Lev1athan is able to construct networks consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of noise.
1. Introduction
Phylogenetic networks such as the one depicted in Figure 1 provide a natural and powerful extension of the concept of a phylogenetic tree (see Section 2 for precise definitions of these two concepts as well as the other concepts used in this paper) to accommodate complex evolutionary processes such as hybridization, recombination or horizontal gene transfer. Consequently, their attractiveness to evolutionary biology as a model for representing the evolutionary past of a set of taxa (e.g. species represented by gene or genetic marker sequences) whose evolution might have been driven by such processes has grown over the years. This in turn has generated much interest in these structures from mathematicians and computer scientists working in phylogenetics [9, 13, 15, 23].
The desire of biologists to use everlonger sequences, combined with
the computational complexities involved in dealing with such sequences,
has meant that much research in mathematical methodology
and algorithm development has focused on developing indirect
methods for phylogenetic reconstruction. This has spawned the thriving area of supertree construction
[2] which aims to reconstruct a phylogenetic tree by puzzling it together
from smaller trees. Inspired by this  and guided by
the fundamental observations
that (i) a phylogenetic tree is uniquely described by the
set of triplets (i.e. rooted binary phylogenetic trees on three leaves) it induces (see e.g. [24]) and (ii) that
a phylogenetic network is a generalization of a phylogenetic tree,
two main tripletbased approaches for phylogenetic network reconstruction
immediately suggest themselves. One is to essentially first employ a
method such as TreePuzzle (see e.g. [22]) to generate a set of
quartets
For any set of triplets, a phylogenetic network that, in a welldefined sense, is consistent with all triplets in can easily be constructed if the complexity of the network is essentially unbounded [17, 31]. Such networks are however only of marginal biological relevance. Researchers have therefore turned their attention to studying restricted classes of phylogenetic networks. One such class is that of a network or a galled tree i.e. a phylogenetic network in which essentially any two cycles are vertex disjoint (see Figure 1 for an example). Such networks are of practical interest because (i) they are relatively treelike, and (ii) their simple structure suggests the possible existence of fast algorithms to construct them. Indeed, Jansson, Sung and Nguyen showed that it can be decided in polynomial time whether there exists a network with leaf set consistent with all input triplets [16, 17], if the input triplet set is dense, i.e. if a triplet is given for each combination of three taxa in . Their algorithm will also construct such a network, if it exists. However, a network consistent with all input triplets might not exist for several reasons. Firstly, the real evolutionary history might be too complex to be described by a network. Secondly, some of the input triplets might be incorrect (which is likely to be the case in practice).
One response to this problem has been to increase the complexity of the networks that can be modelled. For example, it has been shown that, for each fixed nonnegative integer , the problem of constructing a level network consistent with a dense set of input triplets is polynomialtime solvable [26, 28]. The higher the level, the higher the complexity of evolutionary scenarios that can be represented. However, the running time grows exponentially with and initial experiments with the related heuristic Simplistic [30, 29] show that, since these algorithms insist on full consistency with the input triplets, only a small amount of noise is required in the input data to artificially inflate the level of the produced network. This causes an undesirable increase in both running time and network complexity.
A second strategy is to place a ceiling on the complexity of the networks that can be constructed and to no longer demand full triplet consistency. Implemented in the program Lev1athan [12], this paper adopts this second strategy and presents the first heuristic algorithm for constructing networks from triplets. Given any set of triplets, our heuristic always constructs a network in polynomial time, which is in practice a great advantage in light of the algorithmic results mentioned above. Moreover, it attempts to construct such that it is consistent with as many of the input triplets as possible. If a weight is given for each triplet reflecting for example some kind of confidence level one might have in that triplet, then our heuristic aims to maximize the total weight of the input triplets consistent with . Optimizing such functions is NPhard [31, 32], even for the case of determining a tree consistent with the maximum number of triplets from an unweighted, dense triplet set. Having said that, an exponentialtime exact algorithm was proposed in [31] that always finds an optimal solution, but this algorithm is only practical for small numbers of taxa. In addition, polynomialtime approximation algorithms have been formulated for level1 network reconstruction. For example it was first shown in [16] how to construct in polynomial time such a network consistent with 5/12 of the input triplets. This was subsequently improved to in [4], which is worstcase optimal. Both algorithms are mathematically interesting, but have the drawback that they produce networks with a highly rigid topology which are biologically unrealistic.
Lev1athan, which we outline in Section 3, combines elements from the above mentioned approaches into an algorithm with a strong recursive element. In addition to its polynomial running time, Lev1athan enjoys the following desirable properties. If a set of input triplets is consistent with a tree, or if at any stage in a recursion such a triplet set occurs, a phylogenetic tree will be generated from . Similarly, whenever the triplet set is dense and fully consistent with some network, such a network will be produced. Both outcomes are a direct consequence of a partitioning strategy that we describe in Section 4. If a network is produced that (when ignoring directions on the arcs) contains cycles of moderate size, then, due to a novel exponentialtime exact algorithm which we describe in Section 5, the topology of each of these cycles is locally optimal. This algorithm is complemented by a novel greedy algorithm to construct larger cycles which we also describe in that section. In addition, the output network is guaranteed to be consistent with at least 5/12 of the input triplets, if one uses the Lev1athan option “blocks 3” which has its origin in the above mentioned partitioning strategy. In Section 6 we test Lev1athan on synthetic and real biological data; to facilitate this we also develop a novel network generation method and discuss several measures for network comparison. Taken as a whole the results of the experiments are promising, suggesting that Lev1athan will be genuinely useful in a realworld context, but do also highlight some limitations of tripletbased approaches and level1 networks as a model of evolution.
We conclude this section with remarking that although Lev1athan can be applied to both weighted and unweighted triplet sets for clarity we will restrict our exposition to unweighted triplet sets.
2. Basic Concepts
We start with some concepts from graph theory concerning directed acyclic graphs. Suppose is a directed acyclic graph, or DAG for short. Then is called connected (also called “weakly connected”) if, when ignoring the directions on the arcs, there is a path between any two vertices of . Moreover, is called biconnected if contains no vertex whose removal disconnects . A biconnected subgraph of a graph is said to be a biconnected component if there is no biconnected subgraph of that contains . A biconnected component is said to be nontrivial if it contains at least three vertices. A vertex of is called an ancestor of a vertex if there is a directed path from to . In this case, is also called a descendant of . Now suppose that and are two distinct vertices of . Then a vertex is a lowest common ancestor of and if is an ancestor of both and and no descendant of is also an ancestor of and .
A (phylogenetic) network on some finite set of taxa is a DAG with a single root (a vertex with indegree 0), whose leaves (vertices with outdegree 0) are bijectively labelled by the elements of . Following common practice, we identify each leaf with its label. Thus, the leaf set of , denoted , is . Vertices with indegree 1 are called split vertices and vertices with indegree at least two are called reticulations. A network is called semibinary if each reticulation has indegree 2 and binary if in addition each reticulation has outdegree 1 and each split vertex outdegree 2. We will restrict our exposition to binary networks, but remark that Lev1athan uses postprocessing to simplify a constructed network by collapsing arcs while making sure that triplet consistency is preserved. This can create vertices with outdegree greater than two.
A semibinary network is said to be a level network, if each biconnected component contains at most reticulations. Clearly, if then this implies that is a phylogenetic tree in the usual sense (see e.g. [24]). It is not difficult to verify that for , a nontrivial biconnected component of a level network always has the form of two directed paths that both start at some vertex and both end at some vertex but which are otherwise vertex disjoint. Such a biconnected component is often called a reticulation cycle or a gall. We define the size of a gall to be the number of vertices in it i.e. the number of outgoing arcs plus one. The network in Figure 2(a) contains a single gall of size 7, for example. Let be a phylogenetic network. An arc of is said to be a cutarc if removing disconnects . A cutarc is called trivial if its head is a leaf. A level network is said to be a simple level network, if contains no nontrivial cutarcs and is not a level network. Thus, a simple network can be thought of as a single gall with some leaves attached to it, as in Figure 2(a). Simple networks are the basic, recursive building blocks of networks, in the sense that each gall of a network essentially corresponds to a simple network.
A (rooted) triplet is a phylogenetic tree on such that the lowest common ancestor of and is a proper descendant of the lowest common ancestor of and , see Figure 2(b). For convenience, we call a set of triplets a triplet set. A triplet is said to be consistent with a network (interchangeably: is consistent with ) if contains a subdivision of , i.e. if contains distinct vertices and and pairwise internally vertexdisjoint paths , , and . For example, the network in Figure 2(a) is consistent with (among others) the triplets , and but is not consistent with (among others) the triplets and . We say that a triplet set is consistent with if all triplets in are consistent with and denote the set of all triplets on consistent with by .
We note that, whenever a network contains a gall of size 3, can be replaced by a single split node to obtain the network where . In such cases we prefer the latter construction to the gall because it is a more parsimonious response to the input data. We henceforth sharpen the above definition of a level1 network to exclude galls of size 3. (Galls of size 2 are already excluded because we do not allow multiple arcs).
3. A brief outline of Lev1athan
In this section with present a rough outline of our 4 phase program Lev1athan which is implemented in Java and freely available for download [12]. It takes as input a set of triplets or, more generally, a set of phylogenetic trees (e.g. gene trees) given in Newick format [8]. In the latter case the (weighted) union of the triplet sets induced by the trees in is taken by Lev1athan as triplet set . It outputs a (possibly postprocessed) level1 network in DOT format [11] and/or eNewick format [27]. The goal of Lev1athan is to construct a level1 network that maximizes the number of triplets in that are consistent with it. In an optional postprocessing phase the generated network can then be simplified by contracting all arcs of where neither nor is a reticulation, is not a leaf, such that the contraction would not cause a triplet in that was consistent with to become not consistent with it. (Leaving such arcs uncontracted is tantamount to an unfounded strengthening of our hypothesis about what the “true” evolutionary scenario looked like). To explain the program’s four phases, we require some more definitions and notation. To this end, suppose is a set of triplets and define the leaf set of to be the set . For any subset , we denote by the set of triplets in such that .
Note that the algorithm does not backtrack in the sense that it never revises earlier made decisions. Also note that while the recursion and merging phases are relatively simple, the partition and gall construction phases are not. For these phases we designed new algorithms, which will be described in the following two sections.
4. Partitioning the Leaf Set
Based on their order of priority, we next describe the three steps that make up the partitioning strategy employed by Lev1athan to find a suitable partitioning of the leaf set of a triplet set. To explain this strategy in detail, assume for the rest of this section that is a set of triplets and put .
The first step of Lev1athan’s partitioning strategy is to determine
whether an Aho move is possible for . This move is based on an
algorithm originally introduced by Aho et. al. in [1].
Following [24] where this algorithm is referred to as
Build algorithm, this algorithm relies on the clustering graph
induced by on any subset . For the
convenience of the reader, we remark that for any subset
the vertex set of
is and any subset forms an edge in
if there exists some such that .
Aho move: Construct the clustering graph
. If
is not connected then use the vertex sets of the
connected components of as the
blocks of the partition . Otherwise, say that the Aho move
is not successful.
Note that the Aho move is essentially the embedding of the Build algorithm inside Lev1athan. Always starting with such a move means that, if a set of triplets is consistent with a phylogenetic tree, then a phylogenetic tree will be output. More generally it means that Lev1athan will potentially produce level1 networks with treelike regions. Also note that if the Aho move is successful then the Gall construction phase of Lev1athan is not necessary: the components of the clustering graph will simply correspond to subnetworks that the algorithms “hangs” from a single split vertex, just like the Build algorithm. (This can lead to split vertices with more than two children, which as already mentioned Lev1athan supports via postprocessing).
Our motivation for prioritising the Aho move is twofold. Firstly, it adheres to the parsimony principle: if the data can be explained equally well by both a phylogenetic tree and a network, choose the tree. As a nontrivial example, note that the triplet set induced by the phylogenetic tree on depicted in Figure 3(a) is also consistent with the level1 network on depicted in Figure 3(b) and that is in some sense minimal with this property, that is, no arc of can be deleted such that the triplets in remain consistent with the resulting phylogenetic tree. Secondly it exploits the not entirely trivial observation that, when attempting to construct a level1 network consistent with the maximum number of triplets in a given set, it can never be suboptimal to make an Aho move, when this is possible [25].
If the Aho move is not successful, the next step is to try a JNS move.
To explain this move we require a further concept which was originally
introduced by Janson, Nguyen and Sung in [16].
Suppose . Then is called an SNset (of )
if there exists no triplet in such
that and .
A SNset of is called maximal if and there does not exist
an SNset such that .
Also, for some partition of
into blocks , , we define the
induced triplet set of as the set
of all triplets on for which there
exists a triplet where and and
, and are all distinct.
Note that is dense if is dense.
JNS move: If (a) the set
of maximal SNsets of is a partition of , and (b)
is a dense set of triplets that is consistent with
some simple level1 network ,
then define to be the sought after partition
(and use as in the Gall construction phase).
Otherwise and as in the case of an Aho move, say that the move is not
successful.
The JNS move is essentially the level1 analogue of the above described Aho move. The main difference is that although, using for example the Build algorithm, it is always possible to decide in polynomial time if a triplet set is consistent with a phylogenetic tree it is in general NPhard to determine whether there is a level1 network that is consistent with the set (even if we restrict to simple level1 networks). However, the situation changes and becomes decidable in polynomial time if the triplet set in question is dense [16]. Density of the induced triplet set is therefore a necessary requirement for a successful JNS move. Hence, requirement (b). To motivate requirement (a), note that for dense triplet sets the set of maximal SNsets of always forms a partition of . For general (i.e. nondense) triplet sets this is not always, but sometimes, true. Requirement (a) is thus included to extend JNS moves to nondense triplet sets.
We conclude the discussion of the JNS move with remarking that this move enjoys the same optimality properties as an Aho move. More precisely, prioritising the JNS move guarantees that a level1 network consistent with all the original triplets will be produced if such a network exists and the original triplet set was dense. (And, less obviously, that making a JNS move can never lead to a suboptimal network, assuming subsequent recursive steps give optimal networks). Again analogous to a successful Aho move, a successful JNS move allows the full generality of the Gall construction phase (see below) to be bypassed by utilizing the already computed simple level1 network .
Due to noise in real biological data (or the inherent complexity of the underlying network) however, it is generally too much to hope for that one of the two moves described so far will always be successful. We therefore have adapted a strategy from [16] to obtain a third move which we call the Heuristic move. The heart of this move is formed by a score function , a partition of the leaf set of , from [16] plus two operations (P1) and (P2) that will allow us to manipulate a partition with the aim of ensuring that the score of the new partition is at least as good as the score of the given partition, i.e. holds. To describe both the score function and these two operations, we require some more concepts. Suppose is a partition of . Then to we associate the following four subsets
Note that clearly forms a partition of . The aforementioned score function is then defined as . Denoting as above a given partition by and the generated partition by then the two aforementioned operations (P1) and (P2) are defined as follows.

If and then .

If with and then .
Armed with these definitions, we are now ready to state the Heuristic
move.
Heuristic move: Starting with
we exhaustively search for an operation
(P1) or (P2) which, when applied to
, will create a partition
with .
If no such exists then
the sought after partition is and
we are done. Otherwise put and repeat.
Note that the Heuristic move is guaranteed to terminate in
polynomial time and that it will never return as the final partition
[16]
5. Constructing Simple Level1 Networks
In this section, we turn our attention to the second phase of Lev1athan which is concerned with constructing a simple level1 network from a triplet set such that the number of triplets in consistent with that network is maximized. Since this optimization problem is NPhard [16] in general, we propose to do this heuristically. (We remark that for small instances Lev1athan will also compare the heuristically computed simple level1 network with an optimal tree computed using Wu’s exponentialtime algorithm [32]. If the tree is consistent with at least as many triplets as the simple level1 network then Lev1athan will return the tree, parsimony again being the motivation for this. We will not discuss this step further). Once again we emphasise that if Lev1athan chooses for an Aho or JNS move when partitioning the leaf set of a triplet set, then the algorithms described in this section will not be used by Lev1athan.
To be able to describe our heuristic we need to generalize our notion of consistency which we will do next. Suppose that is a triplet set and that is a partition of the leaf set of . Suppose also that are distinct elements in and that is a network with leaf set . Then we say that is consistent with if there exist distinct sets with such that is consistent with . Furthermore, we say that is inconsistent with if is not consistent with but there do exist distinct sets with , and . Note that, with this slightly generalized definition, it is possible that a triplet in is neither consistent nor inconsistent with i.e. when contains two or more leaves that lie in the same block of .
We are now in the position to briefly outline the second phase of our heuristic. Suppose is a triplet set and is a partition of . Then if the cardinality of is moderate (by default: at most 12) we compute an exact optimal solution in exponential time. This exact algorithm is described in Section 5.1. If the cardinality of is too big to compute an exact optimal solution, we use the greedy algorithm described in Section 5.2. Note that, although stated in terms of a triplet set and a partition of (i.e. the partition chosen by the previous step of Lev1athan), both algorithms can also be used for general leaf and triplet sets: for an arbitrary set of triplets the partition of consisting of only singletons can be taken as .
5.1. Exact Algorithm
Van Iersel et al. [31] proposed an exponentialtime exact algorithm to find a level1 network consistent with a maximum number of triplets of a given set in time where and . This section describes how this running time can be improved to if an algorithm searches only for simple level1 networks. To describe such an algorithm we require two more concepts that concern phylogenetic trees. A phylogenetic tree is said to be a caterpillar if the parents of the leaves form a directed path. We say that a caterpillar ends in leaf if the sibling of is also a leaf of (each caterpillar thus ends in exactly two leaves). For example the phylogenetic tree depicted in Figure 3(a) is a caterpillar that ends in leaves and .
Our algorithm, presented in the form of pseudocode in Algorithm 2, consists of 2 steps and takes as input a triplet set and a partition of the leaf set of . It returns an optimal (in a well defined sense) simple level1 network on . It works by essentially trying each set as the leaf below the reticulation of the simple level1 network to be constructed. More precisely and with and as above, the first of its 2 steps is as follows. For each set we do the following. For each subset with , we first compute an optimal caterpillar ending in in the sense that the number of triplets in consistent with the caterpillar is maximal. To achieve this, note that the caterpillar with consists of a root and 2 distinct leaves each of which is the head of an arc starting at the root (lines 24). To find the other caterpillars, we loop through all subsets with from small to large, starting at the subsets of size three. For each such subset of , we loop through all elements and create a candidate caterpillar by creating a new root and two arcs leaving the root: one to and one to the root of (lines 59). Among all candidate caterpillars, we then select a caterpillar that is consistent with a maximum number of triplets in .
Once all the caterpillars have been constructed, an optimal simple level1 network is found in the second step as follows. We loop through all bipartitions of . First suppose that and are both nonempty. Then we combine the caterpillars and into a candidate simple level1 network as follows. Let be the arc entering in , let be the arc entering in and let denote the roots of and respectively. We add a new root and a new vertex and replace arcs and by new arcs and (lines 1014). See Figure 4(a) for an example of this construction.
Now suppose that . Let again be the arc entering in and let be the root of . We add a new root and a new reticulation and replace arc by arcs and . This leads to the candidate network . See Figure 4(b) for an example of the construction in this case. Finally, we select a network consistent with a maximum number of input triplets, over all constructed candidates and return that network.
A straightforward analysis of the above algorithm implies the following result.
Theorem 1.
Given a triplet set with and , a simple level1 network consistent with a maximum number of triplets from can be constructed in time.
We now turn our attention to presenting our greedy algorithm which follows the same philosophy as the previous algorithm i.e. try each set in as the leaf below the reticulation of a simple level1 network to be constructed.
5.2. Greedy Algorithm
With and as above, we next give the details of our greedy algorithm, which we present in the form of pseudocode in Algorithm 3. In the context of this, it should be noted that a similar strategy was shown to perform particularly well in an algorithm by Snir and Rao for the construction of phylogenetic trees from triplets [25].
For each set , we first create an initial network with three vertices and and three arcs and (lines 24), where the arc occurs twice. To this network we then add the other elements from in a greedy fashion and each time renaming the resulting network . To decide which element to insert first, and where to insert it i.e. into which nontrivial arc of , we use a score function . To present this score function suppose that and are vertices of such that is a nontrivial arc of and that . Let denote the network obtained from by removing arc and adding two vertices and and three arcs and to . Then the score is equal to the number of triplets with that are consistent with minus the number of triplets with that are inconsistent with . In other words,
Note that the definition of does not consider triplets for which contains at least one element in that has not yet been added to the network. Also, the definition does not consider triplets that are neither consistent nor inconsistent with . (This is because the role of such triplets in the final network is entirely determined by the choice of and choices made in later recursive steps).
Suppose and are such that is maximized. Then we construct a simple level1 network and insert the remaining leaves into this network by the same method (lines 58). Finally and by searching through all constructed simple level1 networks we return that the network that maximizes the number of triplets in it is consistent with (lines 910).
The algorithm can thus be summarised as follows.
6. Experiments
To better understand the behavior of Lev1athan, we
performed a biologicallymotivated simulation study using triplet sets
consistent with level1 networks of different size and
complexity and various amounts of missing data (experiment 1) and of
noise (experiment 2). To ensure not only variability
of the input triplet sets for Lev1athan but also consistency
with a level1 network, we used a novel algorithm for random
level1 network generation. After giving a general
outline of our simulation study in the next section,
we describe this algorithm in Section 6.2.
To model missing data and noise, rather then using
the triplet set induced by such a network
we used a triplet set as input for
Lev1athan where is a parameter that governs the amount of
missing data/noise in .
Details on the precise definition of
will be given in the next section.
Using a range of measures which we describe in
Section 6.3 we present the
outcomes of our study in Section 6.3.1 in case of
missing data and in Section 6.3.2 in case of noise.
6.1. General outline of our simulation study
Our simulation study consists of two experiments each of which is motivated by a situation one might encounter in a triplet based phylogenetic network analysis. The full results of (and inputs to) both experiments are available in the supplementary data section of [12]. The first experiment (Section 6.3.1  missing data) deals with the situation that only some of the triplets induced by some unknown level1 network are given. This phenomena is modeled by setting to be a randomly selected subset of . The second experiment (Section 6.3.2  noise) addresses the situation that the confidence level one might have in the triplets generated in a phylogenetic analysis might vary. In our experiment this is modeled by adding noise to . Put differently, we essentially construct by randomly selecting triplets from and for each such selected triplet randomly “flipping” it to one of the two other possible triplet topologies on .
For both simulation experiments, the general outline is as follows. We first choose some level1 network as the “seed” for our random level1 generator algorithm. For the generated level1 network , we then constructed the triplet set from and use as input for Lev1athan. The level1 network generated by Lev1athan we then compared against with regards to topology and also the four measures described in Section 6.3. In each experiment we used a total of 110 randomly generated level1 networks with leaf set size ranging between and and number of reticulations ranging between and . The running time of Lev1athan on a standard desktop computer ranged from 23 seconds for the simplest networks to 30 seconds for the most complex ones.
6.2. Generating random level1 networks
A survey of the literature suggested the NetGen [19] program to be the only available approach for systematically generating networks. Whilst NetGen addresses the issues of size (i.e. number of vertices) and network complexity one would encounter when manually constructing networks, one of its main drawbacks is the lack of guarantee that the generated network is level1 (although some happen to be level1 networks). One way to overcome this problem is to hand pick suitable networks from a (relatively small) subset of generated networks. Alternatively the list of networks generated by NetGen could be postprocessed by manually removing a sufficient number of reticulations from each network in to obtain a level1 network. A further and maybe more important drawback of NetGen is that the structure of a gall in a generated network is rather simple in the sense that its size is 4. We therefore developed our own algorithm for generating level1 networks. This algorithm is implemented in Java and freely available for download from [12]. We next describe this algorithm.
Our algorithm for generating level1 networks takes as input a level1 network on a fixed number of leaves plus a positive integer and outputs a level1 network on a larger leaf set. A pseudocode form of the algorithm is presented in Algorithm 4.
We remark that the size of the gall in clearly influences the variability of networks generated by this algorithm. Also, we remark that for the purpose of the discussed simulation study was the level1 network depicted in Figure 5(a).
Algorithm 4 starts by generating instances of and storing them in set (lines 3 – 5). Next (lines 69), for each we first randomly chose an integer with and then randomly delete vertices , from making sure that no deleted vertex is the root or a leaf of . If is a reticulation then we randomly choose one of the parents of , say , and add a new arc from to the unique child of . For any other deleted vertex we reconnect its unique parent with one of its children. The child to be reconnected is selected randomly, but a child that is not a leaf is always preferred over a child that is a leaf. We initialize our output level1 network with the network (line 10) and then iterate over , where . At each iteration we randomly select a leaf from network and replace it by (line 12) yielding a new level1 network . Finally, we randomly remove a random number of leaves and cherries from (lines 1315) and then return the resulting network which we also call . We conclude the description of Algorithm 4 by making the trivial observation that the size of depends on the number of .
To illustrate the inner workings of the level1 network generator algorithm suppose is the level1 network with leaf set depicted in Figure 5(a) with and suppose that . Then we first generate instances of that network. The deletion of vertices from each of that networks as described in line 9 of Algorithm 4 results in the networks depicted in Figures 5(b) and 5(c). Next, leaf in the network depicted in Figure 5(b) is replaced with the network pictured in Figure 5(c). The resulting network is depicted in Figure 5(d). Note that due to the small number of leaves of subNet, the operation of removing random leaves and cherries from that network (lines 1315) is not executed.
6.3. Measures
Reflecting the fact that our simulation study is aimed at assessing the reconstructive power of Lev1athan by measuring the similarity between a level1 network and the level1 network generated by Lev1athan when given , we considered 4 different measures. To be able to describe these measures, we need to fix some notation first. For the remainder of this section and unless stated otherwise, let be a level1 network and let denote the network that Lev1athan generates when given as input.
The first measure is the networknetwork triplet consistency measure. For and we define the networknetwork triplet consistency measure as the quantity
Loosely speaking, is the proportion of “definitely correct” triplets (i.e. ) that are consistent with . Thus is always a real number between and . A variant of this, the tripletnetwork triplet consistency measure , pays less heed to the origin or accuracy of the input triplets and is defined for an arbitrary triplet set and a phylogenetic network by putting
In other words, is the fraction of triplets in that is consistent with the network . Note that this measure is different from the triplets distance introduced in [6].
The third of our triplet based measures is inspired by the quartet distance for unrooted phylogenetic trees [3] and is called the networknetwork symmetric difference. For and the network network symmetric difference between and is defined as the quantity
is thus the number of triplets that appear in but not in , or viceversa. Note that in this measure and are compared with regards to their induced triplet sets, while was generated in response to the set derived from . As already noted above for the networknetwork triplet consistency measure, the networknetwork symmetric difference measure also suggests a natural variant of it, , defined for an arbitrary triplet set and a network by putting .
Regarding the and measures it should be noted that the former might be more powerful than the latter. To see why consider again the example of the triplet set induced by the caterpillar on depicted in Figure 3(a). As already pointed out earlier, the level1 network on depicted in Figure 3(b) is also consistent with and no arc of can be deleted from so that consistency with is preserved. If we take , then (in the context of networknetwork triplet consistency) and are equally good level1 networks for representing since . However, under the networknetwork symmetric difference measure would be preferable over , because of precisely that proper subset relationship.
The final measure we considered is the distance which was originally introduced in [7]. To define this measure which was shown in [5, 7] to be a metric for a certain class of networks (i.e. treechild networks), which includes the class of level1 networks as a subclass, we require some more notation. Suppose is a phylogenetic network on some set , , and is a vertex of . Then the vector can be associated to where for all the quantity represents the number of different paths in from to leaf . With denoting by the multiset of all vectors , a vertex of , and calling the representation of , the distance between any two phylogenetic networks and is defined as
where the symmetric difference operator is defined here over multisets.
Armed with these measures and our algorithm for randomly generating level1 networks, we are now ready to describe the results of our simulation study. (We note that, because the source networks used in the simulation had no vertices of outdegree 3 or higher, and for technical reasons, the optional postprocessing phase used by Lev1athan was switched off during these simulations). Assume then from now on that is a level1 network generated by our random level1 network generator described in Algorithm 4 and that the definition of the network is as before. We start with presenting our results for the missing data experiment.
Simulation results  missing data experiment
Central to this experiment is the parameter which indicates the probability that a triplet in will be included in . The values of that we investigated were (i.e. all triplets in ), .
Modulo a wellunderstood exception (described below) all networks generated via Algorithm 4 were recovered correctly by Lev1athan when . This is a consequence of the fact that Lev1athan prioritises JNS moves and that a level1 network is completely defined by up to, but not including, galls of size 4 [10]. This is a drawback of any triplet based phylogenetic network approach since such approaches have to make a choice between the three galls on a set {} that are all consistent with . This inability to distinguish between the topologies of size4 galls does not, of course, affect the tripletbased measures, and for we had indeed in all cases that and . When Lev1athan correctly guessed the topologies of all size4 galls in a network we additionally had , but this value became nonzero when the guess was incorrect.
For all other values of and all networks , we frequently observed that  although similar when inspected visually  some of the galls from were not reconstructed correctly in the generated level1 network , in the sense that the size of a gall in was smaller than in the corresponding gall in (e.g. a size 5 gall in became size 4 gall in , see Figure 6). Although this observation is clearly dependent on the specific value used for , it generally meant that one or more vertices had been pushed out of a gall in to its sides, causing what we will call typical gall damage for . In turn this means that subnetworks hanging from galls are sometimes merged by Lev1athan into a single subnetwork. Even in the presence of typical gall damage, however, Lev1athan sometimes (but not always) correctly inferred which leaf of needed to be placed below the reticulation of the damaged gall. Interestingly, we also observed that typical gall damage was rare for galls  even for low vales of  if the subnetworks hanging from contain many leaves. Expressed differently, the likelihood that a gall in suffers typical gall damage for is higher if the gall is closer to the leaves. The reason for this might be that is supported by far fewer triplets then a gall closer to the root.
It should be noted that the exception to the former of the last two observations is the galled caterpillar network [4, 16] which can be thought of as a natural level1 generalization of a caterpillar. Such networks were correctly reconstructed by Lev1athan for .
In addition, we observed for all networks that when dropped from 1.0 to 0.9, the networknetwork triplet consistency between and also drops slightly (often from 1 to a value in the range 0.95  0.99) but immediately stabilises around that value, even for very low values of . This phenomenon even occurs when there are very few or even no Aho or JNS moves occurring, suggesting that Heuristic moves are (in this context) good for sustaining a very high value of triplet consistency. However, and for all values of other than and all networks , we also observed that even when is very close to 1, is often not recovered correctly by Lev1athan from . An example that illustrates this point is the triplet set of the level1 network depicted in Figure 6(b). This triplet set together with the triplet is the triplet set induced by the level1 network depicted in Figure 6(a). The network generated by Lev1athan from is the level1 network and . Thus, rather than being a suitable tool for assessing Lev1athan’s reconstructive power, the measure might be of limited use for capturing differences between networks.
For all values of and averaged over all networks for , we observed an initial sharp rise for both the networknetwork symmetric difference and the distance when drops from to . Again in both cases this initial rise is followed by a a much slower growth rate although this rate seems to be higher for the distance. Encouragingly, we found instances of networks where for Lev1athan correctly inferred the missing triplet information, that is, correctly reconstructed from .
Simulation results  noise experiment
Central to this experiment is again the parameter which in this case is an error probability and which we state in terms of values between and . Its purpose is to introduce noise into and the range we considered was and . More precisely, we generated from by taking each triplet and with probability decide to corrupt it, that is, replace in with one of the 2 other phylogenetic trees on , chosen uniformly at random.
We observed that even for very low error probability values, the triplet set is unlikely to be consistent with a level1 network. It is thus not surprising that we often observed additional (erroneous) reticulations in networks reconstructed by Lev1athan from . Nevertheless and based on visual inspection, our experiment seems to indicate that even for slightly higher values of , i. e , and all networks , much of the structure of is recovered correctly by Lev1athan from . Having said that, and in addition to the above mentioned erroneous reticulations, typical gall damage is common in the generated networks.
For all other than 0.00 and averaged over all networks for , we observed that, as expected, decreased linearly with increasing error probability in the sense that holds. Reassuringly (and by no means obviously) this almost linear relationship is not obeyed by the networknetwork triplet consistency measure . By averaging for each error probability value over all networks , we summarize our findings for that measure in Figure 7 in terms of plotting the error probability values for against the corresponding networknetwork triplet consistency values. Intriguingly, there is an initial 2% drop after which networknetwork triplet consistency remains high until error rates reach values of . It should be noted that this in fact corresponds to a high level of noise in given that if is a phylogenetic tree and then is a completely randomized triplet set and all structural information contained in concerning has been lost. The most likely explanation for the initial 2% drop is probably (again) typical gall damage since, as alluded to above, a given gall represents more triplets then its a galldamaged counterpart. The fact that after the initial drop networknetwork triplet consistency remains high, is encouraging, because it shows that Lev1athan holds promise for reconstructing triplets that have not been corrupted by noise, up to quite a high level of noise.
For all values of and averaged over all networks for , we also observed an initial sharp increase when increases form to for both the networknetwork symmetric difference and the difference. After this, the values for both measure grow very slowly with the growth rate for the distance being higher until reaches when the growth rate of the networknetwork symmetric difference takes over. Intriguingly the networknetwork symmetric difference seems to reach a peak at and then drops of again for . However and in the light of the fact that, as pointed out above, is already a high error probability the later 2 observations should be treated with caution. A possible reason for the above mentioned initial sharp rise might be that even one corrupted triplet can potentially introduce erroneous (additional) galls in . Combined with the problem of typical gall damage, this can greatly affect the triplet sets induced by and as well as their representations and thus the networknetwork symmetric difference and the distance which rely on these concepts respectively.
Interestingly we identified some networks such that when (and additionally ) we nevertheless had that . Visual inspection revealed that in such cases was equal to . This suggests that if the noise level in an input triplet set is small enough Lev1athan might still be able to correctly reconstruct the level1 network that induced that triplet set.
6.4. A HIV1 virus data set
To illustrate the applicability of our approach, we reanalyzed a HIV1 virus data set that originally appeared in [21, Chapter 14]. All but one of the sequences (KAL153) making up that data set are 50 percent consensus sequences for the HIV1 Mgroup subtypes , , and with the KAL153 strain being thought to be a recombinant of subtypes and . Recombinants such as KAL153 can essentially be thought of as having arisen via the transfer and integration of genetic material from one evolutionary lineage into another. The positions in an existing sequence where the foreign genetic material was integrated are generally called breakpoints and many approaches for detecting recombination aim to identify these breakpoints.
For the above data set two breakpoints were identified (positions 2700 and 8926) in [21, Chapter 14]. Furthermore, for the three induced subalignments 12699, 2700 8925, and 89269953 the Neighbor Joining method [20] (with subtype as outgroup and the K2P model [18]) was used to represent the data in terms of arcweighted phylogenetic trees (see [21, page 159] for a depiction of those trees). Since the resolution patterns for and in the first tree, and in the second, and and in the third tree was not clear, we recomputed those trees using the above settings. Reassuringly and when arc weights were ignored, this resulted in the same phylogenetic trees for the first and third subalignment as in the previous analysis except that the unresolved vertex in each tree was now resolved. For the second subalignment, the same tree was obtained. For the convenience of the reader, we depict these phylogenetic trees in Figure 8.
As expected, the position of the KAL153 strain is the same in the first and third topology but different in the second topology. However and somewhat surprisingly the resolution order of subtypes and is in the first and third tree is different as well as the placement of subtype in the tree. Having said this, the branch weights that support these different resolution orders are very small. To therefore not allow this to overly influence our analysis (after all our method as well as the other two methods that we tried out are using phylogenetic trees rather than arcweighted phylogenetic trees as input and therefore these different levels of support are not taken into account), we only used the first and the second and the third and second tree as respective inputs for our analysis.
Interestingly the second phylogenetic tree postulates the triplet on subtypes , and whereas the first tree hypotheses . Since this conflict also interferes with the conflicting information for subtypes , and KAL153, attempting to combine the triplet set generated from both phylogenetic trees into a general level1 network is problematic as such a network would postulate 2 recombination events for this data set. To avoid this and at the same time identify the stronger of the two conflicting signals it is therefore more useful to construct (using Algorithm 2) an optimal simple level1 network, which by definition has only one reticulation vertex. This type of network is depicted for the first and second tree from Figure 8 in Figure 9. As expected, the network correctly identifies the strain as a recombinant of subtypes and .
It should be noted that repeating this analysis for the second and third tree in Figure 8. resulted in the same simple level1 network as the one depicted in Figure 9 except that the order of and was reversed. The same holds when the optimal simple level1 network is computed for the respective original phylogenetic trees from [21, Chapter 14] when ignoring arcweights. Interestingly and in case of the second and the third tree, exclusion of subgroup also resulted in a simple level1 network that correctly identified the KAL153 as recombinant. However this was not the case when this analysis was repeated for the first and second tree.
We conclude this section with remarking that although using different philosophies, the other two known approaches i.e. Cluster networks and Galled networks (both of which are implemented in Dendroscope [14]) also had problems with this data set, postulating between 2 and 4 recombination events (data not shown).
7. Conclusions
In this paper, we have presented a heuristic for constructing level1 networks from triplet sets which we have also implemented in Java as the freely available program Lev1athan [12]. Guided by the principle of triplet consistency, our heuristic aims to optimize a welldefined objective function on triplet sets without generating pessimistically complex networks for them. By running in polynomial time and always returning a level1 network, it addresses several of the problems that frequently occur with existing network algorithms from such input sets. Using both a biologically motivated simulation study and a biological data set, we have also explored Lev1athan’s applicability to real biological data.
Based on the outcomes of our simulation study, it appears that our heuristic is able to recover, in terms of the triplet set induced by the generated level1 network, a high percentage of the input triplets that were also present in the original network (as opposed to triplets that were missing or that had been corrupted). This is nota bene also true when the input triplet set is not dense, which (in light of the NPhardness of the nondense case) is an encouraging observation. On the other hand (and probably not surprisingly as our heuristic tries to reconstruct a global structure from local information) it seems that it is more vulnerable to noise in an input triplet set than missing data. The most probable reason for this is that the former type of problem can sometimes be rectified via implicitly inferred triplet information, whereas the latter type of problem has the capacity to actively mislead. Having said that, Lev1athan shows encouraging potential if the amount of missing data is low or there is only very little noise in the data.
In general using more of the triplets induced by a network as input for Lev1athan allows larger regions of that network to be recovered by it. Having said that, when confronted with missing data or noise, even extremely high values of networknetwork triplet consistency (e.g. above 0.95) do not preclude nontrivial differences between the original network and the network generated by Lev1athan. Additionally, the noisier a input triplet set for our heuristic is, the greater the chance that the network found by it is distorted (e.g. through the emergence of surplus galls in the generated network). To tackle this problem Lev1athan has the option to label each arc of the generated network with its deletion support value, see Figure 9. This allows the user to explore which reticulations are weakly supported, and thus might be superfluous or even artefacts of our heuristic.
Although the four tripletbased measures used in this article appear to be very natural they only seem to be of limited use for capturing network differences in general. However some of them helped to identify cases where a network generated by Lev1athan from coincided with the original network .
Our reanalysis of a biological data set from [21] using Lev1athan indicated that this data set is more complex than it appears at first sight. The resulting conflicting triplet infomation misled Lev1athan (and also the other two network construction approaches that we tried) to postulate a too complex evolutionary scenario when using its default option of generating a level1 network. However Lev1athan’s simple level1 network option had no problem with this data set and was able to correctly reconstruct the data set’s widely accepted evolutionary scenario.
To understand better how well Lev1athan performs, it will be necessary to compare it directly with an alternative method for network construction that uses similar input and also produces a level1 network. Such methods are lacking at the moment. Similarly, it is at the moment difficult to draw conclusions regarding the biological meaning of measures such as the distance, which we used in our simulation study. Comparison should also be made between Lev1athan and other programs when the input is strictly simpler, or strictly more complex, than level1 networks. Figures 10 and 11 already provide interesting insights. Figure 10 shows a level2 network created by the Level2 algorithm of [28] which is consistent with all 1330 triplets in the yeast dataset discussed in that same article. For the same dataset Lev1athan constructs the level1 network in Figure 11; this is consistent with of the 1330 triplets. Both networks cluster the taxa together similarly; the main difference is in the Lev1athan network taxa 12 and 5 have been pushed further away from the root, whilst taxon 8 has risen closer to the root. However which one of these two networks is biologically more relevant is not immediately clear. In any case it should once again be noted that Lev1athan is in many regards much more flexible than the Level2 algorithm (and related algorithms such as [26, 29, 30]) because Lev1athan always quickly returns a network and it does not require the input triplets to be dense or fully consistent with the output network. (In particular, the authors of [28] had to use timeconsuming bruteforce techniques to first find a subset of the taxa that induced a triplet set fully consistent with a level2 network).
It is also necessary to look more deeply at the underlying mathematics of Lev1athan. For example, the partitioning strategy at its heart (i.e. Phase 1) is a modification of a strategy that was originally optimized for worstcase performance, not averagecase performance. Yet this strategy also seems to perform surprisingly well in the average case. Understanding why this is, and developing a better appreciation for the theoretical limits and strengths of triplet methods as a mechanism for reconstructing network topologies, remains a fascinating area of research.
8. Acknowledgements
We thank Vincent Moulton for many very helpful conversations during the writing of this paper. Leo van Iersel was funded by the Allan Wilson Centre for Molecular Ecology and Evolution and Steven Kelk by a Computational Life Sciences grant of The Netherlands Organisation for Scientic Research (NWO).
Footnotes
 The analogue of a triplet within the area of unrooted phylogenetic tree reconstruction.
 To avoid finishing with we allow in the first, and only the first, iteration an operation to be applied as long as this does not decrease the score. After this operations are only permitted if they strictly increase the score.
References
 A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.
 O. BinindaEmonds, editor. Phylogenetic Supertrees: Combining information to reveal the Tree of Life. Kluwer Academic Publishers, 2004.
 D. Bryant, J. Tsang, P. E. Kearney, and M. Li. Computing the quartet distance between evolutionary trees. In SODA, pages 285–286, 2000.
 J. Byrka, P. Gawrychowski, K. T. Huber, and S. Kelk. Worstcase optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks. Journal of Discrete Algorithms, 2009. To appear.
 G. Cardona, M. Llabrés, F. Rosselló, and G. Valiente. Metrics for phylogenetic networks I: Generalizations of the robinsonfoulds metric. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(1):46–61, 2009.
 G. Cardona, M. Llabrés, F. Rosselló, and G. Valiente. Metrics for phylogenetic networks II: Nodal and triplets metrics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(3):454–469, 2009.
 G. Cardona, F. Rosselló, and G. Valiente. Comparison of treechild phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. To appear, http://arxiv.org/abs/0708.3499.
 J. Felsenstein, J. Archie, W. H. Day, W. Maddison, C. Meacham, F. J. Rohlf, and D. Swofford. The newick tree format, 1986.
 P. Gambette. Who’s who in phylogenetic networks, 2009. http://www.lirmm.fr/~gambette/PhylogeneticNetworks/.
 P. Gambette and K. T. Huber. A note on encodings of phylogenetic networks of bounded level. Technical Report arXiv:0906.4324, June 2009.
 E. Gansner, E. Koutsofios, and S. North. Drawing graphs with dot. Technical report, AT&T Bell Laboratories, 2006.
 K. T. Huber, L. van Iersel, S. Kelk, and R. Suchecki. LEV1ATHAN: A level1 heuristic, September 2009. http://homepages.cwi.nl/~kelk/lev1athan/.
 D. Huson. Reconstructing Evolution  New Mathematical and Computational Advances, chapter Split Networks and reticulate networks. Oxford University Press, 2007.
 D. Huson, D. C. Richter, C. Rausch, M. Franz, and R. Rupp. Dendroscope: An interactive viewer for large phylogenetic trees. BMC Bioinformatics, 8(1):460, 2007.
 D. H. Huson and R. Rupp. Phylogenetic Networks. Cambridge University Press. To appear.
 J. Jansson, N. B. Nguyen, and W.K. Sung. Algorithms for combining rooted triplets into a galled hylogenetic network. SIAM Journal on Computing, 35(5):1098–1121, 2006.
 J. Jansson and W.K. Sung. Inferring a level1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science, 363(1):60–68, 2006.
 M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16(2):111–120, June 1980.
 M. M. Morin and B. M. E. Moret. Netgen: generating phylogenetic networks with diploid hybrids. Bioinformatics, 22(15):1921–1923, 2006.
 M. Nei and S. Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, 2000.
 M. Salemi and V. A. M. (ed). The phylogenetic handbook. A practical approach to DNA and protein phylogeny. Cambridge University Press, UK, 2003.
 H. A. Schmidt, K. Strimmer, M. Vingron, and A. von Haeseler. Treepuzzle: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics, 18(3):502–504, 2002.
 C. Semple. Reconstructing Evolution  New Mathematical and Computational Advances, chapter Hybridization Networks. Oxford University Press, 2007.
 C. Semple and M. Steel. Phylogenetics, volume 24 of Oxford Lecture Series in Mathematics and its Applications. Oxford University Press, 2003.
 S. Snir and S. Rao. Using max cut to enhance rooted trees consistency. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 3(4):323–333, 2006.
 T.H. To and M. Habib. Levelk phylogenetic networks are constructable from a dense triplet set in polynomial time, 2009. http://arxiv.org/abs/0901.1657.
 G. Valiente. Extended newick: it is time for a standard representation of phylogenetic networks. BMC Bioinformatics, 9:532+, December 2008.
 L. J. J. van Iersel, J. C. M. Keijsper, S. M. Kelk, L. Stougie, F. Hagen, and T. Boekhout. Constructing level2 phylogenetic networks from triplets. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2009. To appear.
 L. J. J. van Iersel and S. M. Kelk. SIMPLISTIC: Simple network heuristic, 2008. http://homepages.cwi.nl/~kelk/simplistic.html.
 L. J. J. van Iersel and S. M. Kelk. Constructing the simplest possible phylogenetic network from triplets. Algorithmica, 2009. To appear.
 L. J. J. van Iersel, S. M. Kelk, and M. Mnich. Uniqueness, intractability and exact algorithms: Reflections on levelk phylogenetic networks. Journal of Bioinformatics and Computational Biology, 7(2):597–623, 2009.
 B. Y. Wu. Constructing the maximum consensus tree from rooted triples. Journal of Combinatorial Optimization, 8(1):29–39, 2004.