Beyond representing orthology relations by trees

K.T. Huber, G. E. Scholz
July 19, 2019
Abstract.

Reconstructing the evolutionary past of a family of genes is an important aspect of many genomic studies. To help with this, simple operations on a set of sequences called orthology relations may be employed. In addition to being interesting from a practical point of view, they are also attractive from a theoretical perspective in that, e.g., a characterization is known for when such a relation is representable by a certain type of phylogenetic tree. For an orthology relation inferred from real biological data, however, it is generally too much to hope that it satisfies that characterization. Rather than trying to correct the data in some way, which has its own drawbacks, we propose to represent an orthology relation in terms of a structure more general than a phylogenetic tree, called a phylogenetic network. To compute such a network in the form of a level-1 representation for an orthology relation, we introduce the novel Network-Popping algorithm, which has several attractive properties. In addition, we characterize orthology relations on a set that have a level-1 representation in terms of eight natural properties of the relation, as well as in terms of level-1 representations of orthology relations on certain subsets of that set.

School of Computing Sciences, University of East Anglia, UK

1. Introduction

Unraveling the evolutionary past of a family of genes is an important aspect of many genomic studies. For this, it is generally assumed that the genes in the family are orthologs, that is, have arisen from a common ancestor through speciation. However, it is known that shared ancestry of genes can also arise via whole genome duplication (paralogs). This potentially obscures the signal used for reconstructing the evolutionary past of the genes in the form of a gene tree (essentially a rooted tree whose leaves are labelled by the genes of the family; we present precise definitions of the main concepts used in the next section). To tackle this problem, tree-based approaches have been proposed. These typically work by reconciling a gene tree with an assumed further tree (species tree) in terms of a map that operates on their vertex sets. For this, certain evolutionary events are postulated such as the ones mentioned above (see e.g. [12] for a recent review as well as e.g. [10] and the references therein).

To overcome the problem that the resulting reconciliation very much depends on the quality of the employed trees and also that such approaches can be computationally demanding for larger datasets, orthology relations have been proposed as an alternative. These operate directly on the set of sequences from which a gene tree is built (see e. g. [1]). In addition to having attractive practical properties, such relations are also interesting from a theoretical point of view due to their relationship with e. g. co-trees (see e. g. [5, 6]). Furthermore, a characterization is known for when an orthology relation can be represented in terms of a certain type of phylogenetic tree [5].

Due to, e.g., errors or noise in an orthology relation, it is in general too much to hope that an orthology relation obtained from a real biological dataset satisfies that characterization. A natural strategy therefore might be to try and correct for this in some way. As was pointed out in [11], however, even if an underlying tree-like evolutionary scenario is assumed for this, many natural formalizations lead to NP-complete problems. Furthermore, true non-treelike evolutionary signal such as hybridization might be overlooked. As an alternative, we propose to represent orthology relations in terms of phylogenetic networks. These naturally generalize phylogenetic trees by permitting additional edges. To infer such a structure from an orthology relation, we introduce the novel Network-Popping algorithm

Figure 1. Three distinct level-1 representations for the symbolic 3-dissimilarity on induced by . However, only is returned by Network-Popping when given . In all three cases the underlying phylogenetic network is a level-1 network. Furthermore, , is not semi-discriminating but weakly labelled whereas is semi-discriminating but not weakly labelled – See text for details.

which returns a level-1 representation of that relation in the form of a structurally very simple phylogenetic network called a level-1 network (see e.g. Fig. 1 for examples of such representations, where the two symbols labelling interior vertices represent two distinct evolutionary events such as speciation and whole genome duplication, and the unlabelled interior vertices represent hybridization events).

Bearing in mind the point made in [3, Chapter 12], that -estimates, , are potentially more accurate than mere distances as they capture more information, we formalize an orthology relation in terms of a symbolic 3-dissimilarity rather than a symbolic 2-dissimilarity (i.e. a distance), as was the case in [5]. From a technical point of view this also allows us to overcome the problem that using a symbolic 2-dissimilarity in a network context can be problematic. An example illustrating this is furnished by the three level-1 representations depicted in Fig. 2 which all represent the same 2-dissimilarity induced by taking the lowest common ancestor between pairs of leaves.

Figure 2. Three distinct level-1 representations of the 2-dissimilarity defined by taking lowest common ancestors of pairs of leaves.

As we shall see, algorithm Network-Popping is guaranteed to find, in polynomial time, a level-1 representation of a symbolic 3-dissimilarity if such a representation exists. For this, it relies on the three further algorithms below, which we also introduce. It works by first finding, for a symbolic 3-dissimilarity, all pairs of subsets of the underlying set that support a cycle, using algorithm Find-Cycles. Subsequent to this, it employs algorithm Build-Cycles to construct from each such pair a structurally very simple level-1 representation for the induced symbolic 3-dissimilarity. Combined with algorithm Vertex-Growing, which constructs a symbolic discriminating representation for a symbolic 2-dissimilarity, Network-Popping then recursively grows the level-1 representation for the input dissimilarity by repeatedly applying algorithms Build-Cycles and Vertex-Growing in concert. For the convenience of the reader, we illustrate all four algorithms by means of the level-1 representations depicted in Fig. 1. As part of our analysis of algorithm Network-Popping, we characterize level-1 representable symbolic 3-dissimilarities in terms of eight natural properties (P1)–(P8) that they enjoy (Theorem 7.1). Furthermore, we characterize such dissimilarities in terms of level-1 representable symbolic 3-dissimilarities on certain subsets of the underlying set (Theorem 8.3). Within a Divide-and-Conquer framework, the resulting speed-up of algorithm Network-Popping might allow it to also be applicable to large datasets.

The paper is organized as follows. In the next section, we present basic definitions and results. Subsequent to this, we introduce in Section 3 the crucial concept of a -trinet associated to a symbolic 3-dissimilarity and state Property (P1). In Section 4, we present algorithm Find-Cycles as well as Properties (P2) and (P3). In Section 5, we introduce and analyse algorithm Build-Cycles. Furthermore, we state Properties (P4) – (P6). In Section 6, we present algorithms Vertex-Growing and Network-Popping. As suggested by the example in Fig. 1, algorithm Network-Popping need not return the level-1 representation of a symbolic 3-dissimilarity that induced it. Employing a further algorithm called Transform, we address in Section 7 the associated uniqueness question (Corollary 7.5). As part of this we establish Theorem 7.1 which includes stating Properties (P7) and (P8). In Section 8, we establish Theorem 8.3. We conclude with Section 9 where we present research directions that might be worth pursuing.

2. Basic definitions and results

In this section, we collect relevant basic terminology and results concerning phylogenetic networks and symbolic 2- and 3-dissimilarities. From now on and unless stated otherwise, denotes a finite set of size , denotes a finite set of symbols of size at least two and denotes a symbol not already contained in . Also, all directed/undirected graphs have no loops or multiple directed/undirected edges.

2.1. Directed acyclic graphs

Suppose is a rooted directed acyclic graph (DAG), that is, a DAG with a unique vertex with indegree zero. We call that vertex the root of , denoted by . Also, we call the graph obtained from by ignoring the directions of its edges the underlying graph of . By abuse of terminology, we call an induced subgraph of a cycle of if the induced subgraph of is a cycle of . We call a vertex of an interior vertex of if is not a leaf of , where we say that a vertex is a leaf if the indegree of is one and its outdegree is zero. We denote the set of interior vertices of by and the set of leaves of by . We call a vertex of a tree vertex if the indegree of is at most one and its outdegree is at least two, and a hybrid vertex of if the indegree of is two and its outdegree is not zero. The set of interior vertices of that are not hybrid vertices of is denoted by . We say that is binary if, with the exception of , the indegree and outdegree of each of its interior vertices add up to three. Finally, we say that two DAGs and with leaf set are isomorphic if there exists a bijection from to that extends to a (directed) graph isomorphism between and which is the identity on .
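The vertex types just defined depend only on indegree and outdegree, so they can be read off an edge list. The following is a minimal sketch under our own assumptions: the function name `classify_vertices`, the edge-list input format and the catch-all label `"other"` are illustrative and not part of the paper, and the sketch does not verify that the input is acyclic.

```python
from collections import defaultdict


def classify_vertices(edges):
    """Classify vertices of a rooted DAG given as (parent, child) pairs.

    Returns (root, kinds), where kinds maps each vertex to "leaf",
    "hybrid", "tree" or "other", following the degree conditions above.
    Assumption: the caller guarantees that the input is acyclic.
    """
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    vertices = set()
    for parent, child in edges:
        outdeg[parent] += 1
        indeg[child] += 1
        vertices.update((parent, child))

    # A rooted DAG has a unique vertex of indegree zero.
    roots = [v for v in vertices if indeg[v] == 0]
    if len(roots) != 1:
        raise ValueError("expected exactly one vertex of indegree zero")

    kinds = {}
    for v in vertices:
        if indeg[v] == 1 and outdeg[v] == 0:
            kinds[v] = "leaf"      # indegree one, outdegree zero
        elif indeg[v] == 2 and outdeg[v] != 0:
            kinds[v] = "hybrid"    # indegree two, outdegree not zero
        elif indeg[v] <= 1 and outdeg[v] >= 2:
            kinds[v] = "tree"      # indegree at most one, outdegree at least two
        else:
            kinds[v] = "other"     # none of the three types above
    return roots[0], kinds
```

For instance, on the edge list `[("r","a"), ("r","b"), ("a","x"), ("a","h"), ("b","h"), ("b","z"), ("h","y")]` the vertex `h` has indegree two and is classified as a hybrid vertex, while `r`, `a` and `b` are tree vertices.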

2.2. Phylogenetic networks and last common ancestors

A (rooted) phylogenetic network (on ) is a rooted DAG that does not contain a vertex that has indegree and outdegree one and . In the special case that a phylogenetic network is such that each of its interior vertices belongs to at most one cycle, we call it a level-1 (phylogenetic) network (on ). Note that a phylogenetic network may contain cycles of length three and that a phylogenetic network that does not contain a cycle is called a phylogenetic tree (on ).

For the following, let denote a level-1 network on . For with , we denote by the subDAG of induced by (suppressing any resulting vertices that have indegree and outdegree one). Clearly, is a phylogenetic network on .

Suppose is a non-leaf vertex of . We say that a further vertex is below if there is a directed path from to and call the set of leaves of below the offspring set of , denoted by . Note that is closely related to the hardwired cluster of induced by (see e.g. [9]). For a leaf , we refer to as an ancestor of . In case is a phylogenetic tree, we define the lowest common ancestor of two distinct leaves to be the (necessarily unique) ancestor such that and holds for all children of . More generally, for with , we denote by the unique vertex of such that , and holds for all children of . Note that in case the tree we are referring to is clear from the context, we shall write rather than .

It is easy to see that the notion of a lowest common ancestor is not well-defined for phylogenetic networks in general. However the situation changes in case the network in question is a level-1 network, as the following central result shows. Since its proof is straight-forward, we omit it.

Lemma 2.1.

Let be a level-1 network on and assume that such that . Then there exists a unique interior vertex such that but , for all children of . Furthermore, there exist two distinct elements such that .

Continuing with the terminology of Lemma 2.1, we shall refer to as the lowest common ancestor of in , denoted by . As in the case of a phylogenetic tree, we shall write rather than if the network we are referring to is clear from the context.
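To make the lowest common ancestor of Lemma 2.1 concrete, the following sketch computes it from offspring sets. It rests on our reading of the lemma, which we state as an assumption: the lowest common ancestor of a leaf subset is the interior vertex whose offspring set contains the subset while the offspring set of none of its children does. The function names, the dictionary-based network encoding and the convention that a leaf is its own offspring set are ours and purely illustrative.

```python
def offspring_sets(children, leaves):
    """Map every vertex to the set of leaves below it.

    children: dict mapping each interior vertex to a list of its children.
    leaves:   iterable of leaf names.
    Convention (ours): a leaf is assigned the singleton containing itself.
    """
    leaves = set(leaves)
    memo = {}

    def below(v):
        if v not in memo:
            if v in leaves:
                memo[v] = frozenset([v])
            else:
                memo[v] = frozenset().union(*(below(c) for c in children[v]))
        return memo[v]

    for v in list(children) + list(leaves):
        below(v)
    return memo


def lowest_common_ancestor(children, leaves, subset):
    """Return the interior vertex whose offspring set contains `subset`
    while no child's offspring set does (assumed reading of Lemma 2.1)."""
    subset = frozenset(subset)
    if len(subset) < 2 or not subset <= set(leaves):
        raise ValueError("subset must consist of at least two leaves")
    below = offspring_sets(children, leaves)
    candidates = [
        v for v, kids in children.items()
        if kids
        and subset <= below[v]
        and all(not subset <= below[c] for c in kids)
    ]
    if len(candidates) != 1:
        # Uniqueness is only expected for level-1 networks.
        raise ValueError("no unique candidate; is the input level-1?")
    return candidates[0]
```

As a usage example, take the network with `children = {"r": ["a", "b"], "a": ["x", "h"], "b": ["h", "z"], "h": ["y"]}` and leaves `x`, `y`, `z`. The sketch returns `a` for the pair `{x, y}`, `b` for `{y, z}`, and `r` for both `{x, z}` and `{x, y, z}`, in line with the last statement of the lemma that some pair already attains the lowest common ancestor of the whole subset.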

2.3. Symbolic dissimilarities and labelled level-1 networks

Suppose . We denote by the set of subsets of of size , and by the set of nonempty subsets of of size at most . We call a map a symbolic -dissimilarity on with values in if, for all , we have that if and only if . To improve clarity of exposition, we shall refer to as a symbolic -dissimilarity on if the set is of no relevance to the discussion. Moreover, for , , we shall write rather than where the order of the elements , is of no relevance to the discussion.

A labelled (phylogenetic) network (on ) is a pair consisting of a phylogenetic network on and a labelling map . If is a level-1 network then is called a labelled level-1 network. To improve clarity of exposition we shall always use calligraphic font to denote a labelled phylogenetic network.

Suppose is a labelled level-1 network on such that its vertices in are labelled in terms of . Then we denote by the symbolic 3-dissimilarity on induced by given by if , and otherwise. For a further labelled level-1 network on , we say that and are isomorphic if and are isomorphic and .

Conversely, suppose is a symbolic 3-dissimilarity on . In view of Lemma 2.1, we call a labelled level-1 network on a level-1 representation of if . For ease of terminology, we shall sometimes say that is level-1 representable if the labelled network we are referring to is of no relevance to the discussion. We call a level-1 representation of semi-discriminating if does not contain a directed edge such that except for when there exists a cycle of with . For example, all three labelled level-1 networks depicted in Fig. 1 are level-1 representations of where is the labelled level-1 network depicted in Fig. 1(ii). Furthermore, the representations of presented in Fig. 1(i) and (iii), respectively, are semi-discriminating whereas the one depicted in Fig. 1(ii) is not.

Figure 3. (i) A labelled level-1 network on . (ii) and (iv) Semi-discriminating level-1 representations of restricted to and , respectively. (iii) A level-1 representation of in the form of a labelled trinet that is not a -trinet.

Note that in case is a phylogenetic tree on the definition of a semi-discriminating level-1 representation for reduces to that of a discriminating symbolic representation for the restriction of to (see [2] and also [5, 13] for more on such representations). Using the concept of a symbolic ultrametric, that is, a symbolic 2-dissimilarity for which, in addition, the following two properties are satisfied

  1. for all ;

  2. there exist no four elements such that

such representations were characterized by the authors of [2] as follows.

Theorem 2.2.

[2, Theorem 7.6.1] Suppose is a 2-dissimilarity on . Then there exists a discriminating symbolic representation of if and only if is a symbolic ultrametric.

Clearly, it is too much to hope for that any symbolic 3-dissimilarity has a level-1 representation. The question therefore becomes: Which symbolic 3-dissimilarities have such a representation? A first partial answer is provided by Theorem 2.2 and Lemma 2.1 for not but its restriction . More precisely, has a discriminating symbolic representation if and only if is a symbolic ultrametric and, for all distinct, is the (unique) element appearing at least twice in the multiset .

3. -triplets, -tricycles, and -forks

To make a first inroad into the aforementioned question, we next investigate structurally very simple level-1 representations of symbolic 3-dissimilarities. As we shall see, these will turn out to be of fundamental importance for our algorithm Network-Popping (see Section 6) as well as for our analysis of its properties. In the context of this, it is important to note that although triplets (i. e. binary phylogenetic trees on 3 leaves) are well-known to uniquely determine (up to isomorphism) phylogenetic trees this does not hold for level-1 networks in general [4]. To overcome this problem, trinets, that is, phylogenetic networks on three leaves were introduced in [7]. For the convenience of the reader, we depict in Fig. 4 all 12 trinets on from [7] that are also level-1 networks in our sense. In the same paper it was observed that even the slightly more general 1-nested networks are uniquely determined by their induced trinet sets (see also [8] for more on constructing level-1 networks from trinets, and [14] for an extension of this result to other classes of phylogenetic networks).

Figure 4. The twelve trinets in the form of level-1 networks. The two omitted trinets from [7] are not level-1 networks in our sense.

Perhaps not surprisingly, trinets on their own are not strong enough to uniquely determine labelled level-1 networks in the sense that any two level-1 representations of a symbolic 3-dissimilarity must be isomorphic. To see this, suppose and consider the symbolic 3-dissimilarity that maps and every 2-subset of to . Then the labelled network where maps every vertex in to is a semi-discriminating level-1 representation of and so is the labelled network , where every vertex in is mapped to by . Note that similar arguments may also be applied to the level-1 representations involving the trinet to depicted in Fig. 4. We therefore invoke parsimony and focus for the remainder of this paper on the trinets , and . We shall refer to them as fork on , triplet , and tricycle , respectively.

The next result (Lemma 3.1) relates forks, triplets and tricycles with symbolic 3-dissimilarities. To state it, we say that a symbolic 3-dissimilarity satisfies the Helly-type Property if, for any three elements , we have . Note that we will sometimes also refer to the Helly-type property as Property (P1).

Lemma 3.1.

Suppose is a symbolic 3-dissimilarity on a set taking values in . Then there exists a level-1 representation of if and only if satisfies the Helly-type Property. In that case can be (uniquely) chosen to be semi-discriminating and, (up to permutation of the leaves of the underlying level-1 network ) is isomorphic to one of the trinets , and depicted in Fig. 4.

Proof.

Suppose first that is a level-1 representation of . Then, in view of Lemma 2.1, must hold.

Conversely, suppose that holds for all elements distinct. By analyzing the size of it is straight-forward to show that one of the situations indicated in the rightmost column of Table 1

1 fork
3
2
2
Table 1. For a symbolic 3-dissimilarity we list all labelled trinets on in terms of the size of .

must apply. With defining a labelling map in the obvious way using the second column of that table, it follows that is a level-1 representation for . ∎

Armed with Lemma 3.1, we make the following central definition. Suppose that , that is a symbolic 3-dissimilarity on , and that is a semi-discriminating level-1 representation of . Then we call a -fork if is a fork on , a -triplet if is a triplet on , and a -tricycle if is a tricycle on . For ease of terminology, we will collectively refer to all three of them as a -trinet. Note that as the example of the labelled trinet depicted in Fig. 3(iii) shows, there exist trinets that are not -trinets. By abuse of terminology, we shall refer for a symbolic 3-dissimilarity on and any 3-subset to a -trinet as a -trinet.

4. Recognizing cycles: The algorithm Find-Cycles

In this section, we introduce and analyze algorithm Find-Cycles (see Algorithm 1 for a pseudo-code version). Its purpose is to recognize cycles in a level-1 representation of a symbolic 3-dissimilarity if such a representation exists. As we shall see, this algorithm relies on Property (P1) and a certain graph that can be canonically associated to . Along the way, we also establish two further crucial properties enjoyed by a level-1 representable symbolic 3-dissimilarity.

We start by introducing further terminology. Suppose is a level-1 network and is a cycle of . Then we denote by the unique vertex in for which both children are also contained in and call it the root of . In addition, we call the hybrid vertex of contained in the hybrid of and denote it by . Furthermore, we denote the set of all elements of below by and the set of all elements of below by . Clearly, . Moreover, for any leaf , we denote by the last ancestor of in . Note that is the parent of if and only if is incident with a vertex in . Last but not least, we call the vertex sets of the two edge-disjoint directed paths from to the sides of . Denoting these two paths by and , respectively, we say that two leaves and in lie on the same side of if the vertices and are both interior vertices of or , and that they lie on different sides if they are not. For example, for the underlying cycle of the cycle indicated in the labelled network pictured in Fig. 1(i), we have and . Furthermore, the sides of are and and lie on one side of whereas and lie on different sides of .

Suggested by Property (U2), the following property is of interest to us where denotes again a symbolic 3-dissimilarity on :

  (P2) For all distinct for which holds there exists exactly one subset of size such that a tricycle on underlies a level-1 representation of .

As a first result, we obtain

Lemma 4.1.

Suppose is a level-1 representable symbolic 3-dissimilarity on . Then satisfies the Helly-type Property as well as Property (P2).

Proof.

Note first that Property (P1) is a straight-forward consequence of Lemma 2.1.

To see that Property (P2) holds, note first that since is level-1 representable there exists a labelled level-1 network such that , for all subsets of size or . Suppose distinct are such that . To see that there exists some for which is a -tricycle, assume for contradiction that there exists no such set . By Theorem 2.2, cannot be a phylogenetic tree on and, so, must contain at least one cycle . Without loss of generality, we may assume that , and lies on one of the two sides of . By assumption and so either and lie on opposite sides of , or and lie on the same side of and lies on the directed path from to . As can be easily checked, either one of these two cases yields a contradiction since then cannot hold for , as required.

To see that there can exist at most one such tricycle on , assume for contradiction that there exist two tricycles and with . Then . Choose . Note that the assumption on the elements of implies that or must be below the hybrid vertex of one of and but not the other. Without loss of generality we may assume that is below the hybrid vertex of but not below the hybrid vertex of . Then must lie on a side of the unique cycle of . But this is impossible since the unique cycle of and are induced by the same cycle of . ∎

We remark in passing that the proof of uniqueness in the proof of Lemma 4.1 combined with the structure of a level-1 network, readily implies the following result.

Lemma 4.2.

Suppose that is a symbolic 3-dissimilarity on that is level-1 representable by a labelled network and that are three distinct elements such that is a -tricycle. Let denote the unique cycle in such that and , and let . If is a -tricycle then and if is a -tricycle then and and lie on the same side of .

To better understand the structure of a symbolic 3-dissimilarity , we next associate to a graph defined as follows. The vertices of are the -tricycles and any two -tricycles and are joined by an edge if . For example, consider the symbolic 3-dissimilarity induced by the labelled level-1 network pictured in Fig. 1(i). Then the graph presented in Fig. 5 is .

Figure 5. The graph , where is the labelled level-1 network depicted in Fig. 1(i).

The example in Fig. 5 suggests the following property for a symbolic 3-dissimilarity to be level-1 representable:

  (P3) If and are -tricycles contained in the same connected component of , then

We collect first results concerning Property (P3) in the next proposition.

Proposition 4.3.

Suppose is a symbolic 3-dissimilarity. If is level-1 representable or holds then Property (P3) must hold. In particular, if is a level-1 representation for then there exists a canonical injective map from the set of connected components of to the set of cycles of the level-1 network underlying .

Proof.

Suppose first that is level-1 representable. Let denote a level-1 representation of . Then . Since holds for all cycles of , and any and any that lie on different sides of , Property (P3) follows.

Suppose next that . It suffices to show that Property (P3) holds for any two adjacent vertices of . Suppose and are two such vertices and that are such that . Then there exists some such that either or where . Without loss of generality we may assume that . In view of Table 1, we clearly have . Since, in addition, holds in the former case it follows that . In the latter case, we obtain and thus, follows in this case too as .

The claimed injective map is a straight-forward consequence of Lemma 4.2. ∎

Algorithm Find-Cycles exploits the injection mentioned in Proposition 4.3 by interpreting for a symbolic 3-dissimilarity a connected component of in terms of two sets and . Note that if is a cycle in the level-1 network underlying a level-1 representation of (if such a representation exists!), the sets and coincide and holds.

Input: A symbolic 3-dissimilarity on .
Output: A number and pairs of subsets of , , or the statement “ is not level-1 representable”.
1 if satisfies Property (P1) then
2       Build the graph ;
3       Denote by the number of connected components of ;
4       for do
5             Let denote a connected component of ;
6             set ;
7             set ;
8       end for
9       return ;
10 else
11       return “ is not level-1 representable”;
12 end if
Algorithm 1 Find-Cycles – Property (P1) is checked in Line 1.
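The control flow of Algorithm 1 can be sketched as follows. This is a skeleton under stated assumptions rather than an implementation of Find-Cycles: the outcome of the Property (P1) check, the list of tricycles, the adjacency rule of the associated graph and the construction of the two sets attached to each connected component are all supplied by the caller, since their precise definitions are given in the text and are not re-derived here. The names `connected_components`, `find_cycles_skeleton`, `adjacent` and `summarize` are hypothetical.

```python
def connected_components(vertices, adjacent):
    """Group `vertices` into connected components of the graph in which
    two vertices u, w are joined whenever adjacent(u, w) is true.
    `adjacent` is assumed to be symmetric; vertices must be hashable."""
    vertices = list(vertices)
    seen = set()
    components = []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            u = stack.pop()
            component.append(u)
            for w in vertices:
                if w not in seen and adjacent(u, w):
                    seen.add(w)
                    stack.append(w)
        components.append(component)
    return components


def find_cycles_skeleton(satisfies_p1, tricycles, adjacent, summarize):
    """Mirror the branching of Algorithm 1.

    satisfies_p1: bool, result of the Property (P1) check (caller-supplied).
    summarize:    caller-supplied function turning one connected component
                  into the data returned for it (hypothetical placeholder).
    Returns None in place of the statement "not level-1 representable".
    """
    if not satisfies_p1:
        return None
    components = connected_components(tricycles, adjacent)
    return len(components), [summarize(k) for k in components]
```

In a toy run one might encode each tricycle by its leaf set, e.g. `frozenset("abc")`, and join two of them when they share two leaves; that rule and the choice of `summarize` as the union of leaf sets are placeholders for illustration only. The quadratic neighbour scan keeps the sketch short and is not tuned for large inputs.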

For example, for the symbolic 3-dissimilarity induced by the labelled network depicted in Fig. 1(i), algorithm Find-Cycles returns the three pairs , and where we write for a set .

5. Constructing cycles: The algorithm Build-Cycles

We next turn our attention toward reconstructing a structurally very simple level-1 representation of a symbolic 3-dissimilarity (should such a representation exist). For this, we use algorithm Build-Cycles which takes as input a symbolic 3-dissimilarity and a pair returned by Find-Cycles when given .

To state Build-Cycles, we require further terminology. Suppose is a level-1 network. Then we say that is partially resolved if all vertices in a cycle of have degree three. Note that partially-resolved level-1 networks may have interior vertices not contained in a cycle that have degree three or more. Thus such networks need not be binary. If, in addition to being partially resolved, is such that it contains a unique cycle such that every non-leaf vertex of is a vertex of then we call simple.

Algorithm Build-Cycle (see Algorithm 2 for a pseudo-code version) relies on a further graph called the TopDown graph associated to a symbolic 3-dissimilarity . For a pair returned by algorithm Find-Cycles when given and and , that graph essentially orders the vertices of . Thus, for each connected component of , Build-Cycle computes a level-1 representation of corresponding to (should such a representation exist).

We start with presenting a central observation concerning labelled level-1 networks.

Lemma 5.1.

Suppose is a labelled level-1 network, and is a cycle of . Suppose also that are three elements such that , and . Then, lies on the directed path from to if and only if is a -triplet.

Proof.

Put . Suppose first that lies on the directed path from to . Then and . Hence, . By Table 1, is a -triplet.

Conversely, suppose that is a -triplet. Then, by Table 1, we have . Since and , it follows that . But then and must lie on the same side of as otherwise follows which is impossible by assumption on , and . Thus, either must lie on a directed path from to or must lie on a directed path from to . However cannot be a vertex on as otherwise holds and, so, follows, which is impossible. Thus must be a vertex on . ∎

With and as in Lemma 5.1, it follows from Lemma 4.2 that whenever algorithm Find-Cycles is given as input, it returns a pair such that and . Moreover, giving and as input to algorithm Build-Cycle, Lemma 5.1 implies that Build-Cycle finds all elements for which there exists some such that lies on the path from to . However, it should be noted that if is such that holds for all vertices on the path from to then the information captured by for , , and is in general not sufficient to decide if and lie on the same side of or not. In fact, it is easy to see that, in general, need not even hold.

We now turn our attention to the aforementioned TopDown graph associated to a symbolic 3-dissimilarity on which is defined as follows. Suppose that , and that . Then the vertex set of the TopDown graph is and two elements distinct are joined by a directed edge if is a -triplet.

Figure 6. For the symbolic -dissimilarity induced by the labelled network pictured in Fig. 1(i), we depict in (a) the graph and in (b) the graph . In both graphs, the vertices are indicated by “”. – See text for details.
Input: A symbolic 3-dissimilarity on that satisfies Property (P1) and a pair returned by algorithm Find-Cycles when given .
Output: Either a labelled simple level-1 network on a partition of a subset of such that and holds for the unique cycle of , or the statement “ is not level-1 representable”.
1 set rep=0;
2 Choose a -tricycle , where and ;
3 set ;
4 set ;
5 Initialize as a graph with three vertices respectively labelled by , and , and the edge ;
6 if for all , and , is a -tricycle and  then
7       set ;
8       set ;
9       set ;
10       if for all  then
11             for  do
12                   set ;
13                   if  for all and does not contain a directed cycle then
14                         set ;
15                         set rep=rep+1;
16                         while  do
17                               Add a new child to ;
18                               set ;
19                               Delete from all vertices in ;
20                               if for all , ,  then
21                                     Choose some ;
22                                     set ;
23                                     Add the leaf as a child of ;
24                                     set ;
25                                    
26                               end if
27                              else
28                                     Remove all vertices from ;
29                                     set rep=rep-1;
30                                    
31                               end if
32                              
33                         end while
34                        Add the edge ;
35                        
36                   end if
37                  
38             end for
39            
40       end if
41      
42 end if
43 if rep=2 then
44       return ;
45      
46 end if
47 else
48       return is not level-1 representable;
49      
50 end if
Algorithm 2 Build-Cycle – The set is the set , Property (P4) is checked in Lines 6, 10, and 20, and Properties (P3), (P6), (P7) and (P8) are checked in Lines 6, 13, 10 and 20, respectively.– See text for details.

Rather than continuing with our analysis of algorithm Build-Cycle, we break for the moment and illustrate it by means of an example. For this we return again to the symbolic 3-dissimilarity on induced by the labelled level-1 network depicted in Fig. 1(i). Suppose is a pair returned by algorithm Find-Cycles and is the -tricycle chosen in line 2 of Build-Cycle. Then , and (lines 3 and 4), and and (lines 8 and 9). The graph is depicted in Fig. 6(a). It implies that for the cycle associated to the pair in a level-1 representation of , we must have and that one of the two sides of is . Since , the other side of is (lines 11 to 33).

Continuing with our analysis of algorithm Build-Cycle, we remark that the fact that the TopDown graph in the previous example is non-empty is not a coincidence. In fact, it is easy to see that the graph defined in line 14 of Build-Cycle is non-empty whenever is level-1 representable. Thus, the DAG returned by algorithm Build-Cycle cannot contain multi-arcs. Note however that there might be tricycles induced by of the form with as, for example, might hold and thus is not a -tricycle. Note that similar reasoning also applies to and the extensions of and to and defined in lines 8 and 9, respectively. Also note that the sets and are dependent on the choice of the -tricycle in line 2. However, line 6 ensures that the labelled simple level-1 network returned by algorithm Build-Cycle is independent of the choice of that -tricycle.

To establish Proposition 5.3, which ensures that algorithm Build-Cycle terminates, we next associate to a directed graph a new graph by successively removing vertices of indegree zero and their incident edges until no such vertices remain. As a first, almost trivial, observation concerning that graph we have the following straightforward result, whose proof we again omit.

Lemma 5.2.

Let be a directed graph. Then is nonempty if and only if contains a directed cycle.
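The reduction described before Lemma 5.2 can be sketched in a few lines of Python. This is only an illustration: the function name `peel_sources` and the representation of a directed graph as a set of vertices together with a set of ordered pairs are assumptions made here and are not notation used in the text.

```python
from collections import defaultdict


def peel_sources(vertices, edges):
    """Successively remove vertices of indegree zero and their incident edges.

    Returns the set of vertices that survive. In line with Lemma 5.2, this
    set is nonempty exactly when the input graph contains a directed cycle.
    """
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    remaining = set(vertices)
    sources = [v for v in vertices if indegree[v] == 0]
    while sources:
        u = sources.pop()
        remaining.discard(u)
        # Deleting u also deletes its outgoing edges.
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                sources.append(v)
    return remaining


# An acyclic graph is peeled away completely.
dag_rest = peel_sources({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("a", "c")})

# A graph with the directed cycle b -> c -> d -> b keeps that cycle and
# every vertex reachable from it.
cyclic_rest = peel_sources(
    {"a", "b", "c", "d", "e"},
    {("a", "b"), ("b", "c"), ("c", "d"), ("d", "b"), ("d", "e")},
)
```

In the second toy graph the vertex `e` survives although it lies on no cycle, because its only incoming edge starts on the cycle. The lemma only concerns whether the reduced graph is empty, not which vertices it contains.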

Given as input to algorithm Build-Cycle a symbolic 3-dissimilarity that satisfies Property (P1) and a pair returned by algorithm Find-Cycles for , we have the following.

Proposition 5.3.

Algorithm Build-Cycle terminates.

Proof.

As is easy to check, the only reason for algorithm Build-Cycle not to terminate is the while loop initiated in its line 16. For , this while loop works by successively removing vertices of indegree 0 (and their incident edges) from the graph , and terminates if the resulting graph, i.e. , is empty. Since line 13 ensures that this loop is entered if and only if does not contain a directed cycle, Lemma 5.2 implies that Build-Cycle terminates. ∎
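The termination argument can be made concrete with a small Python sketch of a loop of this kind. It is an illustration under stated assumptions and not a transcription of lines 13 to 33 of Build-Cycle: the names `has_directed_cycle` and `peel_in_rounds`, as well as the input format, are hypothetical. The guard plays the role of the check in line 13, and each round removes at least one vertex of an acyclic graph, so the number of rounds is bounded by the number of vertices.

```python
def has_directed_cycle(vertices, edges):
    """Depth-first search for a back edge (hypothetical helper)."""
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)
    state = {v: 0 for v in vertices}  # 0 = unseen, 1 = on stack, 2 = done

    def visit(u):
        state[u] = 1
        for v in successors[u]:
            if state[v] == 1 or (state[v] == 0 and visit(v)):
                return True
        state[u] = 2
        return False

    return any(state[v] == 0 and visit(v) for v in vertices)


def peel_in_rounds(vertices, edges):
    """Remove all current indegree-zero vertices, round by round.

    Returns the list of removed vertex sets, or None if the guard fails.
    """
    if has_directed_cycle(vertices, edges):  # guard: only enter if acyclic
        return None
    remaining, live_edges, rounds = set(vertices), set(edges), []
    while remaining:
        targets = {v for (_, v) in live_edges}
        sources = remaining - targets  # nonempty, since the graph is acyclic
        rounds.append(sources)
        remaining -= sources
        live_edges = {(u, v) for (u, v) in live_edges if u not in sources}
    return rounds


rounds = peel_in_rounds(
    {"a", "b", "c", "d"},
    {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")},
)
blocked = peel_in_rounds({"x", "y"}, {("x", "y"), ("y", "x")})
```

Without the guard, the loop would stall on the second input, since neither `x` nor `y` ever attains indegree zero.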

It is straightforward to see that, when given a level-1 representable symbolic 3-dissimilarity such that the underlying level-1 network is in fact a simple level-1 network, the labelled network returned by algorithm Build-Cycle satisfies the following three additional properties (where we use the notation introduced in algorithm Build-Cycle).

  (P4) For , we have and .

  (P5) For all and all , we have .

  (P6) For all and , the graphs and are isomorphic and do not contain a directed cycle.

Since the quantities on which these properties are based also exist for general symbolic 3-dissimilarities, we next study Properties (P4)–(P6) for such dissimilarities. As a first consequence of Property (P4) combined with Properties (P1) and (P2), we obtain a sufficient condition under which the TopDown graph considered in algorithm Build-Cycle does not contain a directed cycle (line 13). For convenience, we again employ the notation used in Algorithm 2.

Proposition 5.4.

Suppose that is a symbolic 3-dissimilarity that satisfies Properties (P1), (P2) and (P4), that is a pair returned by algorithm Find-Cycles when given , and that , and are as specified in line 2 of algorithm Build-Cycle. Then the following hold for .
(i) If contains a directed cycle then it contains a directed cycle of length 3.
(ii) does not contain a directed cycle of length 3 whenever holds.

Proof.

(i) By symmetry, it suffices to show the proposition for . Suppose contains a directed cycle. Over all such cycles in , choose a directed cycle of minimal length. If , then the statement clearly holds.
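Choosing a directed cycle of minimal length has a consequence that the remainder of the argument relies on: no edge of the graph can join two vertices of such a cycle unless they are consecutive on it, since such an edge would close a shorter directed cycle. The following Python sketch illustrates this on a toy graph. The function names `shortest_directed_cycle` and `shortcut_edges` are hypothetical, the toy graph is not a TopDown graph from the text, and the breadth-first search is just one simple way to find a minimal cycle in a general directed graph.

```python
from collections import deque


def shortest_directed_cycle(vertices, edges):
    """Return a minimal-length directed cycle as a vertex list, or None."""
    successors = {v: [] for v in vertices}
    for u, v in edges:
        successors[u].append(v)

    best = None
    for start in vertices:
        parent = {start: None}
        queue = deque([start])
        found = None
        while queue and found is None:
            u = queue.popleft()
            for v in successors[u]:
                if v == start:  # closed a cycle through start
                    found = u
                    break
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        if found is None:
            continue
        cycle = []
        while found is not None:
            cycle.append(found)
            found = parent[found]
        cycle.reverse()  # start, ..., last vertex before returning to start
        if best is None or len(cycle) < len(best):
            best = cycle
    return best


def shortcut_edges(cycle, edges):
    """Edges between cycle vertices that are not consecutive on the cycle."""
    k = len(cycle)
    position = {v: i for i, v in enumerate(cycle)}
    return {
        (u, v)
        for (u, v) in edges
        if u in position and v in position
        and position[v] != (position[u] + 1) % k
    }


# A directed 5-cycle 1 -> 2 -> 3 -> 4 -> 5 -> 1 with the extra edge 3 -> 1.
toy_vertices = [1, 2, 3, 4, 5]
toy_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (3, 1)]

minimal = shortest_directed_cycle(toy_vertices, toy_edges)
shortcuts_minimal = shortcut_edges(minimal, toy_edges)
shortcuts_long = shortcut_edges([1, 2, 3, 4, 5], toy_edges)
```

Here the 5-cycle is not minimal precisely because the edge from 3 to 1 shortcuts it, whereas the minimal cycle on the vertices 1, 2 and 3 admits no such edge.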

For the remainder, suppose for contradiction that . Suppose are such that , , are three directed edges in . We next distinguish between the cases that and that .

Suppose . Then since , Lemma 4.2 combined with the minimality of implies that we either have a -fork on or the -triplet . Hence, holds in either case. Note that similar arguments also imply that . Since , the directed edges and cannot be contained in and, using again similar arguments as before, must hold. In combination, we obtain which is impossible in view of being an edge in and thus .

Suppose . By the minimality of , neither , nor can be a directed edge in . Using similar arguments as in the previous case, it follows that and . Combined with the facts that , , are directed edges in and that must also be an edge in as , it follows that with and we have

(1)

Note that must also hold, as otherwise and so, in view of Table 1, would be level-1 representable by a -tricycle on . But then , which is impossible in view of Property (P4). Similarly, one can show that . By combining a case analysis as indicated in Table 1 with Equation (1), it is straightforward to see that each of the four detailed combinations of and in that table yields a contradiction in view of Property (P2).

(ii) By symmetry, it suffices to assume . Let and assume for contradiction that