# Construction of Gene and Species Trees from Sequence Data incl. Orthologs, Paralogs, and Xenologs

Marc Hellmuth
University of Greifswald
Department of Mathematics and Computer Science
Walther-Rathenau-Strasse 47,
D-17487 Greifswald, Germany,
and
Saarland University
Center for Bioinformatics
Building E 2.1, P.O. Box 151150,
D-66041 Saarbrücken, Germany
Email: mhellmuth@mailbox.org
Nicolas Wieseke
Leipzig University
Parallel Computing and Complex Systems Group
Department of Computer Science
Augustusplatz 10, D-04109 Leipzig, Germany
Email: wieseke@informatik.uni-leipzig.de

###### Abstract

Phylogenetic reconstruction aims at finding plausible hypotheses of the evolutionary history of genes or species based on genomic sequence information. The distinction of orthologous genes (genes that share a common ancestry and diverged after a speciation event) is crucial and lies at the heart of many genomic studies. However, existing methods that rely only on 1:1 orthologs to infer species trees are restricted to a small set of genes that provide information about the species tree. The use of larger gene sets that additionally contain non-orthologous genes (e.g. so-called paralogous or xenologous genes) considerably increases the information about the evolutionary history of the respective species. In this work, we introduce a novel method to compute species phylogenies based on sequence data including orthologs, paralogs or even xenologs.

## 1 Introduction

Sequence-based phylogenetic approaches heavily rely on initial data sets to be composed of 1:1 orthologous sequences only. To this end alignments of protein or DNA sequences are employed whose evolutionary history is believed to be congruent to that of the respective species, a property that can be ensured most easily in the absence of gene duplications or horizontal gene transfer. Phylogenetic studies thus judiciously select families of genes that rarely exhibit duplications (such as rRNAs, most ribosomal proteins, and many of the housekeeping enzymes). In the presence of gene duplications, however, it becomes necessary to distinguish between the evolutionary history of genes (gene trees) and the evolutionary history of the species (species trees) in which these genes reside.

Recent advances in mathematical phylogenetics, based on the theory of symbolic ultrametrics [Boeckner:98], have indicated that gene duplications can also convey meaningful phylogenetic information provided orthologs and paralogs can be distinguished with a degree of certainty [HHH+13, HLS+15, HHH+12].

Here, we examine a novel approach and explain the conceptual steps for the inference of species trees based on the knowledge of orthologs, paralogs or even xenologs [HHH+13, HLS+15, HHH+12].

## 2 Preliminaries

We give here a brief summary of the main definitions and concepts that are needed.

#### Graphs, Gene Trees and Species Trees

An (undirected) graph $G$ is a pair $(V,E)$ with non-empty vertex set $V$ and edge set $E$ containing two-element subsets of $V$. A class of graphs that will play an important role in this contribution are cographs. A graph $G$ is a cograph iff $G$ does not contain an induced path $P_4$ on four vertices, see [Corneil:81, Corneil:85] for more details.

A tree $T=(V,E)$ is a connected, cycle-free graph. We distinguish two types of vertices in a tree: the leaves, which are contained in only one edge, and the inner vertices, which are contained in at least two edges. In order to avoid uninteresting trivial cases, we will usually assume that $T$ has at least three leaves.

A rooted tree is a tree in which one special (inner) vertex is selected to be the root. The least common ancestor $\mathrm{lca}_T(x,y)$ of two vertices $x$ and $y$ in a rooted tree $T$ is the first (unique) vertex that lies on both the path from $x$ to the root and the path from $y$ to the root. We say that a tree $T$ contains the triple $(xy|z)$ if $x$, $y$ and $z$ are leaves of $T$ and the path from $x$ to $y$ does not intersect the path from $z$ to the root of $T$. A set of triples $R$ is consistent if there is a rooted tree that contains all triples in $R$.

An event-labeled tree, usually denoted by the pair $(T,t)$, is a rooted tree $T$ together with a map $t$ that assigns to each inner vertex of $T$ an event $m$. For two leaves $x$ and $y$ of an event-labeled tree $(T,t)$ its least common ancestor $\mathrm{lca}_T(x,y)$ is therefore marked with an event $t(\mathrm{lca}_T(x,y)) = m$, which we denote for simplicity by $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} m$.

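In what follows, the set $\mathfrak{S}$ will always denote a set of species and the set $\mathfrak{G}$ a set of genes. We write $x\in X$ if a gene $x\in\mathfrak{G}$ resides in the species $X\in\mathfrak{S}$.

A species tree (for $\mathfrak{S}$) is a rooted tree with leaf-set $\mathfrak{S}$. A gene tree (for $\mathfrak{G}$) is an event-labeled tree $(T,t)$ that has as leaf-set $\mathfrak{G}$.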

We refer the reader to [sem-ste-03a] for an overview and important results on phylogenetics.

#### Binary Relations and Their Graph- and Tree-Representations

A (binary) relation $R$ over (an underlying set) $V$ is a subset of $V\times V$. We will write $\lfloor R\rfloor^{\mathrm{irr}} := R\setminus\{(v,v)\mid v\in V\}$ to denote the irreflexive part of $R$.

Each relation $R$ has a natural representation as a graph $G_R=(V,E_R)$ with vertex set $V$ and edges connecting two vertices whenever they are in relation $R$. In what follows, we will always deal with irreflexive symmetric relations, which we call for simplicity just relations. Therefore, the corresponding graphs $G_R$ can be considered as undirected graphs without loops, that is, $\{x\}\notin E_R$ and, additionally, $\{x,y\}\in E_R$ iff $(x,y)\in R$ (and thus, $(y,x)\in R$).

While graph-representations of $R$ are straightforward and defined for all binary relations, tree-representations of $R$ are more difficult to derive and, more importantly, not every binary relation has a tree-representation. For each tree representing a relation $R$ over $V$, the leaf-set is $V$ and a specific event-label is chosen so that the least common ancestor of two distinct elements $x,y\in V$ is labeled in a way that uniquely determines whether $(x,y)\in R$ or not. That is, an event-labeled tree $(T,t)$ with events “0” and “1” on its inner vertices represents a (symmetric irreflexive) binary relation $R$ if for all $(x,y)\in\lfloor V\times V\rfloor^{\mathrm{irr}}$ it holds that $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} 1$ if and only if $(x,y)\in R$.

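The latter definitions can easily be extended to arbitrary disjoint (irreflexive symmetric) relations $R_1,\dots,R_n$ over $V$: An edge-colored graph $G_{R_1,\dots,R_n}$ represents the relations $R_1,\dots,R_n$ if it holds that $(x,y)\in R_i$ if and only if $\{x,y\}$ is an edge of $G_{R_1,\dots,R_n}$ and this edge is colored with “$i$”. Analogously, an event-labeled tree $(T,t)$ with events “0” and “$1,\dots,n$” on its inner vertices represents the relations $R_1,\dots,R_n$ if for all $(x,y)\in\lfloor V\times V\rfloor^{\mathrm{irr}}$ it holds that $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} i$ if and only if $(x,y)\in R_i$, $1\le i\le n$. The latter implies that for all pairs $(x,y)$ that are in none of the relations $R_i$ we have $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} 0$.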

In practice, the disjoint relations $R_1,\dots,R_n$ correspond to the evolutionary relationships between genes contained in $V=\mathfrak{G}$, e.g. the disjoint relations $R_o$ and $R_p$ that comprise the pairs of orthologous and paralogous genes, respectively.

#### Paralogy, Orthology, and Xenology

The current flood of genome sequencing data poses new challenges for comparative genomics and phylogenetics. An important topic in this context is the reconstruction of large families of homologous proteins, RNAs, and other genetic elements. The distinction between orthologs, paralogs, and xenologs is a key step in any research program of this type. The distinction between orthologous and paralogous gene pairs dates back to the 1970s: two genes whose least common ancestor in the gene tree corresponds to a duplication are paralogs; if the least common ancestor was a speciation event and the genes are from different species, they are orthologs [Fitch:70]. The importance of this distinction is two-fold: On the one hand, it is informative in genome annotation and, on the other hand, the orthology (or paralogy) relation conveys information about the events corresponding to internal nodes of the gene tree [HHH+13] and about the underlying species tree [HLS+15, HHH+12]. We are aware of the controversy about the distinction between orthologous and paralogous genes and their consequence in the context of gene function, however, we adopt here the point of view that homology, and therefore also orthology and paralogy, refer only to the evolutionary history of a gene family and not to its function [GB00, GK13].

In contrast to orthology and paralogy, the definition of xenology is less well established and by no means consistent in the biological literature. Xenology is defined in terms of horizontal gene transfer (HGT), that refers to the transfer of genes between organisms in a manner other than traditional reproduction and across species. The most commonly used definition stipulates that two genes are xenologs if their history since their common ancestor involves horizontal gene transfer of at least one of them [Fitch2000, Jensen:01]. In this setting, both orthologs and paralogs may at the same time be xenologs [Jensen:01]. Importantly, the mathematical framework established for evolutionary “event”-relations, as the orthology relation [Boeckner:98, HHH+13], naturally accommodates more than two types of events associated with the internal nodes of the gene tree. It is appealing, therefore, to think of a HGT event as different from both speciation and duplication, in line with [Gray:83] where the term “xenologous” was originally introduced.

In this contribution, we therefore will consider a slight modification of the terms orthologs, paralogs and xenologs, so-called $\mathrm{lca}$-orthologs, $\mathrm{lca}$-paralogs and $\mathrm{lca}$-xenologs. To this end, note that for a set of genes $\mathfrak{G}$, the evolutionary relationship between two homologous genes contained in $\mathfrak{G}$ is entirely explained by the true evolutionary gene-history of these genes. More precisely, if $T$ is a (known) tree reflecting the true gene-history together with the events that happened, that is, the labeling $t$ that tags the inner vertices of $T$ as a speciation, duplication or HGT event, respectively, then we can determine the three disjoint relations $R_o$, $R_p$ and $R_x$ comprising the pairs of so-called $\mathrm{lca}$-orthologous, $\mathrm{lca}$-paralogous and $\mathrm{lca}$-xenologous genes, respectively, as follows: Two genes $x,y\in\mathfrak{G}$ are

• $\mathrm{lca}$-orthologous, if $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} \text{speciation}$;

• $\mathrm{lca}$-paralogous, if $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} \text{duplication}$ and

• $\mathrm{lca}$-xenologous, if $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} \text{HGT}$.

The latter also implies the edge-colored graph representation $G_{R_o,R_p,R_x}$, see Figure 1 for an illustrative example.

In the absence of horizontal gene transfer, the relations $\mathrm{lca}$-orthologs and $\mathrm{lca}$-paralogs are equivalent to orthologs and paralogs as defined by Fitch [Fitch2000].
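The definition above can be illustrated with a minimal sketch that reads the three relations off a known event-labeled tree. The nested-tuple tree encoding, the event names and the function names are assumptions made for this illustration only and are not taken from any tool mentioned in the text.

```python
from itertools import combinations

# Hypothetical toy encoding (an assumption of this sketch):
# an inner vertex is a pair (event, [children]); a leaf is a gene name (str).
SPECIATION, DUPLICATION, HGT = "speciation", "duplication", "hgt"


def leaves(node):
    """Return the list of leaf names below `node`."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [leaf for child in children for leaf in leaves(child)]


def event_relations(tree):
    """Map each event to the set of unordered gene pairs whose lca carries it."""
    relations = {SPECIATION: set(), DUPLICATION: set(), HGT: set()}

    def visit(node):
        if isinstance(node, str):
            return
        event, children = node
        child_leaves = [leaves(child) for child in children]
        # Two leaves from different child subtrees have `node` as their lca.
        for left, right in combinations(child_leaves, 2):
            for x in left:
                for y in right:
                    relations[event].add(frozenset((x, y)))
        for child in children:
            visit(child)

    visit(tree)
    return relations


# Toy gene tree: a speciation root, below it a duplication, below it a speciation.
tree = (SPECIATION, ["c1", (DUPLICATION, ["d1", (SPECIATION, ["a1", "b1"])])])
rel = event_relations(tree)
print(sorted(sorted(pair) for pair in rel[DUPLICATION]))
```

In this toy tree, `d1` is paralogous to `a1` and `b1`, while all other pairs are orthologous; unordered pairs are used because the relations are symmetric.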

We are aware of the fact that this definition of $\mathrm{lca}$-“events” leads to a loss of information of the direction of the HGT event, i.e., the information of donor and acceptor. However, for the proposed method and to understand the idea of representing estimates of evolutionary relationships in an event-labeled tree this information is not necessarily needed. Nevertheless, generalizations to tree-representations of non-symmetric relations or a mathematical framework for xenologs w.r.t. the notion of Fitch might improve the proposed methods.

###### Remark 1.

If there is no risk of confusion and if not stated differently, we call $\mathrm{lca}$-orthologs, $\mathrm{lca}$-paralogs, and $\mathrm{lca}$-xenologs simply orthologs, paralogs and xenologs, respectively.

Clearly, evolutionary history and the events of the past cannot be observed directly and hence, must be inferred, using algorithmic and statistical methods, from the genomic data available today. Therefore, we can only deal with estimates $\widehat{R_o}$, $\widehat{R_p}$ and $\widehat{R_x}$ of the relations $R_o$, $R_p$ and $R_x$. In this contribution, we use those estimates to reconstruct (a hypothesis of) the evolutionary history of the genes and, eventually, the history of the species the genes reside in.

We wish to emphasize that the three relations $R_o$, $R_p$ and $R_x$ (will) serve as illustrative examples and the cases $R_p=\emptyset$ or $R_x=\emptyset$ are allowed. In practice, it is possible to have more than these three relations. By way of example, the relation $R_p$ containing the pairs of paralogous genes might be more refined, since gene duplications have several different mechanistic causes that are also empirically distinguishable in real data sets. Thus, instead of having a single relation $R_p$ that comprises all paralogs, we could have different types of paralogy relations that distinguish between events such as local segmental duplications, duplications by retrotransposition, or whole-genome duplications [Zhang:03].

## 3 From Sequence Data to Species Trees

In this section, we provide the main steps in order to infer event-labeled gene trees and species trees from respective estimated event-relations. An implementation of these steps by means of integer linear programming is provided in the software tool ParaPhylo [HLS+15].

The starting point of this method is an estimate $\widehat{R_o}$ of the (true) orthology relation $R_o$. From this estimate the necessary information of the event-labeled gene trees and the respective species trees will be derived.

### 3.1 Orthology Detection

The inference of the orthology relation $R_o$ lies at the heart of many reconstruction methods. Orthology inference methods can be classified, based on the methodology they use to infer orthology, into tree-based and graph-based methods; for an overview see e.g. [Altenhoff:09, DAAGD2013, gabaldon:08, KWMK:11, T+11].

Tree-based orthology inference methods rely on the reconciliation of a constructed gene tree (without event-labeling) from an alignment of homologous sequences and a given species tree, see e.g. [Arvestad03072003, SPJ:11, Hubbard+07, HBNH:07, WPFR:07].

Although tree-based approaches are often considered very accurate given a species tree, they suffer from high computational costs and are hence limited in practice to a moderate number of species and genes. A further limitation of those tree-reconciliation methods is that for many scenarios the species tree is not known with confidence and, in addition, all practical issues that complicate phylogenetic inference (e.g. variability of duplication rates, mistaken homology, or HGT) limit the accuracy of both the gene and the species trees.

Intriguingly, with graph-based orthology inference methods it is possible in practice to detect the pairs of orthologous genes with acceptable accuracy without constructing either gene or species trees. Many tools of this type have become available over the last decade, to name only a few: COG [TG+00], OMA [SDG:07, ASGD:11], eggNOG [JJK+08], OrthoMCL [li2003orthomcl, CMSR:06], InParanoid [inparanoid:10], Roundup 2.0 [DeLuca12], EGM2 [Mahmood30122011], and ProteinOrtho [Lechner:11a] with its extension PoFF [Lechner:14]. Graph-based methods detect orthologous genes for two (pairwise) or more (multiple) species. These methods consist of a graph construction phase and, in some cases, a clustering phase [T+11]. In the graph construction phase, a graph is inferred where vertices represent genes, and (weighted) edges represent the (confidence of) orthology relationships. The latter rely on pairwise sequence similarities (e.g., basic local alignment search tool (BLAST) or Smith-Waterman) calculated between all sequences involved and an operational definition of orthology, for example, reciprocal best hit (RBH), bi-directional best hit (BBH), symmetrical best hit (SymBeT) or reciprocal smallest distance (RSD). In the clustering phase, clusters or groups of orthologs are constructed, using e.g. single-linkage, complete-linkage, spectral clustering or the Markov Cluster algorithm. However, orthology is a symmetric, but not a transitive relation, i.e., it does in general not represent a partition of the set of genes $\mathfrak{G}$. In particular, a set $X$ of genes can be orthologous to another gene $y$ while the genes within $X$ are not necessarily orthologous to each other. In this case, the genes in $X$ are called co-orthologs to gene $y$ [Koonin:05]. It is important to mention that, therefore, the problem of orthology detection is fundamentally different from clustering or partitioning of the input gene set.
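To make the graph construction phase more concrete, the following minimal sketch extracts reciprocal best hits (RBH) from directed similarity scores. It assumes only two species, scores given as a plain dictionary, and ties broken by input order; it is an illustration of the RBH idea and not the procedure of any of the tools listed above.

```python
def reciprocal_best_hits(scores):
    """Return the set of unordered gene pairs that are reciprocal best hits.

    `scores` maps a directed pair (query_gene, target_gene) to a similarity
    score, e.g. a bit score. Assumption of this sketch: all targets of a
    query belong to the one other species under consideration.
    """
    best = {}  # query -> (best target, best score)
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)

    pairs = set()
    for query, (target, _) in best.items():
        # Keep the pair only if the best hit of the target points back.
        if target in best and best[target][0] == query:
            pairs.add(frozenset((query, target)))
    return pairs


# Hypothetical scores between genes a1, a2 (species A) and b1, b2 (species B).
scores = {
    ("a1", "b1"): 90, ("a1", "b2"): 50,
    ("a2", "b2"): 40,
    ("b1", "a1"): 88,
    ("b2", "a1"): 60, ("b2", "a2"): 30,
}
rbh = reciprocal_best_hits(scores)
print(rbh)
```

Here only `a1` and `b1` select each other; `b2` prefers `a1`, so neither `a2` nor `b2` ends up in an RBH pair.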

In addition to OMA and ProteinOrtho, only Synergy, EGM2, and InParanoid attempt to resolve the orthology relation at the level of gene pairs. The latter two tools can only be used for the analysis of two species at a time, while Synergy is not available as a standalone tool and therefore cannot be applied to arbitrary user-defined data sets. In particular, the use of orthology inference tools is often limited to the species offered through the databases published by their authors. An exception is provided by ProteinOrtho [Lechner:11a] and its extension PoFF [Lechner:14], methods that we will use in our approach. These standalone tools are specifically designed to handle large-scale user-defined data and can be applied to hundreds of species containing millions of proteins at once. In particular, such computations can be performed on off-the-shelf hardware [Lechner:11a]. ProteinOrtho and PoFF compare similarities of given gene sequences (the bit score of the BLAST alignment) that, together with an E-value cutoff, yield an edge-weighted directed graph. Based on reciprocal best hits, an undirected subgraph is extracted (graph construction phase) on which spectral clustering methods are applied (clustering phase) to determine significant groups of orthologous genes. To enhance the prediction accuracy, the relative order of genes (synteny) can be used as an additional feature for the discrimination between orthologs and paralogs.

To summarize, graph-based methods have in common that the output is a set of (pairs of) putative orthologous genes. In addition, orthology detection tools often report some weight or confidence value for $x$ and $y$ to be orthologs or not. This gives rise to a symmetric, irreflexive binary relation

$$
\begin{aligned}
\widehat{R_o} &= \{(x,y) \mid x,y\in\mathfrak{G} \text{ are estimated orthologs}\} && (1)\\
&= \{(x,y) \mid \mathrm{lca}_T(x,y) \mathrel{\widehat{=}} \text{speciation (in the estimated gene tree } T)\}. && (2)
\end{aligned}
$$

### 3.2 Construction of Gene Trees

#### Characterization of Evolutionary Event Relations

Assume we have given a “true” orthology relation $R_o$ over $\mathfrak{G}$, i.e., $R_o$ comprises all pairs of “true” orthologs, that is, if the true evolutionary history $(T,t)$ of the genes would be known, then $(x,y)\in R_o$ if and only if $\mathrm{lca}_T(x,y) \mathrel{\widehat{=}} \text{speciation}$. As we will show, given such a true relation without the knowledge of the gene tree $(T,t)$, it is possible to reconstruct the “observable discriminating part” of $(T,t)$ using the information contained in $R_o$, resp., $R_p$ only, at least in the absence of xenologous genes [HHH+13, HW:15]. In the presence of HGT-events, but given the “true” relations $R_o$ and $R_p$, it is even possible to reconstruct $(T,t)$ using the information contained in $R_o$ and $R_p$ only [HHH+13, HW:15]. Note, for the set of pairs of ($\mathrm{lca}$-)xenologs we have

$$R_x = \lfloor \mathfrak{G}\times\mathfrak{G} \rfloor^{\mathrm{irr}} \setminus (R_o \cup R_p).$$

Clearly, since we do not know the true evolutionary history with confidence, we always deal with estimates of these true relations. In order to understand under which conditions it is possible to infer a gene tree that represents disjoint estimates $R_1,\dots,R_n$, we characterize in the following the structure of their graph-representation $G_{R_1,\dots,R_n}$. Note, if $R_1\cup\dots\cup R_n = \lfloor \mathfrak{G}\times\mathfrak{G} \rfloor^{\mathrm{irr}}$, then $G_{R_1,\dots,R_n}$ is a complete edge-colored graph, i.e., for all distinct $x,y\in\mathfrak{G}$ there is an edge $xy$, and $xy$ is colored with “$i$” if and only if $(x,y)\in R_i$.

The following theorem is based on results established by Böcker and Dress [Boeckner:98] and Hellmuth et al. [HHH+13].

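###### Theorem 1 ([Boeckner:98, HHH+13]).

Let $G_{R_1,\dots,R_n}$ be the graph-representation of the relations $R_1,\dots,R_n$ over some set $\mathfrak{G}$. There is an event-labeled gene tree representing $R_1,\dots,R_n$ if and only if

• (i) the graph $G_{R_i}$ is a cograph for all $i\in\{1,\dots,n\}$ and

• (ii) for all three distinct genes $x,y,z\in\mathfrak{G}$ the three edges $xy$, $xz$ and $yz$ in $G_{R_1,\dots,R_n}$ have at most two distinct colors.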

Clearly, in the absence of xenologs and thus, if $R_x=\emptyset$, we can ignore condition (ii), since at most two colors occur in $G_{R_o,R_p,R_x}$. In the latter case, $R_o$, resp., $R_p$ alone provide all information of the underlying gene tree.

Theorem 1 implies that whenever we have estimates $\widehat{R_o}$, $\widehat{R_p}$ or $\widehat{R_x}$ and we want to find a tree that represents these relations we must ensure that neither $G_{\widehat{R_o}}$, $G_{\widehat{R_p}}$ nor $G_{\widehat{R_x}}$ contains an induced path on four vertices and that there is no triangle (a cycle on three vertices) in $G_{\widehat{R_o},\widehat{R_p},\widehat{R_x}}$ where each edge is colored differently. However, due to noise in the data or mispredicted events of pairs of genes, the graph $G_{\widehat{R_o},\widehat{R_p},\widehat{R_x}}$ will usually violate condition (i) or (ii). A particular difficulty arises from the fact that we usually deal with the estimate $\widehat{R_o}$ only, and do not know how to distinguish between the paralogs and xenologs.
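The two requirements just described (no induced path on four vertices in any single-color subgraph, and no triangle with three different colors) can be checked by brute force on small examples. The sketch below assumes a complete edge-coloring given as a dictionary from unordered pairs to color labels; the function names and color labels are hypothetical, and the exhaustive search is only meant for toy instances.

```python
from itertools import combinations, permutations


def has_induced_p4(vertices, edges):
    """True if the graph (vertices, edges) contains an induced path on four vertices."""
    def adj(u, v):
        return frozenset((u, v)) in edges

    for quad in combinations(vertices, 4):
        for a, b, c, d in permutations(quad):
            # a-b-c-d is a path and no other pair among the four is adjacent.
            if (adj(a, b) and adj(b, c) and adj(c, d)
                    and not adj(a, c) and not adj(a, d) and not adj(b, d)):
                return True
    return False


def has_rainbow_triangle(vertices, color):
    """True if some triangle uses three pairwise distinct edge colors."""
    for x, y, z in combinations(vertices, 3):
        cols = {color[frozenset(p)] for p in ((x, y), (x, z), (y, z))}
        if len(cols) == 3:
            return True
    return False


def has_tree_representation(vertices, color):
    """Check both conditions for a complete edge-colored graph (assumed input)."""
    for c in set(color.values()):
        edges = {e for e, col in color.items() if col == c}
        if has_induced_p4(vertices, edges):
            return False
    return not has_rainbow_triangle(vertices, color)


def coloring(pairs):
    """Helper: build the color dictionary from ((x, y), color) items."""
    return {frozenset(p): c for p, c in pairs}


genes = ["a1", "b1", "c1", "d1"]
good = coloring([(("a1", "b1"), "o"), (("a1", "c1"), "o"), (("b1", "c1"), "o"),
                 (("c1", "d1"), "o"), (("a1", "d1"), "p"), (("b1", "d1"), "p")])
print(has_tree_representation(genes, good))
```

The example coloring corresponds to one duplication nested between two speciations; recoloring it so that the "o" edges form the path a1-b1-c1-d1 would violate the cograph condition.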

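One possibility to correct the initial estimates $\widehat{R_o}$, $\widehat{R_p}$, $\widehat{R_x}$ to the “closest” relations $R_o^*$, $R_p^*$, $R_x^*$ so that there is a tree representation of $R_o^*$, $R_p^*$, $R_x^*$, therefore, could be the change of a minimum number of edge-colors in $G_{\widehat{R_o},\widehat{R_p},\widehat{R_x}}$ so that the resulting graph $G_{R_o^*,R_p^*,R_x^*}$ fulfills Condition (i) and (ii). This problem was recently shown to be NP-complete [HW:15, HW16, Liu:12].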

#### Inference of Local Substructures of the Gene Tree

Assume we have given (estimated or true) relations $R_1,\dots,R_n$ so that the graph-representation $G_{R_1,\dots,R_n}$ fulfills Condition (i) and (ii) of Theorem 1. We now show briefly how to construct the tree $(T,t)$ that represents $R_1,\dots,R_n$.

Here we utilize the information of triples that are extracted from the graph $G_{R_1,\dots,R_n}$ and that must be contained in any gene tree representing $R_1,\dots,R_n$. More precisely, given the relations $R_1,\dots,R_n$ we define the set of triples $\mathbb{T}_{R_1,\dots,R_n}$ as follows: For all three distinct genes $x,y,z\in\mathfrak{G}$ we add the triple $(xy|z)$ to $\mathbb{T}_{R_1,\dots,R_n}$ if and only if the colors of the edges $xz$ and $yz$ are identical but distinct from the color of the edge $xy$ in $G_{R_1,\dots,R_n}$. In other words, for the given evolutionary relations $R_1,\dots,R_n$ the triple $(xy|z)$ is added to $\mathbb{T}_{R_1,\dots,R_n}$ iff the two genes $x$ and $z$, as well as $y$ and $z$, are in the same evolutionary relationship, but different from the evolutionary relation between $x$ and $y$.
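The triple rule above translates almost literally into code. The following sketch assumes the same complete edge-coloring dictionary as input (unordered pairs mapped to color labels) and encodes a triple $(xy|z)$ as the Python tuple `(x, y, z)` with `x < y`; both conventions are choices made for this illustration.

```python
from itertools import combinations


def gene_triples(vertices, color):
    """Return all triples (x, y, z), read as (xy|z), implied by the coloring.

    (xy|z) is added iff the edges xz and yz share a color that differs
    from the color of the edge xy.
    """
    triples = set()
    for trio in combinations(sorted(vertices), 3):
        for z in trio:
            x, y = [v for v in trio if v != z]
            c_xz = color[frozenset((x, z))]
            c_yz = color[frozenset((y, z))]
            c_xy = color[frozenset((x, y))]
            if c_xz == c_yz and c_xz != c_xy:
                triples.add((x, y, z))
    return triples


# Hypothetical coloring: "o" marks orthologous pairs, "p" paralogous pairs.
genes = ["a1", "b1", "c1", "d1"]
color = {frozenset(p): c for p, c in [
    (("a1", "b1"), "o"), (("a1", "c1"), "o"), (("b1", "c1"), "o"),
    (("c1", "d1"), "o"), (("a1", "d1"), "p"), (("b1", "d1"), "p"),
]}
triples = gene_triples(genes, color)
print(sorted(triples))
```

Note that the three mutually orthologous genes `a1`, `b1`, `c1` contribute no triple, since all three edges between them carry the same color.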

###### Theorem 2 ([Boeckner:98, HHH+13]).

Let $G_{R_1,\dots,R_n}$ be the graph-representation of the relations $R_1,\dots,R_n$. The graph $G_{R_1,\dots,R_n}$ fulfills conditions (i) and (ii) of Theorem 1 (and thus, there is a tree representation of $R_1,\dots,R_n$) if and only if there is a tree $T$ that contains all the triples in $\mathbb{T}_{R_1,\dots,R_n}$.

The importance of the latter theorem lies in the fact that the well-known algorithm BUILD [Aho:81, sem-ste-03a] can be applied to $\mathbb{T}_{R_1,\dots,R_n}$ to determine whether the set of triples is consistent and, if so, constructs a tree representation $T$ in polynomial time. To obtain a valid event-label $t$ for such a tree $T$ we can simply set $t(\mathrm{lca}_T(x,y)) = i$ if the color of the edge $xy$ in $G_{R_1,\dots,R_n}$ is “$i$” [HHH+13].
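For readers unfamiliar with BUILD, a compact recursive sketch is given below. It follows the classical idea of splitting the leaf set along the connected components of an auxiliary graph in which $x$ and $y$ are joined for every triple $(xy|z)$ restricted to the current leaves. The tuple encoding of triples and of the output tree is an assumption of this sketch, and the event labeling of the inner vertices is omitted.

```python
def build(leaves, triples):
    """Return a nested-tuple tree containing all triples, or None if inconsistent.

    `triples` is a collection of tuples (x, y, z), each read as (xy|z).
    A leaf is returned as its name; an inner vertex as a tuple of subtrees.
    """
    leaves = sorted(leaves)
    if len(leaves) == 1:
        return leaves[0]

    # Union-find over the current leaves to compute the components of the
    # auxiliary graph with an edge xy for each triple (xy|z) on these leaves.
    parent = {v: v for v in leaves}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    present = set(leaves)
    for x, y, z in triples:
        if x in present and y in present and z in present:
            parent[find(x)] = find(y)

    components = {}
    for v in leaves:
        components.setdefault(find(v), []).append(v)

    if len(components) == 1:
        return None  # connected auxiliary graph on >= 2 leaves: no tree exists

    subtrees = []
    for component in components.values():
        subtree = build(component, triples)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


triples = {("a1", "b1", "d1"), ("a1", "d1", "c1"), ("b1", "d1", "c1")}
tree = build(["a1", "b1", "c1", "d1"], triples)
print(tree)  # ((('a1', 'b1'), 'd1'), 'c1')
```

The returned tree groups `a1` with `b1`, then adds `d1`, then `c1`; a contradictory input such as `{("a", "b", "c"), ("a", "c", "b")}` yields `None`.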

It should be stressed that the evolutionary relations do not contain the full information on the event-labeled gene tree, see Fig. 2. Instead, the constructed gene trees are homeomorphic images of the (possibly true) observable gene tree, obtained by collapsing adjacent events of the same type [HHH+13]. That is, in the constructed tree all inner vertices that are connected by an edge will have different event-labels, see Fig. 2. Those trees are also known as discriminating representations, cf. [Boeckner:98]. However, these discriminating representations contain and provide the necessary information to recover the input relations, are unique (up to isomorphism), and do not pretend a higher resolution than actually supported by the data.

### 3.3 Construction of Species Trees

While the latter results have been established for ($\mathrm{lca}$)-orthologs, $\mathrm{lca}$-paralogs and $\mathrm{lca}$-xenologs, we restrict our attention in this subsection to orthologous and paralogous genes only and assume that there are no HGT-events in the gene trees. We shall see later that in practical computation the existence of xenologous genes does not have a large impact on the reconstructed species history, although the theoretical results are established for gene histories without xenologous genes.

In order to derive, for a gene tree $T$ (that contains only speciation and duplication events), a species tree $S$ with which $T$ can be reconciled or, simply speaking, into which $T$ can be “embedded”, we need to answer the question under which conditions there exists such a species tree for a given gene tree.

A tree $S$ with leaf set $\mathfrak{S}$ is a species tree for a gene tree $T$ with leaf set $\mathfrak{G}$ if there is a reconciliation map $\mu$ that maps the vertices in $T$ to vertices or edges in $S$. A reconciliation map $\mu$ maps the genes $x\in\mathfrak{G}$ in $T$ to the respective species $X\in\mathfrak{S}$ in $S$ the gene $x$ resides in so that specific constraints are fulfilled. In particular, the inner vertices of $T$ with label “speciation” are mapped to the inner vertices of $S$, while the duplication vertices of $T$ are mapped to the edges in $S$ so that the relative “evolutionary order” of the vertices in $T$ is preserved in $S$. We refer to [HHH+12] for the full definition of reconciliation maps. In Fig. 1, the reconciliation map is implicitly given by drawing the species tree superimposed on the gene tree.

Hence, for a given gene tree $(T,t)$ we wish to efficiently decide whether there is a species tree into which $(T,t)$ can be embedded, and if so, construct such a species tree together with the respective reconciliation map. We will approach the problem of deriving a species tree from an event-labeled gene tree by reducing the reconciliation map from gene tree to species tree to rooted triples of genes residing in three distinct species. To this end we define a species triple set $\mathbb{S}$ derived from $(T,t)$ that provides all information needed to efficiently decide whether there is a species tree for $(T,t)$ or not.

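Let $\mathbb{R}_o(T)$ be the set of all triples $(ab|c)$ that are contained in $T$ s.t. $a$, $b$, $c$ reside in pairwise different species and $\mathrm{lca}_T(a,b,c) \mathrel{\widehat{=}} \text{speciation}$, then set

$$\mathbb{S} \coloneqq \{(AB|C) : \exists\, (ab|c)\in\mathbb{R}_o(T) \text{ with } a\in A,\ b\in B,\ c\in C\}.$$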

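It should be noted that by results established in [Boeckner:98, HHH+13] it is possible to derive the triple set $\mathbb{S}$ directly from the orthology relation $R_o$ without constructing a gene tree, cf. [HHH+13]: $(AB|C)\in\mathbb{S}$ if and only if

(I) $A$, $B$ and $C$ are pairwise different species

and there are genes $a\in A$, $b\in B$, $c\in C$ so that either

(IIa) $(a,c),(b,c)\in R_o$ and $(a,b)\notin R_o$, or

(IIb) $(a,b),(a,c),(b,c)\in R_o$ and there is a gene $d\in\mathfrak{G}$ with $(c,d)\in R_o$ and $(a,d),(b,d)\notin R_o$.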
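As an illustration of deriving species triples directly from an orthology relation, the sketch below tests two gene-level patterns for genes from three pairwise different species: either $a$ and $b$ are both orthologous to $c$ but not to each other, or $a$, $b$, $c$ are pairwise orthologous and a fourth gene $d$ is orthologous to $c$ but to neither $a$ nor $b$. These patterns are stated here as assumptions of the sketch; the data layout and function name are hypothetical, and a species triple $(AB|C)$ is encoded as the tuple `(A, B, C)` with `A < B`.

```python
from itertools import permutations


def species_triples(species_of, orthologs):
    """Collect species triples (A, B, C), read as (AB|C).

    `species_of` maps each gene to its species; `orthologs` is a set of
    frozenset gene pairs. All pairs not listed are treated as non-orthologous.
    """
    genes = sorted(species_of)

    def orth(x, y):
        return frozenset((x, y)) in orthologs

    triples = set()
    for a, b, c in permutations(genes, 3):
        A, B, C = species_of[a], species_of[b], species_of[c]
        if len({A, B, C}) < 3:
            continue  # the three species must be pairwise different
        # Pattern 1: a and b are orthologous to c, but not to each other.
        pattern_1 = orth(a, c) and orth(b, c) and not orth(a, b)
        # Pattern 2: a, b, c pairwise orthologous, plus a witness gene d.
        pattern_2 = (orth(a, b) and orth(a, c) and orth(b, c) and any(
            orth(c, d) and not orth(a, d) and not orth(b, d)
            for d in genes if d not in (a, b, c)))
        if pattern_1 or pattern_2:
            first, second = sorted((A, B))
            triples.add((first, second, C))
    return triples


# Toy data: gene a2 is a paralog of a1 and b1 that survived only in species A.
species_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
orthologs = {frozenset(p) for p in [("a1", "b1"), ("a1", "c1"),
                                    ("b1", "c1"), ("a2", "c1")]}
print(species_triples(species_of, orthologs))
```

In the toy data a single duplication is enough to support the species triple with `A` and `B` grouped against `C`; without the paralog `a2`, no triple would be found.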

Thus, in order to infer species triples a sufficient number of duplication events must have happened. The following important result was given in [HHH+12].

###### Theorem 3.

Let $(T,t)$ be a given gene tree that contains only speciation and duplication events. Then there is a species tree $S$ for $(T,t)$ if and only if there is any tree containing all triples in $\mathbb{S}$.

In the positive case, the species tree $S$ and the reconciliation between $(T,t)$ and $S$ can be found in polynomial time.

Interestingly, the latter theorem implies that the gene tree $(T,t)$ can be embedded into any tree that contains the triples in $\mathbb{S}$. Hence, one usually wants to find a species tree with the least number of inner vertices, as those trees constitute one of the best estimates of the phylogeny without pretending a higher resolution than actually supported by the data. Such trees are also called minimally resolved trees, and computing such trees is an NP-hard problem [Jansson:12].

Despite the variance reduction due to cograph editing, noise in the data, as well as the occasional introduction of contradictory triples as a consequence of horizontal gene transfer, is unavoidable. The species triple set $\mathbb{S}$ collected from the individual gene families thus will not always be consistent. The problem of determining a maximum consistent subset of an inconsistent set of triples is NP-hard and also APX-hard, see [Byrka:10a, vanIersel:09]. Polynomial-time approximation algorithms for this problem and further theoretical results are reviewed in [Byrka:10].

The results in this subsection have been established for the reconciliation between event-labeled gene trees without HGT-events and inferred species trees. Although there are reconciliation maps defined for gene trees that contain xenologs and respective species trees [Bansal-HGT, Bansal-HGT2], a mathematical characterization of the species triples and of the existence of species trees for those gene trees, which might also help to understand the transfer events themselves, is still an open problem.

### 3.4 Summary of the Theory

The latter results show that it is not necessary to restrict the inference of species trees to 1:1 orthologs. Importantly, orthology information alone is sufficient to reconstruct the species tree provided that (i) the orthology is known without error and unperturbed by horizontal gene transfer and (ii) the input data contains a sufficient number of duplication events. Although species trees can be inferred in polynomial time for noise-free data, in a realistic setting, three NP-hard optimization problems need to be solved.

We summarize the important working steps to infer the respective gene and species trees from genetic material.

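• (W1) Compute the estimate $\widehat{R_o}$ and set $\widehat{R_p} = \lfloor \mathfrak{G}\times\mathfrak{G} \rfloor^{\mathrm{irr}} \setminus \widehat{R_o}$.

• (W2) Edit the graph $G_{\widehat{R_o}}$ to the closest cograph with a minimum number of edge edits to obtain the graph $G_{R_o^*}$. Note, $R_p^* = \lfloor \mathfrak{G}\times\mathfrak{G} \rfloor^{\mathrm{irr}} \setminus R_o^*$.

• (W3) Compute the tree representation $(T,t)$ w.r.t. $R_o^*$, $R_p^*$.

• (W4) Extract the species triple set $\mathbb{S}$ from $(T,t)$.

• (W5) Extract a maximal consistent triple set $\mathbb{S}^*$ from $\mathbb{S}$.

• (W6) Compute a minimally resolved species tree $S$ that contains all triples in $\mathbb{S}^*$, and, if desired, the reconciliation map $\mu$ between $(T,t)$ and $S$ (cf. Thm. 3).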

In the presence of horizontal transfer, in Step (W1) the xenologous genes are either predicted as orthologs or paralogs.

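Furthermore, in Step (W2) it suffices to edit the graph $G_{\widehat{R_o}}$ only, since afterwards the graph representation $G_{R_o^*,R_p^*}$ with $R_p^* = \lfloor \mathfrak{G}\times\mathfrak{G} \rfloor^{\mathrm{irr}} \setminus R_o^*$ and, thus, $R_x^*=\emptyset$ fulfills the conditions of Thm. 1 [Corneil:81, HHH+13]. In particular, the graphs $G_{R_o^*}$ and $G_{R_p^*}$ have then been obtained from $G_{\widehat{R_o}}$, resp., $G_{\widehat{R_p}}$ with a minimum number of edge edits. The latter is due to the fact that the complement of $G_{R_o^*}$ is the graph $G_{R_p^*}$ [Corneil:81].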

To extract the species triple set in Step (W4), it suffices to choose the respective species triples using Condition (I) and (IIa)/(IIb), without constructing the gene trees and thus, Step (W3) can be ignored if the gene history is not of further interest.

## 4 Evaluation

In [HLS+15] it was already shown that for real-life data sets the paralogy-based method produces phylogenetic trees for moderately sized species sets. The resulting species trees are comparable to those presented in the literature that are constructed by “state-of-the-art” phylogenetic reconciliation approaches such as RAxML [raxml:14] or MrBayes [mrbayes:12]. To this end, genomic sequences of 11 Aquificales and 19 Enterobacteriales species were analyzed. Based on the NCBI gene annotations of those species, an orthology prediction was performed using ProteinOrtho. From that prediction, phylogenetic trees were constructed using the aforementioned orthology-paralogy-based approach (working steps (W2)-(W6)) implemented as an integer linear program in ParaPhylo [HLS+15]. The advantage of this approach is the computation of exact solutions; however, the runtime scales exponentially with the number of input genes per gene family and the number of species.

However, as there is no gold standard for phylogenetic tree reconstruction, three simulation studies are carried out to evaluate the robustness of the method. Using the Artificial Life Framework (ALF) [DAGD:12], the evolution of generated gene sequences was simulated along a given branch length-annotated species tree, explicitly taking into account gene duplication, gene loss, and horizontal transfer events. For realistic species trees, the $\gamma$-proteobacteria tree from the OMA project [ASGD:11] was randomly pruned to a size of 10 species while conserving the branch lengths. For additional details on the simulation see [HLS+15]. The reconstructed trees are then compared with the initial species trees, using the software TreeCmp [BGW-treeCmp:12]. In the provided box plots (Fig. 3), tree distances are computed according to the triple metric and normalized by the average distance between random Yule trees, see [HLS+15] for further evaluations.

The three simulation studies are intended to answer three individual questions.

1. How much data is needed to provide enough information to reconstruct accurate species trees? (cf. Fig. 3 (top))

2. How does the method perform with noisy data? (cf. Fig. 3 (middle))

3. What is the impact of horizontal gene transfer on the accuracy of the method? (cf. Fig. 3 (down))

To construct accurate species trees, the presented method requires a sufficient amount of duplicated genes. Assuming a certain gene duplication rate, the amount of duplicated genes correlates directly with the number of genes per species, respectively the number of gene families. The first simulation study (Fig. 3 (top)) is therefore performed with several numbers of gene families, varying from 100 to 500. The simulation with ALF was performed without horizontal gene transfer and the phylogenetic trees are computed based on the unaltered orthology/paralogy relation obtained from the simulation, that is, the orthologs and paralogs can directly be derived from the simulated gene trees. It turned out that with an duplication rate of 0.005, which corresponds to approximately 8% of paralogous pairs of genes, 500 gene families are sufficient to produce reliable phylogenetic trees. With less gene families, and hence less duplicated genes, the trees tend to be only poorly resolved.

For the second study, the simulated orthology/paralogy relation of 1000 gene families was perturbed by three types of noise: (i) insertion and deletion of edges in the orthology graph (homologous noise), (ii) insertion of edges (orthologous noise), and (iii) deletion of edges (paralogous noise), see Fig. 3 (middle). In each of the three models, an edge is inserted or removed with a given noise probability. It can be observed that up to a noise level of approximately 10% the method produces trees that are almost identical to the initial trees. In particular, in the case of orthology overprediction (orthologous noise), the method is robust even if 25% of the input data was disturbed.
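The three noise models can be sketched as follows. This is an illustrative reading of the description above, not the code used in the study. The function name, the mode labels, the representation of the orthology graph as a set of unordered gene pairs, and the choice to test every pair of genes independently are assumptions.

```python
import random
from itertools import combinations

def perturb_orthology_graph(genes, ortho_edges, p, mode, seed=0):
    """Return a noisy copy of the orthology graph (edges = orthologous pairs).

    mode = "homologous":  insert non-edges and delete edges, each with probability p
    mode = "orthologous": only insert edges (paralogs mispredicted as orthologs)
    mode = "paralogous":  only delete edges (orthologs mispredicted as paralogs)
    """
    if mode not in ("homologous", "orthologous", "paralogous"):
        raise ValueError(f"unknown noise mode: {mode}")
    rng = random.Random(seed)
    edges = {frozenset(e) for e in ortho_edges}
    noisy = set(edges)
    for pair in map(frozenset, combinations(sorted(genes), 2)):
        if rng.random() >= p:          # this pair is left untouched
            continue
        if pair in edges and mode in ("homologous", "paralogous"):
            noisy.discard(pair)        # deletion of an orthology edge
        elif pair not in edges and mode in ("homologous", "orthologous"):
            noisy.add(pair)            # insertion of an orthology edge
    return noisy

genes = ["g1", "g2", "g3"]
edges = [("g1", "g2")]
print(perturb_orthology_graph(genes, edges, 0.1, "homologous"))
```

Because non-edges of the orthology graph are read as paralogous pairs, inserting an edge corresponds to orthology overprediction and deleting one to paralogy overprediction.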

Finally, in the third analysis, data sets are simulated with different rates of horizontal gene transfer, see Fig. 3 (down). The number of HGT events in the gene trees is varied up to 15.3%, which corresponds to 39.4% of all pairs of genes having at least one HGT event on the path between them in the generated gene tree, i.e., the two genes are xenologous with respect to the definition of Fitch [Fitch2000]. Firstly, the simulated gene sequences are analyzed using ProteinOrtho, and the tree reconstruction is then performed based on the resulting orthology/paralogy prediction (Fig. 3 (down/left)). Secondly, we used both definitions of xenology, i.e., lca-xenologous and the notion of Fitch. Note that, so far, the reconstruction of species trees with ParaPhylo requires that pairs of genes are either orthologous or paralogous. Hence, we used the information of the lca-orthologs, lca-paralogs and lca-xenologs derived from the simulated gene trees. Fig. 3 (down/center) shows the accuracy of reconstructed species trees under the assumption that all lca-xenologs are “mispredicted” as lca-orthologs, in which case all paralogous genes are identified correctly. Fig. 3 (down/right) shows the accuracy of reconstructed species trees under the assumption that all xenologs w.r.t. the notion of Fitch are interpreted as lca-orthologs. The latter amounts to the “misprediction” of lca-xenologs and lca-paralogs as lca-orthologs. However, all remaining lca-paralogs are still correctly identified.

For the orthology/paralogy prediction based on ProteinOrtho, the resulting trees have a distance of approximately 0.3 to 0.4 to the initial species tree. Here, a distance of 1 refers not to the maximal distance but to the average distance between random trees. However, the accuracy of the constructed trees appears to be independent of the amount of horizontal gene transfer. Hence, ProteinOrtho either fails to identify the gene families correctly or mispredicts orthologs and paralogs (due to, e.g., gene loss). If all paralogous genes are identified correctly, ParaPhylo produces more accurate trees. We obtain even more accurate species trees when predicting all pairs of Fitch-xenologous genes as lca-orthologs, even with a large amount of HGT events.
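The notion of xenology in the sense of Fitch used above, namely that a pair of genes is xenologous if an HGT event lies on the path connecting them in the gene tree, can be sketched as a simple path check. The encoding of the gene tree as a child-to-parent map, the marking of transfers as a set of (child, parent) edges, and all names are assumptions made for this illustration.

```python
def path_edges(parent, x, y):
    """Return the (child, parent) edges on the path between genes x and y."""
    def ancestors(v):
        chain = [v]
        while v in parent:             # the root has no entry in the map
            v = parent[v]
            chain.append(v)
        return chain

    up_x, up_y = ancestors(x), ancestors(y)
    on_y = set(up_y)
    lca = next(v for v in up_x if v in on_y)   # last common ancestor of x and y
    edges = set()
    for chain in (up_x, up_y):
        for v in chain:
            if v == lca:
                break
            edges.add((v, parent[v]))
    return edges

def fitch_xenologous(parent, transfer_edges, x, y):
    """True if at least one marked transfer edge lies on the path between x and y."""
    return bool(path_edges(parent, x, y) & set(transfer_edges))

# Hypothetical gene tree: root r with children u and c; u has children a and b.
parent = {"a": "u", "b": "u", "u": "r", "c": "r"}
transfer_edges = {("b", "u")}          # gene b was acquired by horizontal transfer
print(fitch_xenologous(parent, transfer_edges, "a", "b"))
print(fitch_xenologous(parent, transfer_edges, "a", "c"))
```

In this toy tree, every pair involving gene b is xenologous in this sense, whereas the pair (a, c) is not, since no transfer edge separates those two genes.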

## 5 Concluding Remarks

The restriction to 1:1 orthologs for the reconstruction of the evolutionary history of species is not necessary. Moreover, it has been shown that the knowledge of only a few correctly identified paralogs suffices to reconstruct accurate species trees, even in the presence of horizontal gene transfer. The information of paralogs is strictly complementary to the sources of information used in phylogenomics studies, which are always based on alignments of orthologous sequences. Hence, paralogs contain meaningful and valuable information about the gene and the species trees. Future research might therefore focus on improvements of orthology and paralogy inference tools, on mathematical frameworks for tree representations of non-symmetric relations (since HGT is naturally a directed event), and on a characterization of the reconciliation between gene and species trees in the presence of HGT.

## References
