Reduction rules for the maximum parsimony distance on phylogenetic trees

Reduction rules for the maximum parsimony distance on phylogenetic trees

Steven Kelk, Mareike Fischer, Vincent Moulton, Taoyang Wu Department of Knowledge Engineering, Maastricht University, P.O. Box 616, 6200 MD Maastricht, Netherlands. (Kelk and Wu are corresponding authors).Institut für Mathematik und Informatik, Walther-Rathenau-Straße 47, 17487 Greifswald, Germany.Moulton and Wu are both affiliated to: School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, United Kingdom.
July 13, 2019
Abstract

In phylogenetics, distances are often used to measure the incongruence between a pair of phylogenetic trees that are reconstructed by different methods or using different regions of genome. Motivated by the maximum parsimony principle in tree inference, we recently introduced the maximum parsimony (MP) distance, which enjoys various attractive properties due to its connection with several other well-known tree distances, such as tbr and spr. Here we show that computing the MP distance between two trees, a NP-hard problem in general, is fixed parameter tractable in terms of the tbr distance between the tree pair. Our approach is based on two reduction rules–the chain reduction and the subtree reduction–that are widely used in computing tbr and spr distances. More precisely, we show that reducing chains to length 4 (but not shorter) preserves the MP distance. In addition, we describe a generalization of the subtree reduction which allows the pendant subtrees to be rooted in different places, and show that this still preserves the MP distance. On a slightly different note we also show that Monadic Second Order Logic (MSOL), posited over an auxiliary graph structure known as the display graph (obtained by merging the two trees at their leaves), can be used to obtain an alternative proof that computation of MP distance is fixed parameter tractable in terms of tbr-distance. We conclude with an extended discussion in which we focus on similarities and differences between MP distance and TBR distance and present a number of open problems. One particularly intriguing question, emerging from the MSOL formulation, is whether two trees with bounded MP distance induce display graphs of bounded treewidth.111Keywords: Phylogenetics, parsimony, fixed parameter tractability, chain, incongruence, treewidth.

1 Introduction

Finding an optimal tree explaining the relationships of a group of species based on datasets at the genomic level is one of the important challenges in modern phylogenetics. First, there are various methods to estimate the “best” tree subject to certain criteria, such as e.g. Maximum Parsimony or Maximum Likelihood. However, different methods often lead to different trees for the same dataset, or the same method leads to different trees when different parameter values are used. Second, the trees reconstructed from different regions of the genome might also be different, even when using the same criteria. In any case, when two (or more) trees for one particular set of species are given, the problem is to quantify how different the trees really are – are they entirely different or do they agree concerning the placement of most species?

In order to answer this problem, various distances have been proposed (see e.g. [24]). A relatively new one is the so-called Maximum Parsimony distance, or MP distance for short, which we denote [14, 19, 21]. This distance (which is a metric) is appealing in part due to the fact that it is closely related to the parsimony criterion for constructing phylogenetic trees, as well as to the Subtree Prune and Regraft (spr) and Tree Bisection and Reconnection (tbr) distances. Indeed, it is shown in [21] that the unit neighbourhood of the MP distance is larger than those of the spr and tbr distances, implying that a hill-climbing heuristic search based on the MP distance will be less likely to be trapped in a local optimum than those based on the spr or tbr distances. Recently, it has been shown that computing the MP distance is NP-hard [14, 19] even for binary phylogenetic trees. For practical purposes it is therefore desirable to determine whether computation of is fixed parameter tractable (FPT). Informally, this asks whether can be computed efficiently when (or some other parameter of the input) is small, irrespective of the number of species in the input trees. We refer to standard texts such as [12] for more background on FPT. Such algorithms are used extensively in phylogenetics, see e.g. [26] for a recent example.

An obvious approach to address this question is to try to kernelize the problem. Roughly speaking, when given two trees, we seek to simplify them as much as possible without changing so that we can calculate the distance for the simpler trees rather than the original ones. Standard procedures that have been used to kernelize other phylogenetic tree distances are the so-called subtree and chain reductions (see, for example, [1, 6, 17]). In this paper we show that the chain reduction preserves and that chains can be reduced to length 4 (but not less). Moreover, we show that a certain generalized subtree reduction, namely one where the subtrees are allowed to have different root positions, also has this property, which extends a result in [21]. Both reductions can be applied in polynomial time.

These new results allow us to leverage the existing literature on tbr distance. Specifically, in [1] Allen and Steel showed that tbr distance, denoted , is NP-hard to compute, by exploiting the essential equivalence of the problem with the Maximum Agreement Forest (maf) problem: they differ by exactly 1. In the same article they showed (again utilizing the equivalence with maf) that computation of is FPT in parameter . More specifically, it was shown that combining the subtree reduction with the chain reduction (where chains are reduced to length 3, rather than length 4 as we do here) is sufficient to obtain a reduced pair of trees where the number of species is at most a linear function of . Careful reading of the analysis in [1] shows that a linear (albeit slightly larger) kernel is still obtained for if chains are reduced to length 4 rather than 3. More recently, in [18] an exponential-time algorithm was described and implemented which computes in time where is the number of species in the trees and is the golden ratio. Combining the results of [1, 18] with the main results of the current paper (i.e. Theorems 3.1 and 4.1) immediately yields the following theorem:

Theorem 1.1.

Let and be two unrooted binary trees on the same set of species . Computation of is fixed parameter tractable in parameter . More specifically, can be computed in time where is the golden ratio and .

The constant 112/3 is obtained by multiplying the bound on the size of the kernel given in [1] () by a factor 4/3, which adjusts for the fact that here chains are reduced to length 4 rather than 3. Note also that Theorem 1.1 does not require us to apply the generalized subtree reduction: the traditional subtree reduction together with the chain reduction is sufficient.

We now summarise the rest of the paper. In the next section we collect some necessary definitions and notations, including a brief description of Fitch’s algorithm which our proofs extensively use. Then in the following three sections we establish the two reductions for the MP distance, that is, the chain reduction and the subtree reduction, and remark that a theoretical variant of Theorem 1.1 could also be attained by leveraging Courcelle’s Theorem [10, 2], extending in a non-trivial way a technique introduced in [20]. Specifically, computation of can be formulated as a sentence of Monadic Second Order Logic (MSOL) posited over an auxiliary graph structure known as the display graph. The display graph is obtained by (informally) merging the two trees at their leaves. Crucially, the length of the sentence, and the treewidth of the display graph, are shown to be both bounded as a function of .

We end with an extended discussion in which we focus on similarities and differences between MP distance and TBR distance. From a theoretical perspective the two distances sometimes behave rather differently but in practice and are often very close indeed. The major open problem that remains is whether computation of is FPT when parameterized by itself. One possible route to this result is via a strengthened MSOL formulation, but this requires a number of challenging questions to be answered. In particular, can the treewidth of the display graph be bounded as a function of (rather than )? This in turn is likely to require new structural results on the interaction between (large grid) minors in the display graph and phylogenetic incongruency parameters.

2 Preliminaries

2.1 Basic definitions

An unrooted binary phylogenetic tree on a set of species (or, more abstractly, taxa) is a connected, undirected tree in which all internal nodes have degree 3 and the leaves are bijectively labelled by . For brevity we henceforth refer to these simply as trees, and we often use the elements of to denote the leaves they label. In some cases, we have to consider rooted binary phylogenetic trees instead of unrooted ones. These trees have an additional internal node of degree 2. When referring to such trees, we will talk about rooted trees for short.

For two trees and on the same set of taxa , we write if there is an isomorphism between the two trees that preserves the labels . The expression , where , has the usual definition, namely: the tree obtained by taking the unique minimal spanning tree on and then repeatedly suppressing any nodes of degree 2.

A character on is a surjective function where is a set of states. Given a phylogenetic tree on , and a character on , an extension of to is a mapping which extends i.e. for every , . The number of mutations induced by , denoted by , is defined to be the number of edges such that . The parsimony score of on (sometimes called the length) is defined to be the minimum, ranging over all extensions of to , of the number of mutations induced by . This is denoted . Following [27], an extension that achieves this minimum is called a minimum extension (also known as an optimal extension, but here we reserve the word optimal for other use). This value can be computed in polynomial time using dynamic programming. Fitch’s algorithm is the most well-known example of this. (We will use Fitch’s algorithm extensively in this article and give a brief description of its execution in the next section).

Given two trees and on , the maximum parsimony distance of and , denoted , is defined as

 d\textscMP(T1,T2)=maxf|lf(T1)−lf(T2)|

where ranges over all characters on . A character that achieves this maximum is called an optimal character. In [14, 21] it is proven that is a metric.

Note that in this manuscript, we also compare to the well-known Tree Bisection and Reconnection (TBR) distance, denoted . Recall that a TBR move is performed as follows: Given an unrooted binary phylogenetic tree, delete one edge and suppress all resulting nodes of degree 2. Of the two trees now present, if they consist of at least two nodes, pick an edge and place a degree-2 node on it and choose it; else if either one only consists of one leaf, choose this leaf. Now connect the two chosen nodes with a new edge. This completes the TBR move. Note that is defined as the minimum number of TBR moves needed to transform into . In [14, 21] it is proven that for all trees , with both articles listing examples where the inequality is strict.

A concept which often occurs when discussing tree distances is the so-called agreement forest abstraction. Recall that, given two trees and on , an agreement forest is a partition of into non-empty subsets , such that and are isomorphic for all , and such that the subtrees and are node disjoint subtrees of for all and and for . An agreement forest with a minimum number of components is called a Maximum Agreement Forest, or MAF for short. In [1] it was proven that is equal to the number of components in a MAF, minus one.

The last concept we need to recall is fixed parameter tractability (FPT). An algorithm is fixed parameter tractable in parameter if its running time has the form where is the size of the input (here we take ) and is some (usually exponential) computable function that depends only on . For distances on trees it is quite usual to take the distance itself as the parameter, but other parameters can be chosen, and this is the approach we take in this article (i.e. we parameterize computation of in terms of ). For more formal background on FPT we refer the reader to [12].

We defer a number of definitions (concerning treewidth and display graphs) until later in the article.

2.2 Fitch’s algorithm

For a given character on , Fitch’s algorithm [15] is a well-known polynomial-time algorithm for computing and inferring a minimum extension of (see, e.g. [28], for a recent application). It has a bottom-up phase followed by a top-down phase (actually, in the original paper, Fitch introduced a second top-down phase, but this is not needed in the present manuscript and is thus ignored here). It works on rooted trees, but the location of the root is not important for computation of , so we may root the tree by subdividing an arbitrary edge with a new node and directing all edges away from this new node. (In particular, this ensures that the child-parent relation is well-defined). For each internal node of a rooted tree, let and refer to its two children.

In the first phase, the algorithm constructs the Fitch map (induced by character ) that assigns a subset of states to each of node of in the following bottom-up approach:

1. For each leaf , let .

2. For each internal node (for which and have already been computed), let

 F(u)=F(ul)∗F(ur)={F(ul)∪F(ur)if F(ul)∩F(ur)=∅,F(ul)∩F(ur)otherwise. (1)

An internal node is called a union node if the first case in Equation (1) occurs (i.e., ), and an intersection node otherwise. The value is equal to the total number of union nodes in .

For later use, an extension of on is called a Fitch-extension if (i) holds for all , and (ii) for each non-leaf node of , we have either or (but not both) if is a union node, and otherwise (i.e. is an intersection node).

In the second phase, for an arbitrary state the algorithm constructs a Fitch-extension in the following top-down manner. We start with . Suppose that is a child of for which is defined, then

 ¯¯¯f(v)={¯¯¯f(u)if ¯¯¯f(u)∈F(v),any state in F(v)otherwise. (2)

Since each union node will contribute precisely one mutation for the extension specified in Equation (2), each Fitch-extension is always minimum. (However, note that a minimum extension is not necessarily a Fitch-extension [13].) The following observation, which we use later, is immediate from the second phase of Fitch’s algorithm.

Observation 2.1.

Let be a rooted binary tree on and let be a character on . Let be the root of and consider the Fitch map induced by . For each state , there exists a Fitch-extension of such that .

3 Chain reduction

Let be an unrooted binary tree on . For a leaf , let denote the internal node of adjacent to this leaf. Then, an ordered sequence of taxa is called a chain of length if is a path in . Note that here we allow that (i.e., and have a common parent) and/or (i.e. and have a common parent): if at least one of these situations occurs we say the chain is pendant. (This is equivalent to definitions used in earlier articles). A chain is common to and if it is a chain of both trees. Suppose and have a common chain where denotes the taxa in the chain and . Let , be two new trees on where and . Then we say that , have been obtained by reducing to length 4.

Theorem 3.1.

Let and be two unrooted binary trees on the same set of taxa . Let be a common chain of length . Let and be the two trees obtained by reducing to length 4. Then .

Proof.

Note that follows from Corollary 3.5 of [21], which proves that for all , . The inequality then follows from the definition of chain reduction.

It is considerably more involved to prove the claim that holds.

Without loss of generality, we may assume that (i.e., ) as otherwise the claim clearly holds. Note that this implies and hence whenever is pendant in a tree, at least one end of the chain is attached to the main part of the tree.

We will prove the claim by considering the following three major cases: (I) the common chain is pendant in neither tree, (II) the chain is pendant in preciesly one tree, and (III) the chain is pendant in both trees.

I: Common chain is pendant in neither tree

Let be an optimal character for and i.e. . Assume without loss of generality that , so .

Let refer to the 4 subtrees of shown in Figure 1. For , let refer to the edge incoming to the root of ; let refer to the taxa in subtree ; let denote the character obtained by restricting to , and let refer to the set of states assigned to the root of by the Fitch map induced by . (Note that .) For each tree we define the chain region of to be the set of edges incident to at least one red node (as shown in Figure 1). Let be the number of union nodes among red nodes, which is the same as the number of mutations occuring in the chain region of for a Fitch-extension of . Then,

 m1 =lf(T1)−lfA(TA)−lfB(TB) and m2 =lf(T2)−lfC(TC)−lfD(TD).

In addition, let and then we have

 d\textscMP(T1,T2)=lfC(TC)+lfD(TD)−lfA(TA)−lfB(TB)+p. (3)

First we shall show that . To this end, fix a Fitch-extension of to , and consider an extension of to obtained by combining a minimum extension of to , a minimum extension of to , and exactly mimicking on the red nodes of (as indicated in Figure 1). Then compared with , the extension creates at most two new mutations on the chain region (i.e. edges and ). In other words, we have . Together with and , this implies

 p = lf(T2)−lf(T1)−lfC(TC)−lfD(TD)+lfA(TA)+lfB(TB) (4) = lf(T2)−m1−lfC(TC)−lfD(TD) ≤ Δ(¯¯¯f2)−m1−lfC(TC)−lfD(TD) ≤ 2.

Next we show . Consider a new (not necessarily optimal) character obtained from by reassigning all the taxa in to a new state that does not appear anywhere on . Considering Fitch-extensions of to and to we observe that and will both incur exactly 2 mutations in their chain regions, namely on edges , and , , respectively. That is, we have

 lf∗(T1)=lfA(TA)+lfB(TB)+2  and  lf∗(T2)=lfC(TC)+lfD(TD)+2. (5)

Since the optimality of implies , by Equation (5) we have

 p = lf(T2)−lf(T1)−lfC(TC)−lfD(TD)+lfA(TA)+lfB(TB) (6) ≥ lf∗(T2)−lf∗(T1)−lfC(TC)−lfD(TD)+lfA(TA)+lfB(TB) = 0.

By Equation (3), the claim will follow from

 d\textscMP(T′1,T′2)≥lfC(TC)+lfD(TD)−lfA(TA)−lfB(TB)+p. (7)

Therefore, to establish main case (I) it is sufficient to establish Equation (7), which will be done through case analysis on . To shorten notation we will write to denote the character on obtained from (which is a character on ) by leaving the states assigned to taxa in intact and assigning states to respectively. (Occasionally we will manipulate to obtain a new character also on , and then the expression is overloaded to denote the reassignment of states to the taxa in the original chain , not the reduced chain.) Since is an integer with , we have the following three cases to consider.

Case 1: . Let where is a state that does not appear elsewhere. Then by the “both trees incurring exactly 2 mutations in their chain regions for Fitch-extensions” reason used in the proof of Equation (5), we have , from which Equation (7) holds.

Case 2: . We require a subcase analysis on .

1. : Let . Consider a state , which is a state that does not appear elsewhere, and the character . If we consider Fitch-extensions of on and on , we see that in there are exactly 2 mutations incurred in the chain region, and in exactly 3, and we are done, because we now have and , so . The latter equality is true because we are in the case where . For brevity we henceforth speak of “an situation” when there are mutations in the chain region in tree and in , so in this case we have a (2,3) situation.

2. : This is symmetrical to the previous case.

3. : This case cannot occur. Intuitively, is “less constrained” than at the roots of the subtrees, so there is no way that can use the chain region to save mutations relative to . More formally, consider a Fitch-extension of to . Then by definition assigns a state from to the root of , and a state from to the root of (where and are not necessarily different). Since , by Observation 2.1, we fix a Fitch-extension of to that maps the root of to . Similarly, we fix a Fitch-extension of to that maps the root of to . Now consider the extension of to obtained by combining , , and exactly mimicking for the red nodes of . Then the number of mutations induced by in the chain region of is exactly the same as that by in the chain region of . In other words, we have , from which we conclude that, if , then

 d\textscMP(T1,T2)=lf(T2)−lf(T1)≤Δ(¯¯¯f2)−lf(T1)=lfC(TC)+lfD(TD)−lfA(TA)−lfB(TB).

In particular, this shows , a contradiction. We will re-use (slight variations of) this argument repeatedly to show that certain subcases cannot occur. For brevity we will refer to it as the less constrained roots argument.

Case 3: . Then we have the following two subcases to consider.

1. : Let and . We take character where does not occur elsewhere. This is a situation.

2. (: By a variant of the less constrained roots argument, we know this case cannot occur as otherwise it leads to , a contradiction.

II: Common chain is pendant in exactly one tree

Without loss of generality we assume that is pendant in and that the situation is as described in Figure 2. Let be an optimal character. Then we have the following two cases.

Case 1: In this first case , so . As in Equation (3) we have,

 d\textscMP(T1,T2)=lfC(TC)−lfA(TA)−lfB(TB)+p. (8)

In this case, because of the usual mimicking construction (i.e. copying the states allocated to the red nodes in , to ) used in the proof of Equation (4). That is, at most 1 extra mutation incurs in (i.e. on the edge 222 Here the mimicking construction must deal with a slight technicality: node in (see Figure 2) does not exist in . However, simply ignoring in this case (and elsewhere mapping to ) has the desired effect: if there is a mutation on edge in then there must have been at least one mutation on the edges and in .. On the other hand follows from an argument similar to that for proving Equation (6). That is, we can always relabel to a new character where , and is a state that does not appear elsewhere. This is either a or a situation, proving that . Hence, in Equation (8), we have , and hence it remains to prove that

 d\textscMP(T′1,T′2)≥lfC(TC)−lfA(TA)−lfB(TB)+p

holds, which will be done by considering the following two subcases.

1. : Suppose first . Let . Note that because otherwise the character would lead to a situation, contradicting . This implies that the character is a situation and we are done. So suppose next . If then let . Clearly . Taking character yields a situation and we are done. Otherwise, . In this situation, let and let . (Clearly, ). Consider character . This is a situation and we are done.

2. : Suppose . Let . If then we take . This is a situation and we are done. If , then let be an arbitrary element of . We take , this is a situation and we are done. The only subcase that remains is , but this cannot happen by the less constrained roots argument.

Case 2: We have , so . In such a case we have

 d\textscMP(T1,T2)=lfA(TA)+lfB(TB)−lfC(TC)+p. (9)

We have , by the usual mimicking argument, but this time the red nodes in copy their states from and not the other way round. (Nodes and in should both be assigned the state that is assigned to in ). Also, because we can relabel to a new character where is a state that does not appear elsewhere. This is a situation. Hence, . and hence it remains to prove that

 d\textscMP(T′1,T′2)≥lfA(TA)+lfB(TB)−lfC(TC)+p

holds, which will be done by considering the following two subcases.

1. . Take , where is a state that does not appear elsewhere. This is a situation, and we are done.

2. . Suppose . Consider where is a state that does not occur elsewhere and . This is a situation and we are done. The only remaining case is : but this is not possible by the less constrained roots argument.

III: Common chain is pendant in both trees

There are two main situations here: the chains are oriented in the same direction (Figure 3), and the chains are oriented in the opposite direction (Figure 4). Whichever situation occurs, we can assume without loss of generality that , so . As in Equation (3) we have,

 d\textscMP(T1,T2)=lfC(TC)−lfA(TA)+p. (10)

Note that we have by the familiar trick of assigning all the taxa in a state that does not occur elsewhere and by the mimicking construction. It remains to show that

 d\textscMP(T′1,T′2)≥lfC(TC)−lfA(TA)+p

holds, which can be done by considering the following three cases:

Case 1: . In this case we can just take where is a state that does not appear elsewhere: this is a situation, and we are done.

Case 2: and we are in the same-direction situation. Observe that cannot hold by the less constrained roots argument. So . Let . Consider the character where is a state that does not appear elsewhere. This is a situation and we are done.

Case 3: and we are in the opposite-direction situation. Then take where and is a state that does not occur elsewhere. This is a situation (note that here we are exploiting the fact that is reversed in relative to , the status of is not relevant here), so we are done. ∎

Note that Theorem 3.1 is in some sense best possible, since reducing common chains to length 3 can potentially alter ; see Figure 5 for a concrete example. Here (due to character ) and - due to the agreement forest - so . However, (achieved by character ); the fact that can be verified computationally.

The chain reduction can easily be performed in polynomial time, and it can be applied at most a polynomial number of times because each application of the reduction reduces the number of taxa by at least 1. Hence, we obtain the following corollary.

Corollary 3.1.

Let and be two unrooted binary trees on the same set of taxa . Then it is possible to transform to in polynomial time such that all common chains in have length at most 4 and .

4 A generalized subtree reduction

Let and be two unrooted binary trees on a set of taxa . A split (on ) is simply a bipartition of i.e. , , . For a phylogenetic tree on , we say that edge induces a split if, after deleting , is the subset of taxa appearing in one connected component and is the subset of taxa appearing in the other.

Consider . We say that and have a common pendant subtree ignoring root location (i.r.l.) on if (1) for , contains an edge such that induces a split in and (2) . Now, assume without loss of generality that for , is the endpoint of edge that is closest to taxon set . The node can be used to “root” , yielding a rooted binary phylogenetic tree on which we denote . If and have the additional property that (where here the equality operator is acting over rooted trees), then we say that