Parameterized Algorithms for
the Maximum Agreement Forest Problem
on Multiple Rooted Multifurcating Trees

Feng Shi     Jianer Chen   Qilong Feng   Jianxin Wang (corresponding author, email: jxwang@mail.csu.edu.cn)

School of Information Science and Engineering
Central South University
Changsha 410083, P.R. China
Department of Computer Science and Engineering
Texas A&M University
College Station, Texas 77843, USA
Abstract

The Maximum Agreement Forest problem has been extensively studied in phylogenetics. Most previous work is on two binary phylogenetic trees. In this paper, we study a generalized version of the problem: the Maximum Agreement Forest problem on multiple rooted multifurcating phylogenetic trees, from the perspective of fixed-parameter algorithms. By taking advantage of a new branch-and-bound strategy, two parameterized algorithms, with running times and , respectively, are presented for the hard version and the soft version of the problem, which correspond to two different biological meanings to the polytomies in multifurcating phylogenetic trees.

1 Introduction

Phylogenetic trees (alternatively called evolutionary trees) are an invaluable tool in phylogenetics that are used to represent the evolutionary histories of homologous regions of genomes from a collection of extant species or, more generally, taxa. However, due to reticulation events, such as hybridization, recombination, or lateral gene transfer (LGT) in evolution, phylogenetic trees constructed by different regions of genomes may have different structures. Since the reticulation events can be studied by examining these differences in structures, several metrics, such as Robinson-Foulds distance [1], Nearest Neighbor Interchange (NNI) distance [2], Hybridization number [3], Tree Bisection and Reconnection (TBR) distance, and Subtree Prune and Regraft (SPR) distance [4, 5], have been proposed in the literature to compare these different phylogenetic trees. Among these metrics, the SPR distance has been studied extensively for investigating phylogenetic inference [6], lateral genetic transfer [7, 8], and MCMC search [9].

Given two phylogenetic trees on the same collection of taxa, the SPR distance between the two trees is defined to be the minimum number of “Subtree Prune and Regraft” operations [10] needed to convert one tree to the other. Since the Subtree Prune and Regraft operation has been widely used as a method to model a reticulation event, the SPR distance provides a lower bound on the number of reticulation events needed to reconcile the two phylogenetic trees [11], which can give an indication of how reticulation events influence the evolutionary history of the taxa under consideration.

For the study of SPR distance, Hein et al. [12] proposed the concept of maximum agreement forest (MAF) for two phylogenetic trees, which is a common subforest of the two trees with the minimum order among all common subforests of the two trees (the order of a forest is defined as the number of connected components of the forest). Bordewich and Semple [13] proved that the order of an MAF for two rooted binary phylogenetic trees minus 1 is equal to their rSPR distance. Since then, much work has focused on the Maximum Agreement Forest problem on two rooted binary phylogenetic trees, which asks for an MAF for the two trees.

Biological researchers traditionally assumed that phylogenetic trees were bifurcating [14, 15], which is why most earlier work focused on the Maximum Agreement Forest problem for binary trees. However, more recent research in biology and phylogenetics has called for studying the problem on general trees. For example, for many biological data sets in practice [16, 17], the constructed phylogenetic trees contain polytomies (alternatively called multifurcations). There are two different meanings to the polytomies in phylogenetic trees: (1) the polytomy refers to an event during which an ancestral species gave rise to more than two offspring species at the same time [18, 19, 20, 21], which is called a hard polytomy; (2) the polytomy refers to ambiguous evolutionary relationships as a result of insufficient information, which is called a soft polytomy. Note that the types of polytomies in the phylogenetic trees have a substantial impact on designing algorithms for comparing these trees. For example, a soft polytomy with three leaves is not considered different from two resolved bifurcations of the same three leaves , as the soft polytomy is ambiguous rather than conflicting, and the soft polytomy can be binary resolved as . On the other hand, if the polytomy is hard, then and are considered different as the hard polytomy is interpreted as simultaneous speciation. In this paper, we study two versions of the Maximum Agreement Forest problem on rooted multifurcating trees: (1) the hard version, which assumes that all polytomies in the multifurcating phylogenetic trees are hard; and (2) the soft version, which assumes that all polytomies in the multifurcating phylogenetic trees are soft.

Because of the two types of polytomies, two types of rSPR distance are defined. Given two rooted multifurcating phylogenetic trees and , the hard rSPR distance between and is defined as the minimum number of rSPR operations needed to transform one tree into the other under the assumption that all polytomies in the two trees are hard (the relationship between MAF and the metric of rSPR distance on binary trees can be naturally extended to that on multifurcating trees [22, 23]), and the soft rSPR distance between and is defined as the minimum rSPR distance between all pairs of binary resolutions of and  [22]. Clearly, the hard rSPR distance captures all structural differences between the two trees, and the soft rSPR distance only captures the structural differences that cannot be reconciled by resolving the multifurcations appropriately. The hard rSPR distance between two multifurcating phylogenetic trees corresponds to their MAF under the assumption that all polytomies are hard, and the soft rSPR distance between two multifurcating phylogenetic trees corresponds to their MAF under the assumption that all polytomies are soft.

For the same collection of taxa, multiple (i.e., two or more) different phylogenetic trees may be constructed based on different data sets or different building methods. Studying the Maximum Agreement Forest problem on multiple phylogenetic trees has more biological meaning than that on two trees. For example, suppose that we have two phylogenetic trees that are constructed by two homologous regions of genomes from a collection of taxa. As mentioned above, studying the order of their MAF can indicate how reticulation events influence the evolutionary histories of two homologous regions of the genomes. Note that these reticulation events that influenced the evolutionary histories of the two homologous regions of the genomes may also influence the evolutionary histories of other homologous regions of the genomes. Thus, if we construct phylogenetic trees for each homologous region of the genomes, and study their MAF, then the order of their MAF can give a more comprehensive indication of the extent to which reticulation has influenced the evolutionary history of the collection of taxa. Moreover, consider an MAF (hard version or soft version) of order for a set of rooted phylogenetic trees. Since is also an agreement forest (not necessarily an MAF) for any two trees and in , the (hard or soft) rSPR distance between and would not be greater than . Thus, the order of an MAF for provides an upper bound for the rSPR distance between any two trees in . Last but not least, constructing an MAF for multiple phylogenetic trees is a critical step in studying the reticulate networks with the minimum number of reticulation vertices for multiple phylogenetic trees [24], which is a hot topic in phylogenetics. 
The reason is that among all reticulate networks for the given multiple phylogenetic trees, the number of reticulation vertices in the reticulate network with the minimum number of reticulation vertices is equal to the order of an MAF for the given multiple phylogenetic trees minus one if the MAF is acyclic [25].

To summarize, the Maximum Agreement Forest problem on multiple rooted multifurcating phylogenetic trees is well motivated. In this paper, we will focus on parameterized algorithms for the two versions (the hard version and the soft version) of the Maximum Agreement Forest problem on multiple rooted multifurcating phylogenetic trees. In the following, we first review previous related work on the Maximum Agreement Forest problem. Note that there are two kinds of phylogenetic trees, rooted or unrooted. The only distinction between the two kinds of phylogenetic trees is whether an ancestor-descendant relation is defined in the tree. Although in this paper we only study rooted phylogenetic trees, we also present previous related work on unrooted phylogenetic trees. In particular, Allen and Steel [10] proved that the TBR distance between two unrooted binary phylogenetic trees is equal to the order of their MAF minus 1.

In terms of the computational complexity of the problems, it has been proved that computing the order of an MAF is NP-hard and MAX SNP-hard for two unrooted binary phylogenetic trees [12], as well as for two rooted binary phylogenetic trees [13].

Approximation Algorithms. For the Maximum Agreement Forest problem on two rooted binary phylogenetic trees, Hein et al. [12] proposed an approximation algorithm of ratio . However, Rodrigues et al. [26] found a subtle error in [12], showed that the algorithm in [12] has ratio at least , and presented a new approximation algorithm which they claimed has ratio . Bordewich and Semple [13] corrected the definition of an MAF for the rSPR distance. Using this definition, Bonet et al. [27] provided a counterexample and showed that, with a slight modification, both the algorithms in [12] and [26] compute a -approximation of the rSPR distance between two rooted binary phylogenetic trees in linear time. The approximation ratio was improved to 3 by Bordewich et al. [11], but the running time of the algorithm is increased to . A second 3-approximation algorithm presented in [28] achieves a running time of . Whidden et al. [29] presented the third 3-approximation algorithm, which runs in linear time. Shi et al. [30] improved the ratio to 2.5, but the algorithm has running time . Recently, Schalekamp [31] presented a 2-approximation algorithm by LP duality (the running time is polynomial, but the exact order of the running time is not clear), which is the best known approximation algorithm for the Maximum Agreement Forest problem on two rooted binary trees. For the Maximum Agreement Forest problem on two unrooted binary phylogenetic trees, Whidden et al. [29] presented a linear-time approximation algorithm of ratio , which is currently the best algorithm for the problem.

There are also several approximation algorithms for the Maximum Agreement Forest problem on two multifurcating phylogenetic trees. For the Maximum Agreement Forest problem on two rooted multifurcating phylogenetic trees, Rodrigues et al. [28] developed an approximation algorithm of ratio for the hard version, with running time , where is the maximum number of children a node in the input trees has. van Iersel et al. [32] presented a 4-approximation algorithm with polynomial running time for the soft version. Recently, Whidden et al. [22] gave an improved -approximation algorithm with running time for the soft version. For the Maximum Agreement Forest problem on two unrooted multifurcating phylogenetic trees, Chen et al. [23] developed a -approximation algorithm with running time for the hard version.

For the Maximum Agreement Forest problem on multiple rooted binary phylogenetic trees, Chataigner [33] presented a polynomial-time approximation algorithm of ratio 8. Recently, Mukhopadhyay and Bhabak [34] and Chen et al. [35] independently developed two 3-approximation algorithms. The running times of the two algorithms in [34] and [35] are and respectively, where denotes the number of leaves in each phylogenetic tree, and denotes the number of phylogenetic trees in the input instance. For the Maximum Agreement Forest problem on multiple unrooted binary trees, Chen et al. [35] presented a -approximation algorithm with running time . To the best of our knowledge, there is no known approximation algorithm for the Maximum Agreement Forest problem on multiple rooted (unrooted) multifurcating phylogenetic trees.

Parameterized Algorithms. Parameterized algorithms for the Maximum Agreement Forest problem, parameterized by the order of an MAF, have also been studied. A parameterized problem is fixed-parameter tractable [36] if it is solvable in time , where is the input size and is a computable function only depending on the parameter . For the Maximum Agreement Forest problem on two unrooted binary phylogenetic trees, Allen and Steel [10] showed that the problem is fixed-parameter tractable. Hallett and McCartin [11] developed a parameterized algorithm of running time for the Maximum Agreement Forest problem on two unrooted binary phylogenetic trees. Whidden and Zeh [29] further improved the time complexity to . For the Maximum Agreement Forest problem on two rooted binary phylogenetic trees, Bordewich et al. [11] developed a parameterized algorithm of running time . Whidden et al. [37] improved this bound and developed an algorithm of running time . Chen et al. [38] presented an algorithm of running time , which is the best known result of the Maximum Agreement Forest problem on two rooted binary phylogenetic trees.

There are also several parameterized algorithms for the Maximum Agreement Forest problem on two multifurcating phylogenetic trees. Whidden et al. [22] presented an algorithm of running time for the soft version of the Maximum Agreement Forest problem on two rooted multifurcating phylogenetic trees. Shi et al. [39] presented an algorithm of running time for the hard version of the Maximum Agreement Forest problem on two unrooted multifurcating phylogenetic trees. Chen et al. [23] developed an improved algorithm of running time , which is the best known result for the hard version of the Maximum Agreement Forest problem on two unrooted multifurcating phylogenetic trees.

For the Maximum Agreement Forest problem on multiple rooted binary phylogenetic trees, Chen et al. [24] presented a parameterized algorithm of running time (the notation O* means that the polynomial factors of the time complexity are omitted). Shi et al. [40] improved this bound and developed an algorithm of running time . For the Maximum Agreement Forest problem on multiple unrooted binary phylogenetic trees, Shi et al. [40] presented the first parameterized algorithm of running time . To the best of our knowledge, there is no known parameterized algorithm for the Maximum Agreement Forest problem on multiple rooted (unrooted) multifurcating phylogenetic trees.

Our Contributions. In this paper, we focus on fixed-parameter algorithms for the two versions (the hard version and the soft version) of the Maximum Agreement Forest problem on multiple rooted multifurcating phylogenetic trees (the Maf problem). The general idea of our algorithms is similar to that of the previous parameterized algorithms for the Maximum Agreement Forest problem: remove edges from trees to reconcile the structural differences among them, then use the relation between the number of edges removed by the algorithm and the order of the resulting forest to design a branch-and-bound parameterized algorithm.

All previous parameterized algorithms employed the following strategy: (1) fix a tree and try to find a local structure in other trees that conflicts with the fixed tree; then (2) remove edges from the fixed tree to reconcile the structural difference. As a consequence, all branching operations are applied only on the fixed tree. Obviously, this way is convenient for analyzing the time complexity of the algorithm, because each branching operation would increase the order of the resulting forest in the fixed tree and the order of the resulting forest cannot be greater than the order of the MAF that we are looking for. However, this way does not take full advantage of the structural information given by all the trees. For example, there may exist a local structure in the fixed tree such that the corresponding branching operation on the other trees has better performance.

By careful and detailed analysis on the structures of phylogenetic trees, we propose a new branch-and-bound strategy such that the branching operations can be applied on different phylogenetic trees in the input instance. Then by making full use of special relations among leaves in phylogenetic trees, two parameterized algorithms for the Maf problem are presented: one is for the hard version of the Maf problem with running time , which is the first fixed-parameter algorithm for the hard version of the problem; and the other is for the soft version of the Maf problem with running time , which is also the first fixed-parameter algorithm for the soft version of the problem.

The rest of the paper is structured as follows. Section 2 gives related definitions for multifurcating phylogenetic trees and the problem formulation. Detailed presentation and analysis of our algorithm for the hard version of the Maf problem is given in Sections 3-5. The analysis of the algorithm for the soft version of the Maf problem is given in Section 6, in a similar way to that for the hard version. The conclusion is presented in Section 7.

2 Definitions and Problem Formulations

The notation and definitions in this paper follow those in [40]. All graphs in our discussion are undirected. For a vertex , denote the set of neighbors of by , and the degree of is equal to . Denote by the edge whose two ends are the vertices and . A tree is a single-vertex tree if it consists of a single vertex, which is the leaf of . A tree is a single-edge tree if it consists of an edge with two leaves. A tree is multifurcating if either it is a single-vertex tree or each of its vertices has degree either 1 or not less than 3. For a multifurcating tree that is not a single-vertex tree, the degree-1 vertices are leaves and the other vertices are non-leaves.

2.1 -tree, -forest

A label-set is a set of elements that are called “labels”. For a label-set , a multifurcating phylogenetic -tree is a multifurcating tree whose leaves are labeled bijectively by the label-set . A multifurcating phylogenetic -tree is rooted if a particular leaf is designated as the root (so it is both a root and a leaf) – in this case a unique ancestor-descendant relation is defined in the tree. The root of a rooted multifurcating phylogenetic -tree will always be labeled by a special label , which is always assumed to be in the label-set . In the following, a rooted multifurcating phylogenetic -tree is simply called an -tree. As there is a bijection between the leaves of an -forest and the labels in the label-set , we will use, without confusion, a label in to refer to the corresponding leaf in the -forest, or vice versa.

A subforest of an -tree is a subgraph of , and a subtree of is a connected subgraph of ; in both cases, we assume that the subgraph contains at least one leaf in . For a subtree of a rooted -tree , in order to preserve the ancestor-descendant relation in , a vertex in should be defined to be the root of . If contains the label , then it is the root of ; otherwise, the node in that is the least common ancestor of all the labeled leaves in is defined to be the root of . An -forest is a subforest of an -tree that contains a collection of subtrees whose label-sets are disjoint such that the union of the label-sets is equal to . The number of connected components in an -forest is called the order of , denoted by .
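As a small illustration of the order of a forest, the following Python sketch counts the connected components of an undirected forest given as a vertex list and an edge list. The function name `forest_order` and the input representation are assumptions made for this illustration only; the forced contraction is not modeled here.

```python
from collections import defaultdict


def forest_order(vertices, edges):
    """Return the number of connected components of an undirected forest.

    vertices: iterable of hashable vertex names (hypothetical representation)
    edges:    iterable of (u, v) pairs, each an undirected edge
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    seen = set()
    order = 0
    for start in vertices:
        if start in seen:
            continue
        # Every unseen vertex starts a new connected component.
        order += 1
        seen.add(start)
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
    return order
```

In this representation, removing one edge from a forest increases the returned value by exactly one, which is the quantity the branching algorithms keep track of.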

For any vertex in an -forest , denote by the set containing all labels that are descendants of . For any subset of vertices in , denote by the union of for all . For a connected component in , denote by the set containing all labels in . For a subset of label-set , where the labels in are in the same connected component of , denote by the minimum subtree induced by the labels of in .

A subtree of an -tree may contain unlabeled vertices of degree less than 3. In this case the forced contraction operation is applied on , which replaces each degree-2 vertex and its incident edges with a single edge connecting the two neighbors of , and removes each unlabeled vertex that has degree 1. However, in order to preserve the ancestor-descendant relation in , if the root of is of degree 2, then the operation will not be applied on . Since each connected component of an -forest contains at least one labeled leaf, the forced contraction does not change the order of the -forest. It is well-known (see, e.g., [11, 41]) that the forced contraction operation does not affect the construction of an MAF for -trees. Therefore, we assume that the forced contraction is applied immediately whenever it is applicable. An -forest is irreducible if the forced contraction cannot be applied to . Thus, the -forests in our discussion are assumed to be irreducible. With this assumption, in each (irreducible) -forest , the root of each connected component is either an unlabeled vertex of degree at least 2, or the vertex labeled with of degree 1, or a labeled vertex of degree 0, and each unlabeled vertex in that is not the root of has degree at least 3.
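The forced contraction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: a rooted component is modeled as a nested list (a hypothetical representation) in which a string is a labeled leaf and a list is an unlabeled vertex with its children, an empty list stands for an unlabeled vertex whose subtrees have all been cut away, and the special handling of the root label is left out.

```python
def contract(node):
    """Forced contraction of a rooted subtree in nested-list form.

    A str is a labeled leaf; a list is an unlabeled vertex with children.
    Returns the contracted subtree, or None if no labeled leaf remains.
    """
    if isinstance(node, str):
        return node  # labeled leaves are never removed

    # Contract the children first and drop those with no labeled leaf left.
    kids = [c for c in (contract(child) for child in node) if c is not None]

    if not kids:
        return None  # unlabeled vertex with nothing below it: remove it
    if len(kids) == 1:
        # A non-root vertex here has degree 2 and is suppressed; an unlabeled
        # root here has degree 1 and is removed. Either way the child remains.
        return kids[0]
    # At least two children: the vertex stays (a root with two children is kept).
    return kids
```

For instance, under this representation `contract([["a", []], ["b"], "c"])` suppresses both inner vertices and leaves a single vertex with the three leaf children.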

For two -forests and , if there is a graph isomorphism between and in which each leaf of is mapped to a leaf of with the same label, then and are isomorphic. We will simply say that an -forest is a subforest of another -forest if is isomorphic to a subforest of (up to the forced contraction).

2.2 Binary Resolution of -forest

An -tree is binary if either it is a single-vertex tree or each of its vertices has degree either 1 or 3 (we treat the binary -tree as a special type of -tree). A binary -forest is defined analogously.

Given two -forests and , is a binary resolution of if is a binary -forest and can be obtained by contracting some internal edges (i.e., edges between non-leaves) in . Note that if -forest is binary, then itself is the unique binary resolution of . Given two -forests and , is a binary subforest of if is a binary -forest, and there exists a binary resolution of such that is a subforest of .

2.3 Agreement Forest

Given a collection of -forests. An -forest is a hard agreement forest for if is a subforest of , for all . An -forest is a soft agreement forest for if is a binary subforest of , for all .

A hard maximum agreement forest (hMAF) for is a hard agreement forest for with the minimum order over all hard agreement forests for . The soft maximum agreement forest (sMAF) is defined analogously.

The two versions of the Maximum Agreement Forest problem on multiple -forests studied in this paper are formally defined as follows.

Hard Maximum Agreement Forest problem (hMaf)
Input: A set of -forests, and a parameter
Output: a hard agreement forest for whose order is not larger than
               , where is the -forest in that has the
               largest order; or report that no such hard agreement forest exists.

Soft Maximum Agreement Forest problem (sMaf)
Input: A set of -forests, and a parameter
Output: a soft agreement forest for whose order is not larger than
               , where is the -forest in that has the
               largest order; or report that no such soft agreement forest exists.

2.4 Siblings, Sibling-set, Sibling-pair

Two leaves of an -forest are siblings if they have a common parent. A sibling-set of is a set of leaves that are all siblings. A maximal sibling-set (MSS) of is a sibling-set such that the common parent of the leaves in has degree either if has no parent or if has a parent. A sibling-pair is an MSS that contains exactly two leaves.
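The definitions of this subsection can be illustrated with a short Python sketch. It assumes that the degree condition on the common parent means that every child of that parent is a leaf of the sibling-set, and it uses a hypothetical nested-list representation (a string is a labeled leaf, a list is an unlabeled vertex with its children); the root label is not treated specially.

```python
def maximal_sibling_sets(node):
    """Collect the maximal sibling-sets of a rooted tree in nested-list form.

    Assumption: an MSS is the full child set of a vertex whose children
    are all leaves.
    """
    if isinstance(node, str):
        return []  # a single leaf has no siblings below it
    if node and all(isinstance(child, str) for child in node):
        return [set(node)]  # all children are leaves: one MSS
    found = []
    for child in node:
        found.extend(maximal_sibling_sets(child))
    return found


def sibling_pairs(node):
    """Return the maximal sibling-sets that contain exactly two leaves."""
    return [s for s in maximal_sibling_sets(node) if len(s) == 2]
```

Note that the leaf `"f"` in a tree such as `[["a", "b"], ["c", "d", "e"], "f"]` belongs to no MSS under this reading, because its parent also has non-leaf children.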

2.5 Label-set Isomorphism Property, Essential Edge-set

Two -forests and satisfy the label-set isomorphism property if for each connected component in , there is a connected component in such that . An instance of the hMaf (or sMaf) problem satisfies the label-set isomorphism property if any two -forests in the instance satisfy the label-set isomorphism property.
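The label-set isomorphism property only depends on the label-sets of the connected components, so it can be checked without looking at the tree structure. The Python sketch below assumes that each forest is given as a list of label-sets, one per connected component (a hypothetical representation), and that both forests partition the same label-set; under that assumption the one-directional condition in the definition is equivalent to the two partitions being equal.

```python
from itertools import combinations


def label_partition(components):
    """Turn a list of component label-sets into a set of frozensets."""
    return {frozenset(labels) for labels in components}


def satisfy_label_set_isomorphism(forest_a, forest_b):
    """Check the property for two forests over the same label-set."""
    return label_partition(forest_a) == label_partition(forest_b)


def instance_satisfies_property(forests):
    """Check the property for every pair of forests in an instance."""
    return all(
        satisfy_label_set_isomorphism(f, g) for f, g in combinations(forests, 2)
    )
```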

Given an -forest and a subset of edges in , denote by the -forest with the edges in removed (up to the forced contraction). The edge-set is an essential edge-set (ee-set) of if . Note that it is easy to test if an edge-set is an ee-set of the given -forest.

3 Instance Satisfying Label-set Isomorphism Property

The hMAF (or sMAF) for the -forests in an instance of the hMaf problem (or the sMaf problem), is simply called the MAF for the -forests in . This section and the following Sections 4-5 are for the hMaf problem.

Every MAF for the -forests in an instance of the hMaf problem corresponds to a unique minimum subgraph of , for , which consists of the paths in that connect the leaves in the same connected component in . Thus, for any edge in , without any confusion, we can simply say that is in or is not in the MAF , as long as is in or is not in the corresponding subgraph , respectively.

Given an instance of the hMaf problem. If does not satisfy the label-set isomorphism property, then two rules given in the following subsection can be applied to eliminate the difference among the label-sets of the connected components in the -forests in . Denote by the maximum order of an -forest in .

3.1 Two Rules

Reduction Rule 1. Let () be a subset of the connected components in the -forest , . If there is a vertex in a connected component of the -forest , , such that , then remove the edge between and ’s parent (if one exists) in .

For the situation of Reduction Rule 1, we say that Reduction Rule 1 is applicable on relative to . Let be the instance obtained by applying Reduction Rule 1 on with edge removed from . By the formulation of the hMaf problem given in the previous section, we have that . Thus, if , then and , otherwise, . For instances and , we have the following lemma.

Lemma 3.1

Instances and have the same collection of solutions.


Proof. Firstly, we show that every agreement forest for is also an agreement forest for . Suppose is an agreement forest for . Let and . Since is a subforest of , for each connected component in , , we have that any label of cannot be in the same connected component with any label of in . Thus, any label of cannot be in the same connected component with any label of in .

Suppose that edge is in . Then there would exist a path in that connects a label of and a label of , contradicting the fact that any label of cannot be in the same connected component with any label of in . Thus, edge cannot be in and is still a subforest of . Therefore, is also an agreement forest for .

In the following, we show that every agreement forest for is also an agreement forest for . Suppose that is an agreement forest for . Since is a subforest of , is also a subforest of . Therefore, is also an agreement forest for .

By the above analysis, and have the same collection of agreement forests. Since , -forest is a solution of if and only if is also a solution of .

In the following discussion, we assume that Reduction Rule 1 is not applicable on the given instances. Our second rule is a branching rule. We first give some related definitions. We say that a branching rule is safe if on an instance it produces a collection of instances such that is a yes-instance if and only if at least one of the instances in is a yes-instance. A branching rule satisfies the recurrence relation if on an instance , it produces instances , , . We also say that the branching rule satisfies the recurrence relation () if the positive root of the characteristic polynomial of is not larger than that of (see [42] for more discussions). Moreover, we assume that the function is non-decreasing.
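To make the comparison of recurrence relations concrete, the following Python sketch computes the positive root of the characteristic polynomial of a branching recurrence numerically. It assumes the recurrence has the usual form T(k) = T(k - a_1) + ... + T(k - a_r) with r >= 2 and all a_i > 0, so that the root is the unique x > 1 with x^(-a_1) + ... + x^(-a_r) = 1; the function name `branching_number` and the bisection approach are choices made for this illustration.

```python
def branching_number(vector, tol=1e-12):
    """Positive root of the characteristic polynomial of
    T(k) = sum(T(k - a) for a in vector), found by bisection.

    Assumes at least two branches and strictly positive decreases.
    """
    if len(vector) < 2 or any(a <= 0 for a in vector):
        raise ValueError("need at least two branches with positive decreases")

    def g(x):
        # g is strictly decreasing for x > 0, and g(1) = r - 1 > 0.
        return sum(x ** (-a) for a in vector) - 1.0

    lo, hi = 1.0, 2.0
    while g(hi) > 0:  # enlarge the bracket until the root lies in [lo, hi]
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A rule whose vector has a smaller root than another rule's vector satisfies the better of the two recurrences in the sense described above; for example, the vector (1, 2) gives a smaller root than (1, 1).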

Case 1. For a connected component in , , there exists a vertex with two children and in the connected component of , , such that and .

Branching Rule 1. Branch into two ways: [1] remove the edge in ; [2] remove the edge in .

Figure 1: The general structure of connected component . The triangles and circles denote subtrees. The label-sets of and belong to , where . The label-sets of and do not belong to , where .

Figure 1 gives an illustration of Case 1, for which we will say that Branching Rule 1 is applicable on relative to . It is necessary to remark that there exists at least one label in that is in the connected component – otherwise, the edge could be removed by Reduction Rule 1. We have the following two observations for Case 1.

Observation 1

For each of the two edges and , there are two labels such that the edge is on the path connecting the two labels in , and the two labels are in the same connected component of .

Observation 2

For any -forest in , , there are two labels and that are in the same connected component of . There are also two labels and that are in the same connected component of .

Lemma 3.2

Branching Rule 1 is safe.


Proof. Let be an agreement forest for the -forests in . If both edges and are in , then there would be a label in and a label in that are in the same connected component of . However, this is impossible because is a connected component of and is a subforest of , so a connected component of cannot have both labels in and labels in .

Thus, at least one of the edges and is not in , which is an arbitrary agreement forest for . Consequently, at least one of the two branches in Branching Rule 1 is correct. Thus, the rule is safe.       

Lemma 3.3

Any instance of the hMaf problem on which Reduction Rule 1 and Branching Rule 1 are not applicable satisfies the label-set isomorphism property.


Proof. It suffices to prove that if neither Reduction Rule 1 nor Branching Rule 1 is applicable on either of the two -forests and relative to the other, then and satisfy the label-set isomorphism property. Suppose to the contrary that there are two connected components and of and respectively, such that and .

(1). Suppose that one of and is a proper subset of the other. By symmetry, we can assume . If there is a vertex in such that , then the edge between and the parent of would be removed by Reduction Rule 1, contradicting the assumption that Reduction Rule 1 is not applicable on relative to . If there is no such vertex , then there must be a vertex with two children and in such that and . Then, Branching Rule 1 would be applicable to the edges and , contradicting the assumption that Branching Rule 1 is not applicable on relative to .

(2). If neither of and is a proper subset of the other, then there is a vertex with two children and in such that and . If , then the edge would be removed by Reduction Rule 1; if , then the edge would be removed by Reduction Rule 1. If neither of these is the case, then the edges and would be removed by Branching Rule 1. Thus, all cases would contradict the assumption of the lemma.

Summarizing the above discussions gives the proof of the lemma.       

Let be an arbitrary instance of the hMaf problem on which Reduction Rule 1 is not applicable. If does not satisfy the label-set isomorphism property, then by Lemma 3.3, Branching Rule 1 can be applied, resulting in two instances. If the resulting instances do not satisfy the label-set isomorphism property, then we can repeatedly apply Reduction Rule 1 and Branching Rule 1 until all instances constructed in this process satisfy the label-set isomorphism property.

Let be any of these constructed instances. It is critical for us to know how many times Branching Rule 1 is applied in the process from to . To answer this is not easy because Branching Rule 1 can remove edges from different -forests in the instance. In the following, we first analyze a special process for two -forests and () in the instance , which is called the 2-BR-process on and . Note that Reduction Rule 1 is assumed not applicable on . The 2-BR-process on and consists of the following three stages. Initialize the collection with .

Stage-1. For an instance in , if Branching Rule 1 is applicable on relative to (or on relative to ), then apply Branching Rule 1 on and , and replace the instance in with the two instances resulting from the application of the rule.

Stage-2. For an instance in , if Reduction Rule 1 is applicable, then repeatedly apply Reduction Rule 1 until the rule is not applicable. Replace the instance in with the resulting instance.

Stage-3. Repeatedly apply Stage-1 and Stage-2, in this order, on any instance in in which and do not satisfy the label-set isomorphism property.

At the end of this process on and , in every instance in , the -forests and satisfy the label-set isomorphism property.
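The three stages can be read as a worklist loop over the collection of instances. The following Python sketch is only an illustration under stated assumptions: the instance type and the three callbacks (`is_isomorphic`, `branch`, `reduce_once`) are hypothetical placeholders for the label-set isomorphism test, Branching Rule 1, and one application of Reduction Rule 1 on the two chosen forests. None of these names comes from the paper.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple, TypeVar

Instance = TypeVar("Instance")


def two_br_process(
    start: Instance,
    is_isomorphic: Callable[[Instance], bool],
    branch: Callable[[Instance], Optional[Tuple[Instance, Instance]]],
    reduce_once: Callable[[Instance], Optional[Instance]],
) -> List[Instance]:
    """Sketch of the 2-BR-process as a worklist loop.

    All three callbacks are hypothetical stand-ins:
      is_isomorphic(inst) -- label-set isomorphism test on the two forests
      branch(inst)        -- Branching Rule 1; returns two instances, or
                             None if the rule is not applicable
      reduce_once(inst)   -- one application of Reduction Rule 1; returns
                             the reduced instance, or None if not applicable
    The start instance is assumed to be already reduced.
    """
    pending = deque([start])
    finished: List[Instance] = []
    while pending:
        inst = pending.popleft()
        if is_isomorphic(inst):
            finished.append(inst)
            continue
        # Stage-1: branch on a reduced, non-isomorphic instance.
        children = branch(inst)
        if children is None:
            # By Lemma 3.3 this should not happen for a reduced instance.
            raise ValueError("Branching Rule 1 not applicable")
        for child in children:
            # Stage-2: apply Reduction Rule 1 until it no longer applies.
            while True:
                reduced = reduce_once(child)
                if reduced is None:
                    break
                child = reduced
            # Stage-3: the child is re-examined on a later iteration.
            pending.append(child)
    return finished
```

Every instance returned by this loop passes the isomorphism test, which mirrors the statement above about the end of the process. Termination depends on each rule removing at least one edge, which the sketch assumes of its callbacks rather than enforces.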

3.2 2-BR-process on and

Let be an instance obtained by the 2-BR-process on and of , and let () be the sequence of edges removed by Reduction Rule 1 and Branching Rule 1 during the 2-BR-process on and from to , in which , , is an edge of the instance . Let be the subsequence of that contains all edges removed by Branching Rule 1.

Since and satisfy the label-set isomorphism property, . We study the relations among , , , and . By Observation 1, for an edge in , there is a label-pair such that is on the unique path connecting and in (or ), and and are also in the same connected component of (or ). We call the label-pair a connected label-pair for the edge .

We explain how to find a connected label-pair for the edge . Without loss of generality, assume that is in , where is the parent of . For each connected component of , check if and , where is the connected component of containing the edge . Note that there must be a connected component in that satisfies these conditions – otherwise, Reduction Rule 1 would be applicable on the vertex in relative to . Now arbitrarily picking two labels from and , respectively, will give a connected label-pair for .
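A minimal sketch of this search, written directly from the definition of a connected label-pair: the removed edge must lie on the path between the two labels in one forest, and the two labels must share a connected component in the other forest. The representation is an assumption made for illustration. Forests are reduced to label sets, `below_edge` holds the labels under the lower endpoint of the edge, `component_with_edge` holds the labels of the component containing the edge, and `other_components` lists the label sets of the components of the other forest. All names are hypothetical.

```python
from typing import FrozenSet, Iterable, Optional, Tuple


def find_connected_label_pair(
    below_edge: FrozenSet[str],
    component_with_edge: FrozenSet[str],
    other_components: Iterable[FrozenSet[str]],
) -> Optional[Tuple[str, str]]:
    """Return a label pair whose connecting path crosses the edge.

    The path between two labels of the same component crosses the edge
    exactly when one label lies below the edge and the other does not.
    The pair must also lie inside a single component of the other forest.
    """
    outside = component_with_edge - below_edge
    for comp in other_components:
        lower = comp & below_edge
        upper = comp & outside
        if lower and upper:
            # Any choice works; min() only makes the result deterministic.
            return (min(lower), min(upper))
    return None  # corresponds to the case where Reduction Rule 1 applies
```

For example, with `below_edge = {a, b}`, `component_with_edge = {a, b, c, d}` and other-forest components `{a}`, `{b, d}`, `{c}`, the sketch returns the pair `(b, d)`.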

Let be the sequence that contains a connected label-pair for each edge in . Since a label may appear in more than one connected label-pair in , we construct several dummy labels for it to simplify the analysis. For example, if label appears in three connected label-pairs in , then we construct three dummy labels , , and for it, and replace the label in the three connected label-pairs with , , and , respectively. By this operation, each (dummy) label appears in only one connected label-pair in . We also say that is the dummy label of itself if label appears in only one connected label-pair in .
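The dummy-label construction is a simple renaming pass over the sequence of connected label-pairs. The sketch below assumes that labels are strings and that a dummy label can be written as the original label with a numeric suffix. The function name and the suffix format are illustrative choices, not notation from the paper.

```python
from collections import Counter
from typing import List, Tuple


def assign_dummy_labels(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Rename repeated labels so each dummy label occurs in one pair only.

    A label that occurs in a single pair is kept unchanged (it is its own
    dummy label). A label that occurs k > 1 times is replaced by
    label_1, ..., label_k in order of appearance. The suffix format is an
    assumption and must not collide with existing labels.
    """
    total = Counter(label for pair in pairs for label in pair)
    seen: Counter = Counter()

    def rename(label: str) -> str:
        if total[label] == 1:
            return label
        seen[label] += 1
        return f"{label}_{seen[label]}"

    return [(rename(a), rename(b)) for a, b in pairs]


# Label "a" appears in three pairs, so it receives three dummy labels.
renamed = assign_dummy_labels([("a", "b"), ("a", "c"), ("d", "a")])
```

After the renaming, the number of distinct dummy labels is exactly twice the number of connected label-pairs, which is the property the counting argument relies on.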

Let be the set of the dummy labels that appear in . We say that the label is in if some dummy label of is in , that the connected component of an -forest contains the dummy label if is a dummy label of some label in , and that two dummy labels and of are not in the same connected component of an -forest if labels and are not in the same connected component of , where and are the dummy labels of and , respectively.

Lemma 3.4

.


Proof. Given a connected component of an -forest , denote by the subset of such that for each label , if is in , then contains all dummy labels of that are in – otherwise, does not contain any dummy label of . Note that all dummy labels of a label are always in the same connected component of , hence either contains all dummy labels of the label or contains none.

Let , , be the connected components of . If for all , then the lemma obviously holds true, since the two dummy labels of each connected label-pair are in the same connected component of . Thus, we can assume that there is a connected component of such that .

By symmetry, we can assume , so the lemma claims . For an -forest , denote by the number of connected components of that contain dummy labels in , and by the number of connected components of that do not contain any dummy label in . Then , , and the lemma can be proved by showing

(1)

We prove the inequality (1) by induction on . If , then and . Since , the inequality (1) holds true for .

Now consider and . Since and are in the same connected component of and are in different connected components of , and . Combining this with the fact gives the inequality (1) when .

For the general case , where , let , and let be the connected label-pair for . Let and . Since for any , we have

Since we assumed that Reduction Rule 1 is not applicable on , the first edge in is also the first edge in , i.e., , and .

By the inductive hypothesis for , we have the following inequality for , , , and :

(2)

We divide the argument into two cases.