New Algorithms for Unordered Tree Inclusion
The tree inclusion problem is, given two node-labeled trees and (the “pattern tree” and the “text tree”), to locate every minimal subtree in (if any) that can be obtained by applying a sequence of node insertion operations to . The ordered tree inclusion problem is known to be solvable in polynomial time while the unordered tree inclusion problem is NP-hard. The currently fastest algorithm for the latter is from 1995 and runs in time, where and are the sizes of the pattern and text trees, respectively, and is the degree of the pattern tree. Here, we develop a new algorithm that improves the exponent to by considering a particular type of ancestor-descendant relationships and applying dynamic programming, thus reducing the time complexity to . We then study restricted variants of the unordered tree inclusion problem where the number of occurrences of different node labels and/or the input trees’ heights are bounded and show that although the problem remains NP-hard in many such cases, if the leaves of are distinctly labeled and each label occurs at most times in then it can be solved in polynomial time for and in time for .
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 6110011, Japan.
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China.
Content and Media Science Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan.
keywords: algorithm, tree inclusion, unordered tree, dynamic programming, ancestor-descendant relationship
Tree pattern matching and measuring the similarity of trees are classic problem areas in theoretical computer science. One intuitive and extensively studied measure of the similarity between two rooted, node-labeled trees and is the tree edit distance, defined as the length of a shortest sequence of node insertion, node deletion, and node relabeling operations that transforms into . When and are ordered trees, the tree edit distance can be computed in polynomial time. The first algorithm to achieve this bound ran in time , where is the total number of nodes in and , and it was gradually improved upon until Demaine et al.  presented an -time algorithm thirty years later which was proved to be worst-case optimal among a reasonable class of algorithms. On the other hand, the tree edit distance problem is NP-hard for unordered trees . It is in fact MAX SNP-hard even for binary trees in the unordered case , which implies that it is unlikely to admit a polynomial-time approximation scheme. Akutsu et al. [1, 3] have developed efficient exponential-time algorithms for this problem variant. As for parameterized algorithms, Shasha et al.  developed an -time algorithm for the problem, where and are the number of leaves in and , respectively, and an -time algorithm for the unit-cost edit operation model, where is the edit distance, was given in . See  for a survey of many other related results.
An important special case of the tree edit distance problem known as the tree inclusion problem is obtained when only node insertion operations are allowed. This problem has applications to structured text databases and natural language processing [5, 11, 18]. Here, we assume the following formulation of the problem: given a “text tree” and a “pattern tree” , locate every minimal subtree in (if any) that can be obtained by applying a sequence of node insertion operations to . (Equivalently, one may define the tree inclusion problem so that only node deletion operations on are allowed.) For unordered trees, Kilpeläinen and Mannila  proved the problem to be NP-hard in general but solvable in polynomial time when the degree of the pattern tree is bounded from above by a constant. More precisely, the running time of their algorithm is time, where , , and is the degree of . Bille and Gørtz  gave a fast algorithm for the case of ordered trees, and Valiente  developed an efficient algorithm for a constrained version of the unordered case. Also note that the special case of the tree inclusion problem where node insertion operations are only allowed to insert new leaves corresponds to a subtree isomorphism problem, which can be solved in polynomial time for unordered trees . The extended tree inclusion problem, proposed in , is an optimization problem designed to make the problem more useful for practical tree pattern matching applications, e.g., involving glycan data from the KEGG database , weblogs data , and bibliographical data from ACM, DBLP, and Google Scholar . 
This problem asks for an optimal connected subgraph of (if any) that can be obtained by performing node insertion operations as well as node relabeling operations to while allowing non-uniform costs to be assigned to the different node operations; it was shown in  how to solve the unrooted version in time exponential in and how a further extension of the problem that also allows at most node deletion operations can be solved by an algorithm whose running time depends on .
1.1 Practical Applications
With the rapid advance of AI technology, matching methods for knowledge bases are becoming more important. As a fundamental technique for searching knowledge bases, subtree similarity search has been studied by researchers in the database community. For example, Cohen and Or proposed a subtree similarity search algorithm for various distance functions, while Chang et al. proposed a top-k tree matching algorithm. In the field of natural language processing (NLP), researchers are incorporating deep learning techniques into NLP problems and developing methods for processing and matching parsing/dependency trees. Bibliographic matching is one of the most popular real-world applications of matching. In most cases, a single article has at most two or three versions, and it is very rare for a single article to have two co-authors with the same name. Therefore, it may be reasonable to assume that the leaves of the pattern tree are distinctly labeled and that each label occurs only a small, bounded number of times in the text tree.
1.2 New Results and Organization of the Paper
We improve the exponential contribution to the time complexity of the fastest known algorithm for the unordered tree inclusion problem (Kilpeläinen and Mannila’s algorithm from 1995 ) from to , where is the maximum degree of the pattern tree, so that the time complexity becomes . We then study the problem’s computational complexity for several restricted cases (see Table 1 for a summary) and give a polynomial-time algorithm for when the leaves in are distinctly labeled and every label appears at most twice in . Finally, we derive an -time algorithm for the NP-hard case where the leaves in are distinctly labeled and each label appears at most three times in .
|, , ,||all nodes||NP-hard||Corollary 1|
|, , ,||leaves||NP-hard||Theorem 2|
|,||all nodes||P||Theorem 3|
|,||all nodes||time||Theorem 4|
The paper is organized as follows. Section 2 defines the unordered tree inclusion problem and the concept of minimality, and explains the basic ideas related to the ancestor-descendant relationship. In Section 3, we utilize the ancestor-descendant relationships and dynamic programming to obtain the exponential-factor speedup. Section 4 presents the NP-hardness results for the special cases listed in Table 1. Finally, the polynomial- and exponential-time algorithms for when the leaves in are distinctly labeled and each label appears at most two or three times are developed in Sections 5 and 6, respectively.
From here on, all trees are rooted, unordered, and node-labeled. Let be a tree. A node insertion operation on is an operation that creates a new node having any label and then: (i) attaches as a child of some node currently in and makes become the parent of a (possibly empty) subset of the children of instead of ; or (ii) makes the current root of become a child of and lets become the new root. For any two trees and , we say that is included in if there exists a sequence of node insertion operations such that applying to yields .
The set of vertices in a tree is denoted by . A mapping between two trees and is a subset such that for every , it holds that: (i) if and only if ; and (ii) is an ancestor of if and only if is an ancestor of . is included in if and only if there is a mapping between and such that and and have the same node label for every .
In the tree inclusion problem, the input is two trees and (also referred to as the “pattern tree” and the “text tree”), and the objective is to determine if is included in . Define and , and denote the maximum outdegree of . For any node , let and denote its label and the set of its children. Also let and denote the sets of strict ancestors and strict descendants of , respectively, i.e., where itself is excluded from these sets. For a tree , and denote its root and the set of nodes in . For a node in a tree , is the subtree of induced by . We write if is included in under the condition that corresponds to . For two trees and , denotes that is isomorphic to .
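To make the definition concrete, the following is a minimal brute-force sketch of the decision problem, not one of the algorithms developed in this paper. It uses the equivalent node-deletion view: the root of a text tree is either deleted (its children then become roots of the forest) or matched to the root of a pattern tree with the same label. The tuple representation `(label, children)` and the function names `forest_included` and `is_included` are assumptions made for illustration, and the running time is exponential.

```python
def forest_included(pattern, text):
    """Return True if the forest `pattern` is included in the forest `text`.

    Both arguments are tuples of trees; a tree is (label, tuple_of_children).
    """
    if not pattern:
        return True          # nothing left to embed
    if not text:
        return False         # pattern nodes remain but no text nodes do
    (label, children), rest = text[0], text[1:]

    # Option 1: delete the root of the first text tree; its children
    # become roots of the remaining text forest.
    if forest_included(pattern, children + rest):
        return True

    # Option 2: match this root to the root of some pattern tree with the
    # same label. That pattern tree's children must then be embedded below
    # this text node, and all other pattern trees in the rest of the forest.
    for i, (p_label, p_children) in enumerate(pattern):
        if p_label == label:
            others = pattern[:i] + pattern[i + 1:]
            if (forest_included(p_children, children)
                    and forest_included(others, rest)):
                return True
    return False


def is_included(pattern_tree, text_tree):
    """Decide unordered tree inclusion for two single trees."""
    return forest_included((pattern_tree,), (text_tree,))


# Pattern a(b, c); text a(x(c), b). Deleting x from the text gives the pattern.
pattern_tree = ("a", (("b", ()), ("c", ())))
text_tree = ("a", (("x", (("c", ()),)), ("b", ())))
print(is_included(pattern_tree, text_tree))
```

The sketch is intended only as a reference implementation against which faster procedures can be checked on small inputs.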
The following concept plays a key role in our algorithm.
We say that minimally includes (denoted as ) if holds and there is no such that .
Let . holds if and only if the following conditions are satisfied.
has a set of descendants such that for all .
There exists a bijection from to such that holds for all .
Conditions (1) and (2) are obvious. To prove (3), suppose there exists a bijection from to such that holds for all and does not hold for some . Then, there must exist such that holds. Let be the bijection obtained by replacing a mapping from to with that from to . Clearly, gives an inclusion mapping. Repeatedly applying this procedure, we can obtain a bijection satisfying all conditions. ∎
Since is included in if and only if there exists such that , we focus on how to decide if assuming that whether holds is known for all with , , and . We have:
Suppose that can be decided in time. Then the unordered tree inclusion problem can be solved in time by using a bottom-up dynamic programming procedure.
3 An -Time Algorithm
The crucial parts of the algorithm in  are the definition of and its computation. (for fixed ) was defined by
where is the forest induced by nodes in and their descendants and means that forest is included in (i.e., can be obtained from by node insertion operations). Clearly, the size of is no greater than . In the algorithm of , the following operation is performed from left to right among the children of :
which causes an factor because it examines set pairs. Therefore, we need to avoid this kind of operation.
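The cost of examining set pairs can be illustrated with a small sketch. It assumes, purely for illustration, that each family of child subsets is stored as a Python set of bitmasks and that two families are merged by taking all pairwise unions; the function name `combine_families` is hypothetical and the exact operation of the earlier algorithm may differ in detail.

```python
def combine_families(fam_a, fam_b):
    """Merge two families of subsets (bitmasks) by pairwise union.

    Returns the merged family and the number of pairs examined.
    """
    pairs_examined = 0
    result = set()
    for a in fam_a:
        for b in fam_b:
            pairs_examined += 1
            result.add(a | b)   # union of the two child subsets
    return result, pairs_examined


# With d children there are up to 2**d subsets per family,
# so a single merge may examine up to (2**d)**2 pairs.
d = 3
all_subsets = set(range(1 << d))
merged, pairs = combine_families(all_subsets, all_subsets)
print(len(merged), pairs)
```

The quadratic blow-up in the number of examined pairs is what the approach in this section is designed to avoid.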
Given an unordered tree , we fix any left-to-right ordering of its nodes. Then, for any two nodes that do not have any ancestor-descendant relationship, either “ is left of ” or “ is right of ” is uniquely determined. We denote “ is left of ” by .
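One standard way to realize such an ordering is to number the nodes in preorder and postorder with respect to a fixed child order. The sketch below assumes a tree given as a dictionary from each node to its list of children; the names `number_nodes` and `is_left_of` are hypothetical.

```python
def number_nodes(children, root):
    """Return preorder and postorder numbers for a fixed child order."""
    pre, post = {}, {}
    counters = [0, 0]  # next preorder number, next postorder number

    def visit(u):
        pre[u] = counters[0]
        counters[0] += 1
        for c in children.get(u, []):
            visit(c)
        post[u] = counters[1]
        counters[1] += 1

    visit(root)
    return pre, post


def is_left_of(u, v, pre, post):
    """True iff u and v are unrelated by ancestry and u is left of v."""
    return pre[u] < pre[v] and post[u] < post[v]


children = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
pre, post = number_nodes(children, "r")
print(is_left_of("a2", "b1", pre, post))
```

For two nodes in an ancestor-descendant relationship, the preorder and postorder comparisons disagree, so `is_left_of` returns False in both directions.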
We focus on deciding if holds for fixed . Assume w.l.o.g. that . For simplicity, we assume until the end of this section that does not hold for any . For any , define by
where is the forest induced by nodes in and their descendants. Note that always holds. The definition of leads to a dynamic programming procedure for its computation. We explain and related concepts using an example in Figure 1. Suppose that we have the following relations.
Then, the following holds.
Let and . Let be an injection from to giving an inclusion mapping for . Let , where . Then, and hold for all . Furthermore, holds for . Therefore, .
It is straightforward to see that does not contain any element not in . ∎
We construct a DAG (directed acyclic graph) from (see also Figure 2). is defined by , and is defined by . Then, we traverse so that node is visited only after all of its predecessors are visited. Let denote the set of the predecessors of (i.e., is the set of nodes left of ). Recall that .
Then, we compute by the following procedure, which is referred to as AlgInc1.
If , we let . Finally, we let . Then, iff and have the same label and .
AlgInc1 correctly computes s in time.
Since it is straightforward to prove the correctness, we analyze the time complexity. The sizes of , s, and s are , and computation of each of such sets can be done in time. Since the number of s and s are , the total computation time is . ∎
If there exist such that , we treat each element in , s, and s as a multiset where each pair of and such that are identified and the multiplicity of is bounded by the number of s isomorphic to . Then, the size of each multiset is at most and the number of different multisets is not greater than . Therefore, the same time complexity result holds. This discussion can also be applied to the following sections.
AlgInc1 performs many redundant computations. In order to compute , we do not need to consider all s that are left of . Instead, we construct a tree from a given by the following rule (see also Figure 3):
for each pair of consecutive siblings in , add a new sibling (leaf) between and .
Newly added nodes are called virtual nodes. We construct a DAG on by: iff one of the following holds
is a virtual node, and is in the rightmost path of , where .
is a virtual node, and is in the leftmost path of , where .
Then, we can use the same algorithm as AlgInc1, except that is replaced by . We denote the resulting algorithm by AlgInc2.
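The rule for adding virtual nodes can be sketched as follows. This covers only the insertion of a virtual leaf between each pair of consecutive siblings, not the construction of the DAG edges. The dictionary representation and the name `add_virtual_nodes` are assumptions made for illustration, and virtual nodes are represented by tuples of the form `("virtual", k)`.

```python
def add_virtual_nodes(children, root):
    """Return a new child map with a virtual leaf between consecutive siblings."""
    new_children = {}
    count = 0
    stack = [root]
    while stack:
        u = stack.pop()
        kids = children.get(u, [])
        merged = []
        for i, c in enumerate(kids):
            if i > 0:
                # One new leaf between the previous sibling and this one.
                merged.append(("virtual", count))
                count += 1
            merged.append(c)
        new_children[u] = merged
        stack.extend(kids)  # virtual nodes are leaves, so only original nodes are expanded
    return new_children


tree = {"r": ["a", "b", "c"], "a": ["x", "y"]}
with_virtual = add_virtual_nodes(tree, "r")
print(with_virtual["r"])
```

A node with k children receives k - 1 virtual children, so the size of the tree at most doubles.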
AlgInc2 correctly computes s in time.
Since it is straightforward to see the correctness, we analyze the time complexity.
We can see that is since
Each non-virtual node in has at most one incoming edge and at most one outgoing edge.
Each edge connects a non-virtual node and a virtual node.
Therefore, the total number of set operations is reduced to , from which the lemma follows. ∎
From Proposition 2, we have:
Unordered tree inclusion can be solved in time.
If we analyze the time complexity carefully, we can see that the total time complexity is , where is the height of because each is involved in computation of only for .
4 NP-Hardness of Unordered Tree Inclusion for Pattern Trees with Unique Leaf Labels
For any node-labeled tree , let be the height of and let be the set of all leaf labels in . For any , let be the number of times that occurs in , and define .
The decision version of the tree inclusion problem is to determine whether can be obtained from by applying node insertion operations. Kilpeläinen and Mannila  proved that the decision version of unordered tree inclusion is NP-complete by reducing from Satisfiability. In their reduction, the clauses in a given instance of Satisfiability are represented by node labels in the constructed trees; in particular, for every clause , each literal in introduces one node in whose node label represents . By modifying their reduction to assume that each clause contains exactly three literals (i.e., using 3SAT instead of Satisfiability), we immediately have:
The decision version of the unordered tree inclusion problem is NP-complete even if restricted to instances where , , , and .
In Kilpeläinen and Mannila’s reduction, the labels assigned to the internal nodes of are significant. Below, we consider the computational complexity of the special case of the problem where all internal nodes in and have the same label, or equivalently, where only the leaves are labeled.
The following problem is known to be NP-complete :
Exact Cover by 3-Sets (X3C): Given a set and a collection of subsets of where for every and every belongs to at most three subsets in , does admit an exact cover, i.e., is there a such that and ?
From here on, assume w.l.o.g. that in any given instance of X3C, is an integer and each belongs to at least one subset in .
The decision version of the unordered tree inclusion problem is NP-complete even if restricted to instances where , , , , and all internal nodes have the same label.
Membership in NP follows from the proof of Theorem 7.3 in .
To prove NP-hardness, we reduce from X3C. Given an instance of X3C, construct two node-labeled, unordered trees and as follows. (Refer to Figure 4 for an example of the reduction.) Let be a set of elements different from , define , and let be an element not in . For any , let denote the height- unordered tree consisting of a root node labeled by whose children are bijectively labeled by . Construct by creating a node labeled by and attaching the roots of the following trees as children of :
for each ,
Construct by taking a copy of and then, for each , attaching the root of as a child of the root of . Note that by construction, , , , , and hold.
We now show that is included in if and only if admits an exact cover. First, suppose that admits an exact cover . Then is included in because all leaves of labeled by can be mapped to the -subtrees in for , while of the leaves labeled by can be mapped to the remaining -subtrees and each of the other leaves with labels from can be mapped to one of the - and -subtrees. Next, suppose that is included in . By the definitions of and , each subtree rooted at a child of can have at most one leaf with a label in or at most three leaves with labels in mapped to it from . Since but there are only subtrees in of the form and , at least subtrees of the form must have a leaf with a label from mapped to them. This means that at most subtrees of the form remain for the leaves in labeled by to be mapped to, and hence, exactly such subtrees have to be used. Denote these subtrees by , , , . Then is an exact cover of . ∎
5 A Polynomial-Time Algorithm for the Case of
In the following, we require that each leaf of has a unique label and that it appears at no more than leaves in . We denote this number by .
We write if is included in under the condition that corresponds to , where denotes the subtree of induced by and its descendants. Then, the following (#) is the crucial part (exponential-time part):
Assume w.l.o.g. that has the same label as . Let be the children of . Then, if and only if holds for all for some nodes each pair of which does not have an ancestor-descendant relationship.
From the assumption, we have the following observation.
Suppose that has a leaf labeled with . If , then is an ancestor of a leaf (or leaf itself) with label .
From (#) and this proposition, for each , we only need to consider minimal nodes s such that , where ‘minimal’ means that there is no descendant of such that . It is easy to see that the number of such minimal nodes is at most for each if . If is such a minimal node, we write .
As illustrated in Figure 5, we can have a chain of choices of the subtrees of in . (E.g., if we choose , then we cannot choose . Therefore, we need to choose . If we choose , then we cannot choose . Etc.) This suggests that 2-SAT may be useful. We have:
Unordered tree inclusion can be solved in polynomial time if .
We prove the theorem by using a reduction to 2-SAT. Let . Assume by induction that we know . We define by
See Figure 6 for an illustration. We assume w.l.o.g. that for all . Associate a Boolean variable to each element and include the following constraints:
and for each , where ().
It means that is mapped to exactly one of or .
(Recall that we assume for all .)
for each pair such that holds or and have an ancestor-descendant relationship.
It means that the condition of (#) must be satisfied.
Then, this 2-SAT instance is satisfiable iff holds. Since 2-SAT is solvable in polynomial time, we have the theorem. ∎
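For completeness, the following is a minimal sketch of a standard polynomial-time 2-SAT procedure based on strongly connected components of the implication graph (Kosaraju's method). It is a generic solver, not code from this paper. The literal encoding (a positive integer i for a variable and -i for its negation) and the name `solve_2sat` are assumptions made for illustration. The example encodes "exactly one of two candidates" with the clauses (x or y) and (not x or not y), and a conflict between two candidates with a single clause (not x or not y).

```python
def solve_2sat(num_vars, clauses):
    """Return a satisfying assignment (list of bools) or None if unsatisfiable.

    Variables are 1..num_vars; a clause is a pair of nonzero integer literals.
    """
    n = 2 * num_vars

    def idx(lit):
        # Node 2*v is the variable, node 2*v + 1 is its negation.
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for src, dst in ((idx(-a), idx(b)), (idx(-b), idx(a))):
            graph[src].append(dst)
            rgraph[dst].append(src)

    # Pass 1: iterative DFS recording nodes in order of completion.
    seen = [False] * n
    order = []
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(graph[u]):
                stack.append((u, i + 1))
                w = graph[u][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(u)

    # Pass 2: label components on the reversed graph; labels follow a
    # topological order of the condensation of the implication graph.
    comp = [-1] * n
    label = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = label
        stack = [s]
        while stack:
            u = stack.pop()
            for w in rgraph[u]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    assignment = []
    for v in range(num_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # a variable and its negation are equivalent
        assignment.append(comp[2 * v] > comp[2 * v + 1])
    return assignment


# Two pattern children with two candidate nodes each: variables (1, 2) and (3, 4).
# Candidate 1 conflicts with both candidates 3 and 4.
clauses = [(1, 2), (-1, -2), (3, 4), (-3, -4), (-1, -3), (-1, -4)]
result = solve_2sat(4, clauses)
print(result)
```

In this example, candidate 1 cannot be chosen, so any satisfying assignment sets variable 2 to true.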
6 An -time Algorithm for the Case of
In this section, we present an time algorithm for the case of , where is the maximum outdegree of , , and .
The basic strategy is to use dynamic programming: decide whether in a bottom-up way. Suppose that has a set of children . Since we use dynamic programming, we can assume that is known for all and for all . We define by
The crucial task of the dynamic programming procedure is to find an injective mapping from to such that holds for all () and there is no ancestor/descendant relationship between any and (). If this task can be performed in time, the total complexity will be . We assume w.l.o.g. that is given as a set of mapping pairs. For , we define by
where (resp., ) denotes the set of ancestors (resp., descendants) of in where (resp., ).
Recall that is defined by
where . Let (resp., ) be the number of s such that (resp., ) (see also Figure 6). We assume w.l.o.g. that because means that is uniquely determined. From Theorem 3, we can see the following if there is no pair such that , , and .
The problem can be solved in time:
For each such that (i.e., ), we choose (i.e., ) or not. Thus, there exist possibilities. After each choice, there is no such that and Theorem 3 can be applied.
The problem can also be solved in time:
For each with (i.e., ), we choose or not. Thus, there are possibilities and after each choice, each with is removed or the problem can be reduced to bipartite matching as shown in Figure 7.
This means that the problem can be solved in time. We denote the condition (i.e., the ‘if’ part of the above) and this algorithm by (##) and ALG-##, respectively. Therefore, the crucial point is how to (recursively) remove pairs such that , , and .
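The bipartite matching step mentioned above can be carried out with any standard matching algorithm. The following is a minimal sketch of the classical augmenting-path method, given only as a generic illustration. The adjacency-list input format and the name `max_bipartite_matching` are assumptions; in the present setting the left side would correspond to children of the pattern node and the right side to candidate text nodes.

```python
def max_bipartite_matching(adj, num_right):
    """Return (matching size, match_right) for a bipartite graph.

    adj[u] lists the right-side vertices adjacent to left vertex u;
    match_right[v] is the left vertex matched to v, or -1.
    """
    match_right = [-1] * num_right

    def try_augment(u, visited):
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # Use v if it is free, or if its current partner can be re-matched.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


adj = [[0, 1], [0], [1, 2]]
size, match_right = max_bipartite_matching(adj, 3)
print(size, match_right)
```

Every left vertex is matched exactly when the returned size equals the number of left vertices, which corresponds to the required injective mapping.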