New Algorithms for Unordered Tree Inclusion

Tatsuya Akutsu, Jesper Jansson, Ruiming Li, Atsuhiro Takasu,
and Takeyuki Tamura

Abstract

The tree inclusion problem is, given two node-labeled trees $P$ and $T$ (the “pattern tree” and the “text tree”), to locate every minimal subtree in $T$ (if any) that can be obtained by applying a sequence of node insertion operations to $P$. The ordered tree inclusion problem is known to be solvable in polynomial time while the unordered tree inclusion problem is NP-hard. The currently fastest algorithm for the latter is from 1995 and runs in $O(\mathrm{poly}(m,n) \cdot 2^{2d})$ time, where $m$ and $n$ are the sizes of the pattern and text trees, respectively, and $d$ is the degree of the pattern tree. Here, we develop a new algorithm that improves the exponent $2d$ to $d$ by considering a particular type of ancestor-descendant relationships and applying dynamic programming, thus reducing the time complexity to $O(\mathrm{poly}(m,n) \cdot 2^{d})$. We then study restricted variants of the unordered tree inclusion problem where the number of occurrences of different node labels and/or the input trees’ heights are bounded and show that although the problem remains NP-hard in many such cases, if the leaves of $P$ are distinctly labeled and each label occurs at most $c$ times in $T$, then it can be solved in polynomial time for $c = 2$ and in $O(\mathrm{poly}(m,n) \cdot 1.8^{d})$ time for $c = 3$.

  1. Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 6110011, Japan.
    {takutsu,rmli,tamura}@kuicr.kyoto-u.ac.jp

  2. Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China.
    jesper.jansson@polyu.edu.hk

  3. Content and Media Science Research Division, National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, 101-8430, Japan.
    takasu@nii.ac.jp

keywords: algorithm, tree inclusion, unordered tree, dynamic programming, ancestor-descendant relationship

1 Introduction

Tree pattern matching and measuring the similarity of trees are classic problem areas in theoretical computer science. One intuitive and extensively studied measure of the similarity between two rooted, node-labeled trees  and  is the tree edit distance, defined as the length of a shortest sequence of node insertion, node deletion, and node relabeling operations that transforms  into . When and  are ordered trees, the tree edit distance can be computed in polynomial time. The first algorithm to achieve this bound ran in time [17], where is the total number of nodes in  and , and it was gradually improved upon until Demaine et al. [8] presented an -time algorithm thirty years later which was proved to be worst-case optimal among a reasonable class of algorithms. On the other hand, the tree edit distance problem is NP-hard for unordered trees [21]. It is in fact MAX SNP-hard even for binary trees in the unordered case [20], which implies that it is unlikely to admit a polynomial-time approximation scheme. Akutsu et al. [1, 3] have developed efficient exponential-time algorithms for this problem variant. As for parameterized algorithms, Shasha et al. [16] developed an -time algorithm for the problem, where and are the number of leaves in and , respectively, and an -time algorithm for the unit-cost edit operation model, where is the edit distance, was given in [2]. See [4] for a survey of many other related results.

An important special case of the tree edit distance problem known as the tree inclusion problem is obtained when only node insertion operations are allowed. This problem has applications to structured text databases and natural language processing [5, 11, 18]. Here, we assume the following formulation of the problem: given a “text tree”  and a “pattern tree” , locate every minimal subtree in  (if any) that can be obtained by applying a sequence of node insertion operations to . (Equivalently, one may define the tree inclusion problem so that only node deletion operations on  are allowed.) For unordered trees, Kilpeläinen and Mannila [11] proved the problem to be NP-hard in general but solvable in polynomial time when the degree of the pattern tree is bounded from above by a constant. More precisely, the running time of their algorithm is time, where , , and is the degree of . Bille and Gørtz [5] gave a fast algorithm for the case of ordered trees, and Valiente [18] developed an efficient algorithm for a constrained version of the unordered case. Also note that the special case of the tree inclusion problem where node insertion operations are only allowed to insert new leaves corresponds to a subtree isomorphism problem, which can be solved in polynomial time for unordered trees [14]. The extended tree inclusion problem, proposed in [15], is an optimization problem designed to make the problem more useful for practical tree pattern matching applications, e.g., involving glycan data from the KEGG database [10], weblogs data [19], and bibliographical data from ACM, DBLP, and Google Scholar [12]. 
This problem asks for an optimal connected subgraph of  (if any) that can be obtained by performing node insertion operations as well as node relabeling operations to  while allowing non-uniform costs to be assigned to the different node operations; it was shown in [15] how to solve the unrooted version in time exponential in  and how a further extension of the problem that also allows at most  node deletion operations can be solved by an algorithm whose running time depends on .

1.1 Practical Applications

With the rapid advance of AI technology, matching methods for knowledge bases are becoming more important. As a fundamental technique for searching knowledge bases, subtree similarity search has been studied by researchers in the database community. For example, Cohen and Or proposed a subtree similarity search algorithm for various distance functions [7], while Chang et al. proposed a top-k tree matching algorithm [6]. In the field of natural language processing (NLP), researchers are incorporating deep learning techniques into NLP problems and developing methods for processing and matching parse trees and dependency trees [13]. Bibliographic matching is one of the most popular real-world applications of matching problems [12]. In most cases, a single article has at most two or three versions, and it is very rare for a single article to have two co-authors with the same name. Therefore, it may be reasonable to assume that the leaves of the pattern tree are distinctly labeled and that each label occurs at most a small constant number of times (two or three) in the text tree.

1.2 New Results and Organization of the Paper

We improve the exponential contribution to the time complexity of the fastest known algorithm for the unordered tree inclusion problem (Kilpeläinen and Mannila’s algorithm from 1995 [11]) from $2^{2d}$ to $2^{d}$, where $d$ is the maximum degree of the pattern tree, so that the time complexity becomes $O(\mathrm{poly}(m,n) \cdot 2^{d})$, where $m$ and $n$ are the sizes of the pattern and text trees. We then study the problem’s computational complexity for several restricted cases (see Table 1 for a summary) and give a polynomial-time algorithm for when the leaves in the pattern tree are distinctly labeled and every label appears at most twice in the text tree. Finally, we derive an $O(\mathrm{poly}(m,n) \cdot 1.8^{d})$-time algorithm for the NP-hard case where the leaves in the pattern tree are distinctly labeled and each label appears at most three times in the text tree.

Restriction Labels on Complexity Reference
, , , all nodes NP-hard Corollary 1
, , , leaves NP-hard Theorem 2
, all nodes P Theorem 3
, all nodes time Theorem 4
Table 1: The computational complexity of some special cases of the unordered tree inclusion problem. For any tree ,  denotes the height of  and the maximum number of times that any node label occurs in . As indicated in the table, either all nodes or only the leaves are labeled (the former is harder since it generalizes the latter). Note that the last case is also NP-hard as it is a generalization of the first two cases.

The paper is organized as follows. Section 2 defines the unordered tree inclusion problem and the concept of minimality, and explains the basic ideas related to the ancestor-descendant relationship. In Section 3, we utilize the ancestor-descendant relationships and dynamic programming to obtain the exponential-factor speedup. Section 4 presents the NP-hardness results for the special cases listed in Table 1. Finally, Sections 5 and 6 develop the polynomial-time and exponential-time algorithms for the cases where the leaves in the pattern tree are distinctly labeled and each label appears at most two or three times, respectively, in the text tree.

2 Preliminaries

From here on, all trees are rooted, unordered, and node-labeled. Let $T$ be a tree. A node insertion operation on $T$ is an operation that creates a new node $v$ having any label and then: (i) attaches $v$ as a child of some node $u$ currently in $T$ and makes $v$ become the parent of a (possibly empty) subset of the children of $u$ instead of $u$; or (ii) makes the current root of $T$ become a child of $v$ and lets $v$ become the new root. For any two trees $T_1$ and $T_2$, we say that $T_1$ is included in $T_2$ if there exists a sequence $S$ of node insertion operations such that applying $S$ to $T_1$ yields $T_2$.
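The two cases of the node insertion operation can be illustrated with a minimal Python sketch. The class `Node` and the helpers `insert_below`, `insert_above_root`, and `canonical` are hypothetical names introduced only for this illustration; they are not part of the paper.

```python
class Node:
    """A node of a rooted, unordered, node-labeled tree (illustrative only)."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])


def insert_below(parent, label, adopted):
    """Case (i): add a new child of `parent` that adopts some of its children."""
    # Every adopted node must currently be a child of `parent`.
    assert all(any(c is a for c in parent.children) for a in adopted)
    adopted_ids = {id(a) for a in adopted}
    new = Node(label, adopted)
    parent.children = [c for c in parent.children if id(c) not in adopted_ids]
    parent.children.append(new)
    return new


def insert_above_root(root, label):
    """Case (ii): the new node becomes the root; the old root is its only child."""
    return Node(label, [root])


def canonical(node):
    """Order-independent encoding, useful for comparing unordered trees."""
    return (node.label, tuple(sorted(canonical(c) for c in node.children)))
```

Starting from a pattern such as a(b, c), a sequence of such calls builds a larger tree that includes the pattern by definition; `canonical` ignores the sibling order, which matches the unordered setting.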

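The set of vertices in a tree $T$ is denoted by $V(T)$. A mapping between two trees $T_1$ and $T_2$ is a subset $M \subseteq V(T_1) \times V(T_2)$ such that for every $(u_1, v_1), (u_2, v_2) \in M$, it holds that: (i) $u_1 = u_2$ if and only if $v_1 = v_2$; and (ii) $u_1$ is an ancestor of $u_2$ if and only if $v_1$ is an ancestor of $v_2$. $T_1$ is included in $T_2$ if and only if there is a mapping $M$ between $T_1$ and $T_2$ such that $|M| = |V(T_1)|$ and $u$ and $v$ have the same node label for every $(u, v) \in M$ [17].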
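The mapping characterization can be checked directly on very small inputs. The following brute-force Python sketch is not one of the algorithms of this paper; it simply tries every injective assignment of pattern nodes to text nodes and tests the label and ancestor conditions. Trees are assumed to be given as nested `(label, [children])` pairs, and the function names are hypothetical.

```python
from itertools import permutations


def index_tree(tree):
    """Flatten a nested (label, [children]) tree into labels and ancestor sets."""
    labels, ancestors = [], []

    def visit(node, anc):
        label, children = node
        me = len(labels)
        labels.append(label)
        ancestors.append(frozenset(anc))
        for child in children:
            visit(child, anc + [me])

    visit(tree, [])
    return labels, ancestors


def is_included(pattern, text):
    """Exhaustive test of the mapping conditions; exponential, tiny inputs only."""
    p_lab, p_anc = index_tree(pattern)
    t_lab, t_anc = index_tree(text)
    m = len(p_lab)
    # permutations(...) enumerates injective maps, which gives condition (i).
    for image in permutations(range(len(t_lab)), m):
        if any(p_lab[u] != t_lab[image[u]] for u in range(m)):
            continue  # labels must agree
        # Condition (ii): ancestorship is preserved in both directions.
        if all((u1 in p_anc[u2]) == (image[u1] in t_anc[image[u2]])
               for u1 in range(m) for u2 in range(m) if u1 != u2):
            return True
    return False
```

Such a checker is only useful as a reference for testing faster procedures, since it enumerates all injective assignments of pattern nodes to text nodes.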

In the tree inclusion problem, the input is two trees  and  (also referred to as the “pattern tree” and the “text tree”), and the objective is to determine if is included in . Define and , and denote the maximum outdegree of . For any node , let and denote its label and the set of its children. Also let and denote the sets of strict ancestors and strict descendants of , respectively, i.e., where itself is excluded from these sets. For a tree , and denote its root and the set of nodes in . For a node in a tree , is the subtree of induced by . We write if is included in under the condition that corresponds to . For two trees and , denotes that is isomorphic to .

The following concept plays a key role in our algorithm.

Definition 1.

We say that minimally includes (denoted as ) if holds and there is no such that .

Proposition 1.

Let . holds if and only if the following conditions are satisfied.

  • .

  • has a set of descendants such that for all .

  • There exists a bijection from to such that holds for all .

Proof.

Conditions (1) and (2) are obvious. To prove (3), suppose there exists a bijection from to such that holds for all and does not hold for some . Then, there must exist such that holds. Let be the bijection obtained by replacing a mapping from to with that from to . Clearly, gives an inclusion mapping. Repeatedly applying this procedure, we can obtain a bijection satisfying all conditions. ∎

Since is included in if and only if there exists such that , we focus on how to decide if assuming that whether holds is known for all with , , and . We have:

Proposition 2.

Suppose that can be decided in time. Then the unordered tree inclusion problem can be solved in time by using a bottom-up dynamic programming procedure.

3 An -Time Algorithm

The crucial parts of the algorithm in [11] are the definition of and its computation. (for fixed ) was defined by

where is the forest induced by nodes in and their descendants and means that forest is included in (i.e., can be obtained from  by node insertion operations). Clearly, the size of is no greater than . In the algorithm of [11], the following operation is performed from left to right among the children of :

which causes an factor because it examines set pairs. Therefore, we need to avoid this kind of operation.
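To make the source of this factor concrete, the following Python sketch implements a plain subset dynamic program in the spirit of [11]. It is an illustration written for this discussion, not the exact procedure of [11] and not the improved algorithm of this section. Trees are assumed to be nested `(label, [children])` pairs, subsets of the children of a pattern node are encoded as bitmasks, and all names are hypothetical. The set comprehension that merges `below` with `anywhere[id(w)]` is the step that examines pairs of subsets.

```python
def postorder(tree):
    """Yield the nodes of a nested (label, [children]) tree, children first."""
    for child in tree[1]:
        yield from postorder(child)
    yield tree


def included(pattern, text):
    """Decide unordered tree inclusion by a subset dynamic program."""
    rooted = {}  # (id(u), id(v)) -> True if P(u) embeds in T(v) with u mapped to v
    text_nodes = list(postorder(text))
    for u in postorder(pattern):
        kids = u[1]
        full = (1 << len(kids)) - 1
        anywhere = {}  # id(v) -> masks of children of u whose forest embeds in T(v)
        for v in text_nodes:
            below = {0}  # masks embeddable strictly below v
            for w in v[1]:
                # Pairwise combination of disjoint subsets: the costly step.
                below = {a | b for a in below for b in anywhere[id(w)]
                         if a & b == 0}
            rooted[id(u), id(v)] = (u[0] == v[0] and full in below)
            # A single child of u may also be mapped onto v itself.
            singles = {1 << i for i, c in enumerate(kids)
                       if rooted[id(c), id(v)]}
            anywhere[id(v)] = below | singles
    return any(rooted[id(pattern), id(v)] for v in text_nodes)
```

Both operands of the merge can contain up to $2^{d}$ masks for a pattern node with $d$ children, so one merge may inspect on the order of $2^{2d}$ pairs.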

Given an unordered tree , we fix any left-to-right ordering of its nodes. Then, for any two nodes that do not have any ancestor-descendant relationship, either “ is left of ” or “ is right of ” is uniquely determined. We denote “ is left of ” by .

We focus on deciding if holds for fixed . Assume w.l.o.g. that . For simplicity, we assume until the end of this section that does not hold for any . For any , define by

For example, , , and in Figure 1. For any , denotes the set of nodes in each of which is left of (see Figure 1 for an example). Then, we define by

where is the forest induced by nodes in and their descendants. Note that always holds. The definition of leads to a dynamic programming procedure for its computation. We explain and related concepts using an example in Figure 1. Suppose that we have the following relations.

Then, the following holds.

Figure 1: Example for explaining the key idea. A triangle attached to means that holds. Note that triangle appears at , , and . However, does not hold since it does not satisfy the minimality condition. Therefore, is never selected for matching to in AlgInc1: if we need to match to , we can instead use a matching between and .

Proposition 3.

.

Proof.

Let and . Let be an injection from to giving an inclusion mapping for . Let , where . Then, and hold for all . Furthermore, holds for . Therefore, .

It is straightforward to see that does not contain any element not in . ∎

We construct a DAG (directed acyclic graph) from (see also Figure 2). is defined by , and is defined by . Then, we traverse so that node is visited only after all of its predecessors have been visited. Let denote the set of the predecessors of (i.e., is the set of nodes left of ). Recall that .

Figure 2: Example of a DAG constructed from , where . is shown by dashed arrows and is shown by bold lines.

Then, we compute by the following procedure, which is referred to as AlgInc1.

  • .

  • .

If , we let . Finally, we let . Then, iff and have the same label and .

Lemma 1.

AlgInc1 correctly computes s in time.

Proof.

Since it is straightforward to prove the correctness, we analyze the time complexity. The sizes of , s, and s are , and computation of each of such sets can be done in time. Since the number of s and s are , the total computation time is . ∎

If there exist such that , we treat each element in , s, and s as a multiset where each pair of and such that are identified and the multiplicity of is bounded by the number of s isomorphic to . Then, the size of each multiset is at most and the number of different multisets is not greater than . Therefore, the same time complexity result holds. This discussion can also be applied to the following sections.

AlgInc1 performs many redundant computations. In order to compute , we do not need to consider all s that are left of . Instead, we construct a tree from a given by the following rule (see also Figure 3):

for each pair of consecutive siblings in , add a new sibling (leaf) between and .

Newly added nodes are called virtual nodes. We construct a DAG on by: iff one of the following holds

  • is a virtual node, and is in the rightmost path of , where .

  • is a virtual node, and is in the leftmost path of , where .

Then, we can use the same algorithm as AlgInc1, except that is replaced by . We denote the resulting algorithm by AlgInc2.
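The first step of this construction, adding a virtual leaf between every pair of consecutive siblings, can be sketched as follows. This is a minimal illustration assuming trees are nested `(label, [children])` pairs whose child lists carry the fixed left-to-right order; `VIRTUAL`, `add_virtual_nodes`, and `count_nodes` are hypothetical names, and the DAG edges are not built here.

```python
VIRTUAL = object()  # hypothetical marker used as the label of virtual nodes


def add_virtual_nodes(tree):
    """Return a copy of the tree with a virtual leaf between consecutive siblings."""
    label, children = tree
    new_children = []
    for i, child in enumerate(children):
        if i > 0:
            new_children.append((VIRTUAL, []))  # leaf between child i-1 and child i
        new_children.append(add_virtual_nodes(child))
    return (label, new_children)


def count_nodes(tree):
    """Number of nodes in a nested (label, [children]) tree."""
    return 1 + sum(count_nodes(c) for c in tree[1])
```

A node with $k$ children receives $k - 1$ virtual leaves, so the number of virtual nodes is smaller than the number of original nodes and the tree at most doubles in size.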

Lemma 2.

AlgInc2 correctly computes s in time.

Proof.

Since it is straightforward to see the correctness, we analyze the time complexity.

We can see that is since

  • is ,

  • Each non-virtual node in has at most one incoming edge and at most one outgoing edge,

  • Each edge connects a non-virtual node and a virtual node.

Therefore, the total number of set operations is reduced to , from which the lemma follows. ∎

From Proposition 2, we have:

Theorem 1.

Unordered tree inclusion can be solved in time.

Figure 3: Example of and . is shown by dashed arrows.

If we analyze the time complexity carefully, we can see that the total time complexity is , where is the height of because each is involved in computation of only for .

4 NP-Hardness of Unordered Tree Inclusion for Pattern Trees with Unique Leaf Labels

For any node-labeled tree , let be the height of  and let be the set of all leaf labels in . For any , let be the number of times that  occurs in , and define .

The decision version of the tree inclusion problem is to determine whether  can be obtained from  by applying node insertion operations. Kilpeläinen and Mannila [11] proved that the decision version of unordered tree inclusion is NP-complete by reducing from Satisfiability. In their reduction, the clauses in a given instance of Satisfiability are represented by node labels in the constructed trees; in particular, for every clause , each literal in  introduces one node in  whose node label represents . By modifying their reduction to assume that each clause contains exactly three literals (i.e., using 3SAT instead of Satisfiability), we immediately have:

Corollary 1.

The decision version of the unordered tree inclusion problem is NP-complete even if restricted to instances where , , , and .

In Kilpeläinen and Mannila’s reduction, the labels assigned to the internal nodes of  are significant. Below, we consider the computational complexity of the special case of the problem where all internal nodes in  and  have the same label, or equivalently, where only the leaves are labeled.

The following problem is known to be NP-complete [9]:

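Exact Cover by 3-Sets (X3C): Given a set $U$ and a collection $\mathcal{S}$ of subsets of $U$ where $|S| = 3$ for every $S \in \mathcal{S}$ and every $u \in U$ belongs to at most three subsets in $\mathcal{S}$, does $(U, \mathcal{S})$ admit an exact cover, i.e., is there a $\mathcal{S}' \subseteq \mathcal{S}$ such that $|\mathcal{S}'| = |U|/3$ and $\bigcup_{S \in \mathcal{S}'} S = U$?

From here on, assume w.l.o.g. that in any given instance of X3C, $|U|/3$ is an integer and each $u \in U$ belongs to at least one subset in $\mathcal{S}$.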
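For readers who want to experiment with small X3C instances before following the reduction, an exhaustive checker can be sketched as below. It is only a brute-force illustration of the problem definition (the function name is hypothetical), and its running time is exponential in the number of subsets.

```python
from itertools import combinations


def exact_cover_by_3_sets(universe, triples):
    """Return a list of triples forming an exact cover, or None if none exists."""
    universe = set(universe)
    if len(universe) % 3:
        return None  # an exact cover by 3-sets needs |U| divisible by 3
    # |U|/3 triples whose union is U are necessarily pairwise disjoint.
    for chosen in combinations(triples, len(universe) // 3):
        if set().union(*chosen) == universe:
            return list(chosen)
    return None
```

Because each subset has exactly three elements, checking that the union of the chosen subsets equals the whole set is enough to guarantee that they are disjoint.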

Theorem 2.

The decision version of the unordered tree inclusion problem is NP-complete even if restricted to instances where , , , , and all internal nodes have the same label.

Proof.

Membership in NP follows from the proof of Theorem 7.3 in [11].

To prove NP-hardness, we reduce from X3C. Given an instance of X3C, construct two node-labeled, unordered trees  and  as follows. (Refer to Figure 4 for an example of the reduction.) Let be a set of elements different from , define , and let be an element not in . For any , let denote the height- unordered tree consisting of a root node labeled by  whose children are bijectively labeled by . Construct by creating a node  labeled by  and attaching the roots of the following trees as children of :

  • for each

  • for each ,

  • for each

Construct by taking a copy of  and then, for each , attaching the root of  as a child of the root of . Note that by construction, , , , , and hold.

We now show that is included in  if and only if admits an exact cover. First, suppose that admits an exact cover . Then is included in  because all leaves of  labeled by  can be mapped to the -subtrees in  for , while of the leaves labeled by can be mapped to the remaining -subtrees and each of the other leaves with labels from  can be mapped to one of the - and -subtrees. Next, suppose that is included in . By the definitions of  and , each subtree rooted at a child of  can have at most one leaf with a label in  or at most three leaves with labels in  mapped to it from . Since but there are only subtrees in  of the form and , at least subtrees of the form must have a leaf with a label from mapped to them. This means that at most subtrees of the form remain for the  leaves in  labeled by  to be mapped to, and hence, exactly such subtrees have to be used. Denote these subtrees by , , , . Then is an exact cover of . ∎

Figure 4: Illustrating the proof of Theorem 2. Suppose that and with is a given instance of X3C. Applying the reduction yields the shown trees  and . Here,  is included in  because all the leaves of  can be mapped to leaves in  as indicated by the rectangles, which gives the exact cover for .

5 A Polynomial-Time Algorithm for the Case of

In the following, we require that each leaf of has a unique label and that it appears at no more than leaves in . We denote this number by .

We write if is included in under the condition that corresponds to , where denotes the subtree of induced by and its descendants. Then, the following (#) is the crucial part (exponential-time part):

Assume w.l.o.g. that has the same label as . Let be the children of . Then, if and only if holds for all for some nodes each pair of which does not have an ancestor-descendant relationship.

From the assumption, we have the following observation.

Proposition 4.

Suppose that has a leaf labeled with . If , then is an ancestor of a leaf (or leaf itself) with label .

From (#) and this proposition, for each , we only need to consider minimal nodes s such that , where ‘minimal’ means that there is no descendant of such that . It is easy to see that the number of such minimal nodes is at most for each if . If is such a minimal node, we write .

As illustrated in Figure 5, we can have a chain of choices of the subtrees of in . (E.g., if we choose , then we cannot choose . Therefore, we need to choose . If we choose , then we cannot choose . Etc.) This suggests that 2-SAT may be useful. We have:

Theorem 3.

Unordered tree inclusion can be solved in polynomial time if .

Proof.

We prove the theorem by using a reduction to 2-SAT. Let . Assume by induction that we know . We define by

See Figure 6 for an illustration. We assume w.l.o.g. that for all . Associate a Boolean variable to each element and include the following constraints:

  • and for each , where ().
    It means that is mapped to exactly one of or .
    (Recall that we assume for all .)

  • for each pair such that holds or and have an ancestor-descendant relationship.
    It means that the condition of (#) must be satisfied.

Then, this 2-SAT instance is satisfiable iff holds. Since 2-SAT is solvable in polynomial time, we have the theorem. ∎
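The polynomial-time step used in the proof is standard 2-SAT solving. As a hedged illustration, the sketch below decides a 2-SAT instance with the usual implication graph and Kosaraju's strongly connected components algorithm. The encoding of literals as signed integers and the function name `solve_2sat` are assumptions of this example; the construction of the clauses from the trees is not shown.

```python
def solve_2sat(num_vars, clauses):
    """Return a satisfying assignment {var: bool} or None; literals are +/-var."""
    def node(lit):  # literal -> vertex of the implication graph
        return 2 * (abs(lit) - 1) + (lit < 0)

    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:  # (a or b) yields (not a -> b) and (not b -> a)
        for x, y in ((-a, b), (-b, a)):
            graph[node(x)].append(node(y))
            reverse[node(y)].append(node(x))

    order, seen = [], [False] * n
    for start in range(n):  # first pass: record vertices by finishing time
        if seen[start]:
            continue
        seen[start] = True
        stack = [(start, iter(graph[start]))]
        while stack:
            v, it = stack[-1]
            nxt = next((w for w in it if not seen[w]), None)
            if nxt is None:
                order.append(v)
                stack.pop()
            else:
                seen[nxt] = True
                stack.append((nxt, iter(graph[nxt])))

    comp, label = [-1] * n, 0
    for start in reversed(order):  # second pass on the reversed graph
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            v = stack.pop()
            for w in reverse[v]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    result = {}
    for var in range(1, num_vars + 1):
        pos, neg = comp[node(var)], comp[node(-var)]
        if pos == neg:
            return None  # a variable and its negation are equivalent
        result[var] = pos > neg  # components are numbered in topological order
    return result
```

A constraint of the form "exactly one of $x$ and $y$" corresponds to the two clauses `(x, y)` and `(-x, -y)`, and a constraint forbidding two choices at the same time corresponds to the single clause `(-x, -y)`.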

Figure 5: Illustration for Theorem 3.
Figure 6: For these trees, , , , , and ,

6 An -time Algorithm for the Case of

In this section, we present an time algorithm for the case of , where is the maximum outdegree of , , and .

The basic strategy is to use dynamic programming: decide whether in a bottom-up way. Suppose that has a set of children . Since we use dynamic programming, we can assume that is known for all and for all . We define by

The crucial task of the dynamic programming procedure is to find an injective mapping from to such that holds for all () and there is no ancestor/descendant relationship between any and (). If this task can be performed in time, the total complexity will be . We assume w.l.o.g. that is given as a set of mapping pairs. For , we define by

where (resp., ) denotes the set of ancestors (resp., descendants) of in where (resp., ).

Recall that is defined by

where . Let (resp., ) be the number of s such that (resp., ) (see also Figure 6). We assume w.l.o.g. that because means that is uniquely determined. From Theorem 3, we can see the following if there is no pair such that , , and .

  • The problem can be solved in time:
    For each such that (i.e., ), we choose (i.e., ) or not. Thus, there exist possibilities. After each choice, there is no such that and Theorem 3 can be applied.

  • The problem can also be solved in time:
    For each with (i.e., ), we choose or not. Thus, there are possibilities and after each choice, each with is removed or the problem can be reduced to bipartite matching as shown in Figure 7.

This means the problem can be solved in time. We denote the condition (i.e., the ‘if’ part of the above) and this algorithm by (##) and ALG-##, respectively. Therefore, the crucial point is how to (recursively) remove pairs such that , , and .
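The bipartite matching subproblem mentioned above can be solved by any standard method. As a minimal sketch, the following Python code uses Kuhn's augmenting-path algorithm. The input format (an adjacency list from left vertices to right vertices) and the function name are assumptions of this example, and how the graph is derived from the trees follows the reduction illustrated in Figure 7 rather than anything shown here.

```python
def max_bipartite_matching(adj, num_right):
    """Kuhn's algorithm; adj[i] lists the right vertices usable by left vertex i."""
    match_right = [-1] * num_right  # match_right[j] = left vertex matched to j

    def try_assign(i, visited):
        for j in adj[i]:
            if j in visited:
                continue
            visited.add(j)
            # Use j if it is free, or if its current partner can move elsewhere.
            if match_right[j] == -1 or try_assign(match_right[j], visited):
                match_right[j] = i
                return True
        return False

    size = sum(try_assign(i, set()) for i in range(len(adj)))
    return size, match_right
```

In the present setting, one would check whether the matching size equals the number of left vertices, i.e., whether every remaining child of the pattern node can be assigned a distinct candidate node.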

Figure 7: Example of the reduction to bipartite matching when there is no pair such that , ,