Improved Methods for Computing Distances between Unordered Trees Using Integer Programming

Abstract

Kondo et al. (DS 2014) proposed methods for computing distances between unordered rooted trees by transforming an instance of the distance computing problem into an instance of the integer programming problem. They showed that the tree edit distance, segmental distance, and bottom-up segmental distance problems can each be transformed into an integer program with $O(nm)$ variables and $O(n^2m^2)$ constraints, where $n$ and $m$ are the numbers of nodes of the input trees. In this work, we propose new integer programming formulations for these three distances and the bottom-up distance by applying a dynamic programming approach. We divide the tree edit distance problem into $O(nm)$ subproblems, each of which has only $O(n + m)$ constraints. For the other three distances, each subproblem can be reduced to a maximum weighted matching problem in a bipartite graph, which can be solved in polynomial time. To evaluate our methods, we compare them to the previous method of Kondo et al. The experimental results show that our methods perform remarkably better than the previous method.

1 Introduction

In machine learning applications, it is important to measure the (dis)similarity between tree-structured data such as XML documents and RNA secondary structures. There are many measures of similarity between two trees. The tree edit distance [16] is one of the most widely used; it is defined as the minimum cost of edit operations needed to transform one tree into another. Computing it is equivalent to finding a minimum cost Tai mapping between the two trees. However, the tree edit distance may not be appropriate in applications where structure-sensitivity is required. In this context, many variants of Tai mapping have been proposed (see [12], for example). In this study, we cover four measures: the edit distance, the segmental distance [9], the bottom-up segmental distance [9], and the bottom-up distance [17].

It is known that most distances between ordered rooted trees can be computed in polynomial time. For example, Tai [16] showed that the tree edit distance between ordered rooted trees can be computed in time polynomial in the numbers of nodes of the input trees, and Demaine et al. [4] improved the running time. However, if the input trees are unordered, the problems of computing the above four distances are known to be not only NP-hard [20], but also MAX SNP-hard [9, 17, 19]. Akutsu et al. studied the tree edit distance problem between unordered trees from a theoretical algorithmic perspective. They gave an approximation algorithm and exact algorithms [1, 2, 3]. From the practical point of view, much research has been done so far. Horesh et al. [7] proposed an A* algorithm to solve this problem for unlabeled unordered trees and Higuchi et al. [6] extended it to labeled trees. Fukagawa et al. [5] proposed a method to reduce the edit distance problem to the maximum vertex weighted clique problem, which can be solved by an algorithm due to [15]. They showed that the clique-based method is as fast as the A*-based method. Mori et al. [14] improved it by applying a dynamic programming approach. They showed that their method is faster than the previous clique-based method. Kondo et al. [11] proposed a method to reduce an instance of the edit distance problem to an instance of the integer linear programming (IP) problem with $O(nm)$ variables and $O(n^2m^2)$ constraints, where $n$ and $m$ are the numbers of nodes of the input trees. However, an instance of their IP formulation has a large number of constraints, and hence their method may not be applicable to moderate-sized instances. Although they showed that their method is faster than the clique-based method of Mori et al. [14] when the input trees have large-degree nodes, their IP-based method is not very effective when the input trees have no large-degree nodes or when the trees are large.

An advantage of the IP-based method is that we can easily obtain an IP formulation for a variation of the edit distance by adding some constraints. In fact, Kondo et al. gave IP formulations for the segmental distance and the bottom-up segmental distance by adding appropriate constraints. Another advantage of this method is that we can use state-of-the-art IP solvers (e.g. Gurobi, CPLEX), which can quickly solve many hard problems.

In this paper, we propose improved methods to compute the edit distance, segmental distance, bottom-up segmental distance and bottom-up distance between unordered rooted trees. The improvement in computational efficiency is obtained by applying a dynamic programming approach due to [14]. However, applying the dynamic programming alone is not sufficient; it is also necessary to exploit a structural property of rooted trees. The dynamic programming combined with this property allows us to drastically reduce the number of constraints in our IP formulations for the above distances. For the edit distance problem, our method has to solve $O(nm)$ subproblems, each of which has only $O(n + m)$ constraints. For the other distances, each subproblem, except the problem of combining the solutions of subproblems, can be reduced to the maximum weighted matching problem in a bipartite graph, which can be solved in polynomial time using the Hungarian method [13].

The rest of the paper is organized as follows. We give notation and preliminary results in Sect. 2 and briefly explain the previous method in Sect. 3. In Sect. 4, we introduce our new methods. To evaluate our methods, we implemented both the previous methods and ours, and conducted experiments using the Glycan dataset [10] and the CSLOGS dataset [18]. The results of our experiments are shown in Sect. 5. Finally, we conclude the paper with some discussion.

2 Preliminaries

Let $T$ be a rooted tree. The root of $T$ is denoted by $r(T)$. In this paper, we simply write $T$ to represent the set of nodes of $T$. For $u, v \in T$, $u \le v$ means that $u$ is on the unique path between the root and $v$. If $u \le v$ and $u \ne v$, we write $u < v$ and say that $u$ is an ancestor of $v$ and $v$ is a descendant of $u$. It is easy to see that the relation $\le$ is a partial order on $T$. The parent of $v$, denoted by $par(v)$, is the closest ancestor of $v$. The children of $v$, denoted by $ch(v)$, are the closest nodes to $v$ among all descendants of $v$. We call the number of children of $v$ the degree of $v$. A node is called a leaf if it has no children. The set of all leaves of a tree $T$ is denoted by $L(T)$. Nodes $u$ and $v$ are siblings if they have the same parent. A tree is called an unordered tree if there is no order between siblings. Let $\Sigma$ be a finite alphabet and $l : T \to \Sigma$ a labeling function. A tuple $(T, l)$ is called a labeled tree. For $v \in T$, we use $T(v)$ to denote the subtree of $T$ rooted at $v$. For notational convenience, we simply write $T - v$ to denote the subgraph of $T$ obtained by removing a node $v$.

2.1 Tree Edit Distance

The tree edit distance between two trees is defined as the minimum cost of edit operations to transform a tree into another.

Definition 1 (Edit Operations).

Let T be a tree. Edit operations on T consist of the following three operations.

Substitution

Replace the label of a node in $T$ with a new label.

Deletion

Delete a non-root node $v$ of $T$, making all children of $v$ children of the parent of $v$.

Insertion

Insert a new node $v$ as a child of some node $u$ in $T$, making some children of $u$ children of $v$.

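Let $\Sigma_\varepsilon = \Sigma \cup \{\varepsilon\}$, where $\varepsilon$ is a blank symbol not in $\Sigma$. In order to describe costs on edit operations, we denote each of the edit operations by a pair in $(\Sigma_\varepsilon \times \Sigma_\varepsilon) \setminus \{(\varepsilon, \varepsilon)\}$. Substituting a node labeled with $a$ by another node labeled with $b$ is denoted by $(a, b)$. Inserting a node labeled with $b$ is denoted by $(\varepsilon, b)$. Deleting a node labeled with $a$ is denoted by $(a, \varepsilon)$. Let $d$ be a cost function on edit operations and assume, in this paper, that $d$ is a metric. In the following, we simply write $d(u, v)$ for $u \in T_1$ and $v \in T_2$ to represent $d(l_1(u), l_2(v))$, where $l_1$ and $l_2$ are labeling functions on two trees $T_1$ and $T_2$, respectively; likewise, $d(u, \varepsilon)$ and $d(\varepsilon, v)$ stand for $d(l_1(u), \varepsilon)$ and $d(\varepsilon, l_2(v))$.

Let $E = e_1, \ldots, e_k$ be a sequence of edit operations, where $e_i \in (\Sigma_\varepsilon \times \Sigma_\varepsilon) \setminus \{(\varepsilon, \varepsilon)\}$ for $1 \le i \le k$. The cost of the sequence $E$ is defined as $\gamma(E) = \sum_{i=1}^{k} d(e_i)$.

Definition 2 (Tree Edit Distance [16]).

Let $T_1$ and $T_2$ be trees and $\mathcal{E}(T_1, T_2)$ be the set of all sequences of edit operations which transform $T_1$ into $T_2$. The tree edit distance between $T_1$ and $T_2$, denoted by $D(T_1, T_2)$, is defined as

$$D(T_1, T_2) = \min_{E \in \mathcal{E}(T_1, T_2)} \gamma(E).$$

A mapping between $T_1$ and $T_2$ is a subset of $T_1 \times T_2$. The set of nodes that belong to a mapping $M$ is denoted by $V(M)$. Tai [16] gave a combinatorial characterization of the tree edit distance by means of a mapping, which is called a Tai mapping.

Definition 3 (Tai Mapping [16]).

Let $T_1$ and $T_2$ be trees. A mapping $M \subseteq T_1 \times T_2$ is called a Tai mapping if it satisfies the following constraints for every $(u_1, v_1), (u_2, v_2)$ in $M$:

One-to-one correspondence:

$u_1 = u_2 \iff v_1 = v_2$,

Preserving ancestor-descendant relationship:

$u_1 < u_2 \iff v_1 < v_2$, that is, $u_1$ is a proper ancestor of $u_2$ exactly when $v_1$ is a proper ancestor of $v_2$.

The cost of a Tai mapping $M$ is defined as

$$\gamma(M) = \sum_{(u, v) \in M} d(u, v) + \sum_{u \in T_1 \setminus V(M)} d(u, \varepsilon) + \sum_{v \in T_2 \setminus V(M)} d(\varepsilon, v).$$

Let $\mathcal{M}_{Tai}(T_1, T_2)$ be the set of all Tai mappings between $T_1$ and $T_2$. Tai [16] showed the following theorem.

Theorem 1 ([16]).

For two trees $T_1$ and $T_2$, $D(T_1, T_2) = \min_{M \in \mathcal{M}_{Tai}(T_1, T_2)} \gamma(M)$.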
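The two conditions of a Tai mapping can be checked mechanically. The following is a minimal Python sketch, not taken from the paper: it assumes each tree is given as a hypothetical `parent` dictionary mapping a node to its parent (`None` for the root), and a mapping as a set of node pairs.

```python
from itertools import combinations

def ancestors(parent, v):
    """Return the set of proper ancestors of v by following parent pointers."""
    result = set()
    while parent[v] is not None:
        v = parent[v]
        result.add(v)
    return result

def is_tai_mapping(mapping, parent1, parent2):
    """Check one-to-one correspondence and ancestor-descendant preservation."""
    for (u1, v1), (u2, v2) in combinations(mapping, 2):
        # One-to-one: two distinct pairs may share neither endpoint.
        if u1 == u2 or v1 == v2:
            return False
        # Ancestor preservation, checked in both directions.
        if (u1 in ancestors(parent1, u2)) != (v1 in ancestors(parent2, v2)):
            return False
        if (u2 in ancestors(parent1, u1)) != (v2 in ancestors(parent2, v1)):
            return False
    return True
```

For a tree with root 0 and children 1 and 2, and a second tree with root 0 and a single child 1, the mapping `{(0, 0), (2, 1)}` passes the check, whereas `{(1, 0), (0, 1)}` fails because it reverses an ancestor-descendant pair.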

2.2 Variants of Edit Distance

The tree edit distance is one of the most widely used measures of similarity between two trees. However, it may not be appropriate for some applications because one may need a distance that reflects some specific structure of the trees. Many variants of the tree edit distance have been proposed in the literature [9, 17]. We work on the following three variants, which are defined by mappings rather than edit operations.

Definition 4 (Segmental Mapping [9]).

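Let $T_1$ and $T_2$ be trees. A Tai mapping $M$ between $T_1$ and $T_2$ is called a segmental mapping if for any $(u, v), (u', v') \in M$ with $u' < u$ and $v' < v$, $(par(u), par(v)) \in M$.

Definition 5 (Bottom-up Segmental Mapping [9]).

Let $T_1$ and $T_2$ be trees. A segmental mapping $M$ between $T_1$ and $T_2$ is called a bottom-up segmental mapping if for any $(u, v) \in M$, there is $(u', v') \in M$ such that $u', v'$ are leaves with $u \le u'$ and $v \le v'$.

Definition 6 (Bottom-up Mapping [17]).

Let $T_1$ and $T_2$ be trees. A Tai mapping $M$ between $T_1$ and $T_2$ is called a bottom-up mapping if for any $(u, v) \in M$, the submapping obtained from $M$ by restricting to $T_1(u) \times T_2(v)$ forms a bijection between $T_1(u)$ and $T_2(v)$.

Let us note that the condition in Definition 6 can be restated in the following way: $M$ is a bottom-up mapping if for any $(u, v) \in M$, the submapping obtained from $M$ by restricting to $T_1(u) \times T_2(v)$ is an isomorphism mapping, ignoring the label information.

Definition 7 ([9, 17]).

Let $T_1$ and $T_2$ be trees. Denote the sets of all possible segmental mappings, bottom-up segmental mappings, and bottom-up mappings between $T_1$ and $T_2$ by $\mathcal{M}_{Sg}(T_1, T_2)$, $\mathcal{M}_{BotSg}(T_1, T_2)$ and $\mathcal{M}_{Bot}(T_1, T_2)$, respectively. The segmental distance, bottom-up segmental distance, and bottom-up distance between $T_1$ and $T_2$, which are denoted by $D_{Sg}(T_1, T_2)$, $D_{BotSg}(T_1, T_2)$ and $D_{Bot}(T_1, T_2)$ respectively, are defined as follows:

$$D_{Sg}(T_1, T_2) = \min_{M \in \mathcal{M}_{Sg}(T_1, T_2)} \gamma(M), \quad D_{BotSg}(T_1, T_2) = \min_{M \in \mathcal{M}_{BotSg}(T_1, T_2)} \gamma(M), \quad D_{Bot}(T_1, T_2) = \min_{M \in \mathcal{M}_{Bot}(T_1, T_2)} \gamma(M),$$

where $\gamma(M)$ denotes the cost of the mapping $M$.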

3 Previous Method [11]

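In the rest of this paper, fix input trees $T_1$ and $T_2$, and let $n = |T_1|$ and $m = |T_2|$. Kondo et al. [11] proposed an integer linear programming formulation for the tree edit distance. For the tree edit distance between $T_1$ and $T_2$, we introduce a binary variable $x_{u,v}$ for every $(u, v) \in T_1 \times T_2$ which takes value 1 if and only if $(u, v) \in M$. Then, we can reformulate the cost of a Tai mapping $M$ as:

$$\sum_{u \in T_1} \sum_{v \in T_2} d(u, v)\, x_{u,v} + \sum_{u \in T_1} d(u, \varepsilon) \Bigl(1 - \sum_{v \in T_2} x_{u,v}\Bigr) + \sum_{v \in T_2} d(\varepsilon, v) \Bigl(1 - \sum_{u \in T_1} x_{u,v}\Bigr).$$

The two constraints of Tai mapping are directly formulated as the following inequalities:

$$\sum_{v \in T_2} x_{u,v} \le 1 \quad \text{for all } u \in T_1,$$

$$\sum_{u \in T_1} x_{u,v} \le 1 \quad \text{for all } v \in T_2,$$

$$x_{u_1,v_1} + x_{u_2,v_2} \le 1 \quad \text{for all } (u_1, v_1), (u_2, v_2) \text{ such that exactly one of } u_1 < u_2 \text{ and } v_1 < v_2 \text{ holds}.$$

The first two constraints are equivalent to the one-to-one correspondence of Tai mapping. They mean that for any node $u \in T_1$ (resp. $v \in T_2$), at most one node of $T_2$ (resp. $T_1$) is allowed to be paired. The third constraint is equivalent to the ancestor-descendant preservation. It means that two pairs which do not preserve the ancestor-descendant relationship cannot both be included in $M$. This formulation contains $O(nm)$ variables and $O(n^2m^2)$ constraints.

Kondo et al. also gave IP formulations for the segmental distance and bottom-up segmental distance. These distances can be formulated by imposing additional constraints on the above formulation. The constraints of segmental mapping can be represented as follows, where $par(\cdot)$ denotes the parent of a node:

$$x_{u,v} + x_{u',v'} \le x_{par(u),par(v)} + 1 \quad \text{for all } (u, v), (u', v') \text{ with } u' < u \text{ and } v' < v.$$

The constraints of bottom-up segmental mapping can be represented as follows, where $L(T_1(u))$ and $L(T_2(v))$ denote the sets of leaves of the subtrees rooted at $u$ and $v$:

$$x_{u,v} \le \sum_{u' \in L(T_1(u))} \sum_{v' \in L(T_2(v))} x_{u',v'} \quad \text{for all } (u, v) \in T_1 \times T_2.$$

The above two formulations also contain $O(nm)$ variables and $O(n^2m^2)$ constraints.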
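To get a feel for how quickly the number of constraints grows, the following Python sketch enumerates the variables and constraints of a pairwise formulation of Tai mapping for two small trees. It is an illustration under stated assumptions rather than the authors' implementation: trees are hypothetical `parent` dictionaries, and a conflict constraint is generated for every unordered pair of node pairs that breaks ancestor-descendant preservation.

```python
from itertools import combinations, product

def proper_ancestor(parent, a, b):
    """True if a is a proper ancestor of b."""
    b = parent[b]
    while b is not None:
        if b == a:
            return True
        b = parent[b]
    return False

def naive_formulation(parent1, parent2):
    """Return (variables, one-to-one constraints, conflict constraints)."""
    variables = list(product(parent1, parent2))  # one x[u, v] per node pair
    one_to_one = [("T1", u) for u in parent1] + [("T2", v) for v in parent2]
    conflicts = []
    for (u1, v1), (u2, v2) in combinations(variables, 2):
        if u1 == u2 or v1 == v2:
            continue  # already excluded by the one-to-one constraints
        forward = proper_ancestor(parent1, u1, u2) != proper_ancestor(parent2, v1, v2)
        backward = proper_ancestor(parent1, u2, u1) != proper_ancestor(parent2, v2, v1)
        if forward or backward:
            conflicts.append(((u1, v1), (u2, v2)))  # x[u1,v1] + x[u2,v2] <= 1
    return variables, one_to_one, conflicts
```

For two paths on two nodes each, this yields 4 variables, 4 one-to-one constraints, and a single conflict constraint (the crossing pair); the conflict list is what grows quadratically in the number of variables on larger inputs.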

4 Improved Method

4.1 Improved Method for Tree Edit Distance

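In this section, we propose a new IP formulation for the edit distance problem by combining it with a dynamic programming approach due to [14]. The dynamic programming computes a minimum cost Tai mapping $M$ between $T_1(u)$ and $T_2(v)$ with $(u, v) \in M$ for $u \in T_1$ and $v \in T_2$ in a bottom-up manner. Once we have the solutions for all pairs $(u, v)$, we can construct a minimum cost Tai mapping between $T_1$ and $T_2$.

First, we modify the objective function

$$\text{minimize} \quad \sum_{u \in T_1} \sum_{v \in T_2} d(u, v)\, x_{u,v} + \sum_{u \in T_1} d(u, \varepsilon) \Bigl(1 - \sum_{v \in T_2} x_{u,v}\Bigr) + \sum_{v \in T_2} d(\varepsilon, v) \Bigl(1 - \sum_{u \in T_1} x_{u,v}\Bigr)$$

to

$$\text{maximize} \quad \sum_{u \in T_1} \sum_{v \in T_2} w(u, v)\, x_{u,v},$$

where $w(u, v) = d(u, \varepsilon) + d(\varepsilon, v) - d(u, v)$. This modification is valid since the original objective equals $\sum_{u \in T_1} d(u, \varepsilon) + \sum_{v \in T_2} d(\varepsilon, v) - \sum_{u \in T_1} \sum_{v \in T_2} w(u, v)\, x_{u,v}$, and the first two terms are constants that do not affect the minimization.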
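The rewriting of the objective can be checked numerically. The sketch below is an illustrative assumption-laden example, not the authors' code: it uses the unit cost, represents the blank symbol by `None`, and takes $w(u, v) = d(u, \varepsilon) + d(\varepsilon, v) - d(u, v)$; the helper names are hypothetical.

```python
def unit_cost(a, b):
    """Unit cost on labels; None stands for the blank symbol."""
    return 0 if a == b else 1

def mapping_cost(mapping, label1, label2):
    """Cost of a mapping: substitutions plus deletions plus insertions."""
    mapped1 = {u for u, _ in mapping}
    mapped2 = {v for _, v in mapping}
    return (sum(unit_cost(label1[u], label2[v]) for u, v in mapping)
            + sum(unit_cost(label1[u], None) for u in label1 if u not in mapped1)
            + sum(unit_cost(None, label2[v]) for v in label2 if v not in mapped2))

def weight(u, v, label1, label2):
    """w(u, v) = d(u, eps) + d(eps, v) - d(u, v)."""
    return (unit_cost(label1[u], None) + unit_cost(None, label2[v])
            - unit_cost(label1[u], label2[v]))

def mapping_weight(mapping, label1, label2):
    """Total weight of a mapping, the quantity to be maximized."""
    return sum(weight(u, v, label1, label2) for u, v in mapping)
```

With labels `{0: 'a', 1: 'b', 2: 'c'}` and `{0: 'a', 1: 'c'}`, the constant term is 3 + 2 = 5; the mapping `{(0, 0), (2, 1)}` has weight 4 and cost 5 - 4 = 1, so minimizing the cost and maximizing the weight select the same mappings.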

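Since the solution of our subproblem for $u \in T_1$ and $v \in T_2$ must contain the root pair $(u, v)$, the objective function on the input trees $T_1(u)$ and $T_2(v)$ can be represented as

$$w(u, v) + \sum_{u' \in T_1(u) - u}\ \sum_{v' \in T_2(v) - v} w(u', v')\, x_{u',v'}. \qquad (1)$$

We denote by $W(u, v)$ the maximum value of (1). If at least one of $u$ and $v$ is a leaf, $W(u, v) = w(u, v)$. Thus, in the following, we assume that neither $u$ nor $v$ is a leaf. The idea for our dynamic programming is that $W(u, v)$ can be recursively computed from the values $W(u', v')$ for $u' \in T_1(u) - u$ and $v' \in T_2(v) - v$. To be precise, for a Tai mapping $M$, we let $M|_1$ and $M|_2$ denote $\{u' : (u', v') \in M\}$ and $\{v' : (u', v') \in M\}$, respectively, and we let $\mathcal{M}^A(u, v)$ be the set of all Tai mappings $M$ between $T_1(u) - u$ and $T_2(v) - v$ such that both $M|_1$ and $M|_2$ are antichains in $T_1$ and $T_2$, respectively. The following lemma is a key ingredient of our formulation.

Lemma 1.

$$W(u, v) = w(u, v) + \max_{M \in \mathcal{M}^A(u, v)} \sum_{(u', v') \in M} W(u', v').$$

Proof.

We first show that the left-hand side is at most the right-hand side. Let $M$ be a Tai mapping between $T_1(u)$ and $T_2(v)$ with $(u, v) \in M$. Then, $M \setminus \{(u, v)\}$ can be uniquely decomposed into $M^1, \ldots, M^k$ such that for any $1 \le i \le k$, $M^i$ is a Tai mapping between $T_1(u_i)$ and $T_2(v_i)$ with $(u_i, v_i) \in M^i$ and $\{(u_1, v_1), \ldots, (u_k, v_k)\} \in \mathcal{M}^A(u, v)$. Such a decomposition can be obtained by choosing minimal node pairs with respect to $\le$: For any $(u', v') \in M \setminus \{(u, v)\}$ and any $i$, either $u_i \le u'$ and $v_i \le v'$, or $u'$ and $v'$ are not comparable to $u_i$ and $v_i$, respectively. For each $i$, we have $\sum_{(u', v') \in M^i} w(u', v') \le W(u_i, v_i)$. Therefore, $\sum_{(u', v') \in M} w(u', v') \le w(u, v) + \sum_{i=1}^{k} W(u_i, v_i)$.

To show the converse, let $M^* \in \mathcal{M}^A(u, v)$ be a mapping maximizing the right-hand side. For each $(u', v') \in M^*$, we let $M_{u',v'}$ be a Tai mapping between $T_1(u')$ and $T_2(v')$ such that $(u', v') \in M_{u',v'}$ and the weight of $M_{u',v'}$ equals $W(u', v')$. Since $M^*|_1$ and $M^*|_2$ are antichains, $\{(u, v)\} \cup \bigcup_{(u', v') \in M^*} M_{u',v'}$ is a Tai mapping between $T_1(u)$ and $T_2(v)$. Therefore, we have $W(u, v) \ge w(u, v) + \sum_{(u', v') \in M^*} W(u', v')$ and hence the lemma holds. ∎

By Lemma 1, our problem is to maximize

$$\sum_{u' \in T_1(u) - u}\ \sum_{v' \in T_2(v) - v} W(u', v')\, x_{u',v'}$$

subject to $\{(u', v') : x_{u',v'} = 1\} \in \mathcal{M}^A(u, v)$.

Mori et al. [14] reduced the problem of finding a maximum weight Tai mapping in $\mathcal{M}^A(u, v)$ to the maximum vertex weight clique problem, which corresponds to the maximum weight independent set problem on complement graphs. Their reduction can be interpreted as the following constraint:

$$x_{u_1,v_1} + x_{u_2,v_2} \le 1 \quad \text{for all distinct } (u_1, v_1), (u_2, v_2) \text{ such that } u_1 \text{ and } u_2 \text{ are comparable or } v_1 \text{ and } v_2 \text{ are comparable}.$$

However, this formulation contains $O(n^2m^2)$ constraints.

In order to reduce the number of constraints, we exploit a structure of rooted trees. For a node $u$ and a leaf $l \in L(T(u))$, let $P(u, l)$ be the unique path between $u$ and $l$ in $T$. Then, for any $M \in \mathcal{M}^A(u, v)$ and any $l \in L(T_1(u))$ (resp. $l' \in L(T_2(v))$), at most one node of $P(u, l)$ (resp. $P(v, l')$) can be chosen in $M$, that is,

$$\sum_{u' \in P(u, l) - u}\ \sum_{v' \in T_2(v) - v} x_{u',v'} \le 1 \quad \text{and} \quad \sum_{v' \in P(v, l') - v}\ \sum_{u' \in T_1(u) - u} x_{u',v'} \le 1.$$

This is formalized by the following lemma.

Lemma 2.

Let $u \in T_1$ and $v \in T_2$. Then, $W(u, v)$ can be computed by the following IP.

$$\text{maximize} \quad w(u, v) + \sum_{u' \in T_1(u) - u}\ \sum_{v' \in T_2(v) - v} W(u', v')\, x_{u',v'}$$

$$\text{subject to} \quad \sum_{u' \in P(u, l) - u}\ \sum_{v' \in T_2(v) - v} x_{u',v'} \le 1 \quad \text{for all } l \in L(T_1(u)),$$

$$\sum_{v' \in P(v, l') - v}\ \sum_{u' \in T_1(u) - u} x_{u',v'} \le 1 \quad \text{for all } l' \in L(T_2(v)),$$

$$x_{u',v'} \in \{0, 1\} \quad \text{for all } u' \in T_1(u) - u,\ v' \in T_2(v) - v.$$

Proof.

By Lemma 1, it suffices to prove that $M = \{(u', v') : x_{u',v'} = 1\}$ is in $\mathcal{M}^A(u, v)$ if and only if $x$ is a feasible solution.

Suppose first that $M \in \mathcal{M}^A(u, v)$. Since $M|_1$ forms an antichain in $T_1$, $M|_1$ has at most one node in $P(u, l)$ for each $l \in L(T_1(u))$. Therefore, the binary variables $x$ do not violate the first type of constraints. A symmetric argument for $M|_2$ implies that $x$ is a feasible solution for the IP.

Suppose, for contradiction, that $x$ is a feasible solution and there are $(u_1, v_1), (u_2, v_2)$ in $M$ that violate the condition of $\mathcal{M}^A(u, v)$. There are two possibilities: $(u_1, v_1)$ and $(u_2, v_2)$ violate the one-to-one correspondence of Tai mapping, or at least one of $u_1 < u_2$ or $v_1 < v_2$ holds. For the former case, assume without loss of generality that $u_1 = u_2$ and $v_1 \ne v_2$. In this case, the pairs contribute at least two to a constraint for each $l \in L(T_1(u_1))$, which contradicts the feasibility of $x$. For the latter case, assume without loss of generality that $u_1 < u_2$. In this case, there is a path $P(u, l)$ that contains both $u_1$ and $u_2$. The pairs contribute at least two to a constraint for such $l$, which also contradicts the feasibility of $x$. Therefore, the lemma holds. ∎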
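The structural property behind Lemma 2 is that a set of nodes below a given root is an antichain exactly when every root-to-leaf path contains at most one of them, so one constraint per leaf suffices. The Python sketch below illustrates this under the assumption that a tree is a hypothetical `children` dictionary mapping each node to the list of its children; the function names are not from the paper.

```python
from itertools import combinations

def leaf_paths(children, root):
    """Yield, for every leaf below root, the path from root to it (root excluded)."""
    stack = [(c, [c]) for c in children[root]]
    while stack:
        node, path = stack.pop()
        if not children[node]:
            yield path
        for c in children[node]:
            stack.append((c, path + [c]))

def descendants(children, v):
    """Return the set of proper descendants of v."""
    out, todo = set(), list(children[v])
    while todo:
        x = todo.pop()
        out.add(x)
        todo.extend(children[x])
    return out

def is_antichain(children, nodes):
    """True if no chosen node is a descendant of another chosen node."""
    return all(a not in descendants(children, b) and b not in descendants(children, a)
               for a, b in combinations(nodes, 2))

def satisfies_path_constraints(children, root, nodes):
    """True if every root-to-leaf path contains at most one chosen node."""
    chosen = set(nodes)
    return all(len(chosen.intersection(p)) <= 1 for p in leaf_paths(children, root))
```

On a tree with root 0, children 1 and 2, and grandchildren 3 and 4 below node 1, there are three leaf paths, hence three constraints, and the two predicates agree on every subset of `{1, 2, 3, 4}`.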

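For $u \in T_1$ and $v \in T_2$, we can compute $W(u, v)$ by using the formulation of Lemma 2. The remaining task is to compute the edit distance $D(T_1, T_2)$ from the values $W(u, v)$.

Theorem 2.

Let $W^*$ be the optimal value of the following IP. Then, $D(T_1, T_2) = \sum_{u \in T_1} d(u, \varepsilon) + \sum_{v \in T_2} d(\varepsilon, v) - W^*$.

$$\text{maximize} \quad \sum_{u \in T_1} \sum_{v \in T_2} W(u, v)\, x_{u,v}$$

$$\text{subject to} \quad \sum_{u \in P(r(T_1), l)}\ \sum_{v \in T_2} x_{u,v} \le 1 \quad \text{for all } l \in L(T_1),$$

$$\sum_{v \in P(r(T_2), l')}\ \sum_{u \in T_1} x_{u,v} \le 1 \quad \text{for all } l' \in L(T_2),$$

$$x_{u,v} \in \{0, 1\} \quad \text{for all } u \in T_1,\ v \in T_2.$$

Here $r(T)$ denotes the root of $T$, $L(T)$ its set of leaves, and $P(r(T), l)$ the path from the root to the leaf $l$. The proof of Theorem 2 is analogous to those of Lemmas 1 and 2. Our method has $O(nm)$ subproblems. Each subproblem, however, contains $O(nm)$ variables and only $O(n + m)$ constraints.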
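The overall scheme (solve one subproblem per node pair bottom-up, then combine) can be sketched for very small trees as follows. This is an illustration only and rests on several assumptions: the unit cost is used, trees are hypothetical `children` and `label` dictionaries with integer node ids, and an exhaustive search over antichain mappings stands in for the IP solver, so the running time is exponential.

```python
from functools import lru_cache

def subtree(children, v):
    """Nodes of the subtree rooted at v, in preorder."""
    out = [v]
    for c in children[v]:
        out.extend(subtree(children, c))
    return out

def edit_distance(children1, label1, root1, children2, label2, root2):
    desc1 = {u: set(subtree(children1, u)) - {u} for u in label1}
    desc2 = {v: set(subtree(children2, v)) - {v} for v in label2}

    def w(u, v):
        # Unit cost: d(u, eps) + d(eps, v) - d(u, v).
        return 2 if label1[u] == label2[v] else 1

    def comparable(desc, a, b):
        return a == b or a in desc[b] or b in desc[a]

    def best_antichain(nodes1, nodes2):
        """Max total W over mappings whose two sides are antichains (brute force)."""
        pairs = [(u, v) for u in nodes1 for v in nodes2]

        def search(i, chosen):
            if i == len(pairs):
                return 0
            skip = search(i + 1, chosen)
            u, v = pairs[i]
            if any(comparable(desc1, u, a) or comparable(desc2, v, b) for a, b in chosen):
                return skip
            return max(skip, W(u, v) + search(i + 1, chosen + [(u, v)]))

        return search(0, [])

    @lru_cache(maxsize=None)
    def W(u, v):
        # Root pair plus the best antichain mapping strictly below it.
        return w(u, v) + best_antichain(sorted(desc1[u]), sorted(desc2[v]))

    # Combine the subproblem values over the whole trees, roots included.
    total = best_antichain(subtree(children1, root1), subtree(children2, root2))
    return len(label1) + len(label2) - total
```

For the tree a(b, c) against a(c), the sketch returns 1, corresponding to deleting the node labeled b. Replacing `best_antichain` by an IP with one constraint per leaf is what keeps each subproblem small in the method described above.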

4.2 Improved Methods for Variants of Edit Distance

Figure 1: The figure illustrates the reduction from the maximum segmental mapping problem to the maximum matching problem in a bipartite graph.

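As the edit distance was computed in the previous section, the other distances can also be computed in the same manner: for each $u \in T_1$ and $v \in T_2$, compute the maximum weight $W(u, v)$ of a mapping of the required type that contains $(u, v)$, and then combine the solutions of subproblems as in Theorem 2.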

Segmental Distance

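Let $u$ and $v$ be nodes of two trees $T_1$ and $T_2$, respectively. We denote here by $W^{Sg}(u, v)$ the maximum weight, that is the maximum value of (1), of segmental mappings $M$ between $T_1(u)$ and $T_2(v)$ with $(u, v) \in M$. If either $u$ or $v$ is a leaf, we have $W^{Sg}(u, v) = w(u, v)$. Thus, we suppose otherwise. Suppose we have already computed $W^{Sg}(u', v')$ for each child $u'$ of $u$ and each child $v'$ of $v$. Observe that for any segmental mapping $M$ with $(u, v) \in M$, a child of $u$ must be paired with a child of $v$ in $M$. Moreover, if a descendant $u'$ of $u$ that is not a child of $u$ is in $M$, the child of $u$ that is an ancestor of $u'$ must be in $M$. These observations imply that $M$ can be constructed as a union of mappings $M_{u',v'}$ for children $u'$ of $u$ and children $v'$ of $v$, where $M_{u',v'}$ is a mapping between $T_1(u')$ and $T_2(v')$ with $(u', v') \in M_{u',v'}$. Therefore, in order to compute $W^{Sg}(u, v)$, we construct a bipartite graph as follows. For each child of $u$ and each child of $v$, we create a vertex, and for each child $u'$ of $u$ and each child $v'$ of $v$, we add an edge between $u'$ and $v'$ whose weight equals $W^{Sg}(u', v')$ as in Fig. 1. Then $W^{Sg}(u, v)$ equals $w(u, v)$ plus the weight of a maximum weight matching in this graph. It is well known that a maximum weight bipartite matching can be found in polynomial time using the Hungarian method [13].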
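The reduction to bipartite matching can be sketched in a few lines. The Python example below is a hedged illustration, not the authors' implementation: it assumes the unit cost, hypothetical `children` and `label` dictionaries, and uses an exhaustive matching routine in place of the Hungarian method, which is adequate only for nodes with few children.

```python
from functools import lru_cache

def segmental_weights(children1, label1, children2, label2):
    """Return a function W(u, v): max weight of a segmental mapping containing (u, v)."""

    def w(u, v):
        # Unit cost: d(u, eps) + d(eps, v) - d(u, v).
        return 2 if label1[u] == label2[v] else 1

    def max_matching(left, right, weight):
        """Brute-force maximum weight bipartite matching (need not be perfect)."""
        if not left or not right:
            return 0
        first, rest = left[0], left[1:]
        best = max_matching(rest, right, weight)  # leave `first` unmatched
        for i, r in enumerate(right):
            remaining = right[:i] + right[i + 1:]
            best = max(best, weight(first, r) + max_matching(rest, remaining, weight))
        return best

    @lru_cache(maxsize=None)
    def W(u, v):
        # Root pair plus a maximum weight matching between the two child sets.
        return w(u, v) + max_matching(tuple(children1[u]), tuple(children2[v]), W)

    return W
```

For a(b, c) against a(c, d), the child-level weights are 2 for the pair of nodes labeled c and 1 for every other pair, so the best matching has weight 3 and the value at the root pair is 2 + 3 = 5.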

When $W^{Sg}(u, v)$ is computed for each $u \in T_1$ and $v \in T_2$, we can compute the segmental distance between $T_1$ and $T_2$ by Theorem 2.

Bottom-up Segmental Distance

Because any bottom-up segmental mapping is a segmental mapping, the above observations also hold and each subproblem can be reduced to a maximum weight matching problem in a bipartite graph as well. The only difference from the case of segmental distance is that every segment must include at least one leaf. To this end, we need to exclude the following two cases from our solution. If exactly one of $u$ and $v$ is a leaf, then $W^{BotSg}(u, v)$ must be zero since $(u, v)$ violates the condition of bottom-up segmental mapping. The other case is that neither $u$ nor $v$ is a leaf and the solution of the maximum weight matching equals zero. This implies that an optimal mapping between $T_1(u)$ and $T_2(v)$ consists of the single pair $(u, v)$, which also violates the condition of bottom-up segmental mapping. Therefore, we set $W^{BotSg}(u, v) = 0$ in this case.
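The two excluded cases translate directly into a case analysis on top of the matching step. The following Python sketch is illustrative and makes the same assumptions as before (unit cost, hypothetical `children` and `label` dictionaries, exhaustive matching instead of the Hungarian method).

```python
from functools import lru_cache

def bottom_up_segmental_weights(children1, label1, children2, label2):
    """Return W(u, v) for bottom-up segmental mappings containing (u, v)."""

    def w(u, v):
        # Unit cost: d(u, eps) + d(eps, v) - d(u, v).
        return 2 if label1[u] == label2[v] else 1

    def max_matching(left, right):
        """Brute-force maximum weight matching between two child tuples."""
        if not left or not right:
            return 0
        first, rest = left[0], left[1:]
        best = max_matching(rest, right)  # leave `first` unmatched
        for i, r in enumerate(right):
            best = max(best, W(first, r) + max_matching(rest, right[:i] + right[i + 1:]))
        return best

    @lru_cache(maxsize=None)
    def W(u, v):
        leaf_u, leaf_v = not children1[u], not children2[v]
        if leaf_u and leaf_v:
            return w(u, v)   # a leaf pair is a valid segment on its own
        if leaf_u or leaf_v:
            return 0         # exactly one leaf: the pair cannot be used
        below = max_matching(tuple(children1[u]), tuple(children2[v]))
        # A zero matching would leave (u, v) as a segment without any leaf.
        return w(u, v) + below if below > 0 else 0

    return W
```

For a(b) against a(b(c)), the pair of nodes labeled b gets weight 0 because only one of them is a leaf, and consequently the root pair also gets weight 0.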

Bottom-up Distance

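First, we propose a naive IP formulation for computing the bottom-up distance. A straightforward implication of Definition 6 is that if $(u, v) \in M$, the mapping between $T_1(u)$ and $T_2(v)$ must be a bijection. The formulation can be obtained from that of Tai mapping by adding the following constraints:

$$x_{u,v} \le \sum_{v' \in T_2(v)} x_{u',v'} \quad \text{for all } (u, v) \in T_1 \times T_2 \text{ and } u' \in T_1(u),$$

$$x_{u,v} \le \sum_{u' \in T_1(u)} x_{u',v'} \quad \text{for all } (u, v) \in T_1 \times T_2 \text{ and } v' \in T_2(v).$$

This formulation contains $O(nm)$ variables and inherits the $O(n^2m^2)$ ancestor-descendant constraints of the Tai mapping formulation.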

Since bottom-up mapping is a subclass of bottom-up segmental mapping, we can apply the above technique as well. All we have to do is to consider the case when the two trees $T_1(u)$ and $T_2(v)$ are structurally isomorphic, i.e., isomorphic when the labels are ignored. Thus, for $u \in T_1$ and $v \in T_2$, we set $W^{Bot}(u, v) = 0$ if the two subtrees $T_1(u)$ and $T_2(v)$ are not structurally isomorphic.

Our improved methods contain $O(nm)$ subproblems, each of which can be solved in polynomial time. For combining the solutions of these subproblems, we need to solve the integer program in Theorem 2. Such an IP also has $O(nm)$ variables and $O(n + m)$ constraints.

5 Experiments

To compare the experimental performance of our methods and the previous methods, we applied them to real tree-structured data. We used glycan data obtained from the KEGG/Glycan database [10] and the CSLOGS dataset [18], which consists of web log files. In our experiments, we adopt the unit cost for the cost function, which is defined as:

$$d(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{otherwise.} \end{cases}$$

We implemented the previous methods for computing edit distance (IP_Edit), segmental distance (IP_Sg), and bottom-up segmental distance (IP_BotSg) given by Kondo et al. [11] and a naive method for computing bottom-up distance (IP_Bot) described in the previous section. We also implemented our methods for computing these four distances (DpIP_Edit, DpIP_Sg, DpIP_BotSg, and DpIP_Bot). In addition to the above implementations, we intended to compare our methods with the algorithm due to Mori et al. [14]. Their algorithm reduces the tree edit distance problem to the maximum weight clique problem and uses the maximum weight clique algorithm due to [15]. However, the purpose of our experiments is to compare formulations or reductions rather than the performance of specific IP or other solvers. Therefore, we used an ordinary IP formulation of the maximum weight clique problem instead of the algorithm of [15], which is denoted by IP_DpClique_E.

We implemented the methods mentioned above in Java 1.8 combined with IBM ILOG CPLEX 12.7. We forced CPLEX to run in sequential mode by setting the parameter IloCplex.IntParam.Threads to one. Every implementation of the presented methods is also single-threaded. The experiments were performed on a computer with a 3.7 GHz Quad-Core Intel Xeon E5 and 32 GB RAM, running Mac OS X.

5.1 Glycan dataset

The results for edit distance with the Glycan dataset are shown in Table 1. “# of nodes” in the table means the total number of nodes of the two input trees. We randomly selected at most 100 input tree pairs from the Glycan dataset for each range of the total number of nodes. Avg and t.o. stand for the average execution time (in seconds) and the number of instances that timed out, respectively. The table shows that DpIP_Edit is much faster than IP_Edit. IP_DpClique_E is not faster than IP_Edit when the inputs are large, while IP_DpClique_E outperforms IP_Edit when the inputs are small trees. DpIP_Edit also outperforms IP_DpClique_E. This implies that adopting a dynamic programming approach alone is not sufficient to improve the practical performance, and that the revised IP formulation derived from the dynamic programming is of great importance for reducing the running time on the tree edit distance problem.

Table 2 shows the results for the variants of edit distance. For segmental distance and bottom-up segmental distance, the proposed methods (DpIP_Sg and DpIP_BotSg) finished within 1 second, while the naive methods (IP_Sg and IP_BotSg) often exceeded 30 seconds when the total size of the input trees was large. For bottom-up distance, the naive method (IP_Bot) was fast, as all instances were computed within 30 seconds. However, our improved method (DpIP_Bot) is still much faster than the naive method.

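| # of nodes | # of instances | IP_Edit avg (s) | IP_Edit t.o. | DpIP_Edit avg (s) | DpIP_Edit t.o. | IP_DpClique_E avg (s) | IP_DpClique_E t.o. |
|---|---|---|---|---|---|---|---|
| 50 - 54 | 100 | 2.393 | 0 | 0.308 | 0 | 0.994 | 0 |
| 55 - 59 | 100 | 4.661 | 0 | 0.417 | 0 | 1.576 | 0 |
| 60 - 64 | 88 | 11.661 | 6 | 0.576 | 0 | 2.894 | 0 |
| 65 - 69 | 36 | 17.774 | 4 | 0.669 | 0 | 3.433 | 0 |
| 70 - 74 | 100 | 13.209 | 7 | 0.654 | 0 | 11.799 | 7 |
| 75 - 79 | 29 | 20.771 | 9 | 0.823 | 0 | 11.411 | 7 |
| 80 - 84 | 9 | 18.705 | 8 | 1.094 | 0 | 14.941 | 6 |
| 85 - 89 | 5 | 0 | 5 | 1.330 | 0 | 21.838 | 3 |
| 90 - 94 | 4 | 0 | 4 | 1.442 | 0 | 0 | 4 |

Table 1: Experimental results with Glycan for edit distance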
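
| # of nodes | # of instances | IP_Sg avg (s) | IP_Sg t.o. | DpIP_Sg avg (s) | DpIP_Sg t.o. | IP_BotSg avg (s) | IP_BotSg t.o. | DpIP_BotSg avg (s) | DpIP_BotSg t.o. | IP_Bot avg (s) | IP_Bot t.o. | DpIP_Bot avg (s) | DpIP_Bot t.o. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 - 54 | 100 | 5.306 | 0 | 0.135 | 0 | 1.545 | 0 | 0.136 | 0 | 0.569 | 0 | 0.131 | 0 |
| 55 - 59 | 100 | 9.070 | 5 | 0.135 | 0 | 2.539 | 0 | 0.139 | 0 | 0.785 | 0 | 0.131 | 0 |
| 60 - 64 | 88 | 13.983 | 41 | 0.137 | 0 | 4.767 | 0 | 0.142 | 0 | 1.258 | 0 | 0.132 | 0 |
| 65 - 69 | 36 | 23.813 | 27 | 0.140 | 0 | 6.219 | 0 | 0.147 | 0 | 1.544 | 0 | 0.133 | 0 |
| 70 - 74 | 100 | 20.408 | 97 | 0.145 | 0 | 10.252 | 4 | 0.150 | 0 | 1.453 | 0 | 0.134 | 0 |
| 75 - 79 | 29 | 21.274 | 27 | 0.148 | 0 | 12.794 | 5 | 0.154 | 0 | 2.021 | 0 | 0.137 | 0 |
| 80 - 84 | 9 | 0 | 9 | 0.152 | 0 | 17.606 | 3 | 0.160 | 0 | 3.002 | 0 | 0.137 | 0 |
| 85 - 89 | 5 | 0 | 5 | 0.157 | 0 | 29.157 | 4 | 0.163 | 0 | 3.869 | 0 | 0.142 | 0 |
| 90 - 94 | 4 | 0 | 4 | 0.161 | 0 | 0 | 4 | 0.166 | 0 | 4.476 | 0 | 0.145 | 0 |

Table 2: Experimental results with Glycan for segmental distance, bottom-up segmental distance, and bottom-up distance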

5.2 CSLOGS Dataset

We divided CSLOGS dataset into two subsets: SUBLOG3 and SUBLOG49. Every tree in SUBLOG3 (resp. SUBLOG49) is restricted to have the maximum degree at most 3 (resp. 49). We randomly selected at most 100 pairs from each dataset with a specified range of the total number of nodes.

The results of computation for SUBLOG3 are shown in Tables 3 and 4. Tables 5 and 6 show the results for SUBLOG49. Compared to the results for SUBLOG3, the naive methods (IP_Edit, IP_Sg, IP_BotSg, and IP_Bot) work faster on SUBLOG49. This property was also observed in the previous work by Kondo et al. [11]. As for IP_DpClique_E, it outperforms IP_Edit when the degrees of the trees are small, though their performances are scarcely different on high-degree inputs.

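| # of nodes | # of instances | IP_Edit avg (s) | IP_Edit t.o. | DpIP_Edit avg (s) | DpIP_Edit t.o. | IP_DpClique_E avg (s) | IP_DpClique_E t.o. |
|---|---|---|---|---|---|---|---|
| 50 - 54 | 100 | 2.478 | 0 | 0.435 | 0 | 3.853 | 0 |
| 55 - 59 | 100 | 3.892 | 0 | 0.510 | 0 | 5.393 | 2 |
| 60 - 64 | 100 | 6.641 | 0 | 0.633 | 0 | 8.243 | 17 |
| 65 - 69 | 100 | 9.921 | 1 | 0.760 | 0 | 7.191 | 34 |
| 70 - 74 | 100 | 15.077 | 9 | 0.917 | 0 | 8.244 | 44 |
| 75 - 79 | 100 | 16.534 | 29 | 1.112 | 0 | 6.352 | 47 |
| 80 - 84 | 100 | 19.024 | 45 | 1.247 | 0 | 5.144 | 44 |
| 85 - 89 | 100 | 21.249 | 70 | 1.449 | 0 | 4.711 | 48 |
| 90 - 94 | 100 | 23.946 | 91 | 1.872 | 0 | 6.863 | 59 |
| 95 - 99 | 100 | 26.599 | 92 | 2.136 | 0 | 7.971 | 61 |

Table 3: Experimental results with SUBLOG3 for edit distance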
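
| # of nodes | # of instances | IP_Sg avg (s) | IP_Sg t.o. | DpIP_Sg avg (s) | DpIP_Sg t.o. | IP_BotSg avg (s) | IP_BotSg t.o. | DpIP_BotSg avg (s) | DpIP_BotSg t.o. | IP_Bot avg (s) | IP_Bot t.o. | DpIP_Bot avg (s) | DpIP_Bot t.o. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 - 54 | 100 | 5.978 | 0 | 0.136 | 0 | 1.970 | 0 | 0.140 | 0 | 0.568 | 0 | 0.131 | 0 |
| 55 - 59 | 100 | 10.208 | 7 | 0.136 | 0 | 2.922 | 0 | 0.141 | 0 | 0.764 | 0 | 0.132 | 0 |
| 60 - 64 | 100 | 13.791 | 31 | 0.141 | 0 | 5.245 | 0 | 0.145 | 0 | 1.076 | 0 | 0.134 | 0 |
| 65 - 69 | 100 | 18.372 | 57 | 0.144 | 0 | 6.562 | 1 | 0.148 | 0 | 1.390 | 0 | 0.135 | 0 |
| 70 - 74 | 100 | 20.195 | 75 | 0.146 | 0 | 8.513 | 15 | 0.151 | 0 | 1.856 | 0 | 0.137 | 0 |
| 75 - 79 | 100 | 22.485 | 87 | 0.149 | 0 | 11.003 | 10 | 0.154 | 0 | 2.372 | 0 | 0.138 | 0 |
| 80 - 84 | 100 | 22.865 | 91 | 0.150 | 0 | 12.489 | 18 | 0.157 | 0 | 3.031 | 0 | 0.139 | 0 |
| 85 - 89 | 100 | 26.028 | 94 | 0.154 | 0 | 14.864 | 25 | 0.160 | 0 | 3.746 | 0 | 0.140 | 0 |
| 90 - 94 | 100 | 26.866 | 98 | 0.158 | 0 | 17.244 | 48 | 0.167 | 0 | 4.861 | 0 | 0.144 | 0 |
| 95 - 99 | 100 | 0 | 100 | 0.160 | 0 | 18.644 | 57 | 0.170 | 0 | 5.808 | 0 | 0.147 | 0 |

Table 4: Experimental results with SUBLOG3 for segmental distance, bottom-up segmental distance and bottom-up distance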
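
| # of nodes | # of instances | IP_Edit avg (s) | IP_Edit t.o. | DpIP_Edit avg (s) | DpIP_Edit t.o. | IP_DpClique_E avg (s) | IP_DpClique_E t.o. |
|---|---|---|---|---|---|---|---|
| 50 - 54 | 100 | 1.275 | 0 | 0.263 | 0 | 1.643 | 0 |
| 55 - 59 | 100 | 2.323 | 0 | 0.317 | 0 | 3.014 | 0 |
| 60 - 64 | 100 | 4.032 | 0 | 0.395 | 0 | 5.452 | 3 |
| 65 - 69 | 100 | 4.756 | 0 | 0.402 | 0 | 6.721 | 6 |
| 70 - 74 | 100 | 6.231 | 1 | 0.450 | 0 | 7.188 | 10 |
| 75 - 79 | 100 | 8.808 | 10 | 0.567 | 0 | 9.787 | 19 |
| 80 - 84 | 100 | 11.850 | 6 | 0.583 | 0 | 10.037 | 28 |
| 85 - 89 | 100 | 12.429 | 21 | 0.665 | 0 | 10.145 | 34 |
| 90 - 94 | 100 | 13.595 | 33 | 0.678 | 0 | 11.228 | 34 |
| 95 - 99 | 100 | 15.711 | 30 | 0.829 | 0 | 12.084 | 39 |

Table 5: Experimental results with SUBLOG49 for edit distance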
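
| # of nodes | # of instances | IP_Sg avg (s) | IP_Sg t.o. | DpIP_Sg avg (s) | DpIP_Sg t.o. | IP_BotSg avg (s) | IP_BotSg t.o. | DpIP_BotSg avg (s) | DpIP_BotSg t.o. | IP_Bot avg (s) | IP_Bot t.o. | DpIP_Bot avg (s) | DpIP_Bot t.o. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 - 54 | 100 | 2.130 | 0 | 0.143 | 0 | 0.739 | 0 | 0.142 | 0 | 0.376 | 0 | 0.130 | 0 |
| 55 - 59 | 100 | 4.704 | 0 | 0.147 | 0 | 1.521 | 0 | 0.145 | 0 | 0.514 | 0 | 0.133 | 0 |
| 60 - 64 | 100 | 6.795 | 11 | 0.151 | 0 | 2.863 | 3 | 0.150 | 0 | 0.707 | 0 | 0.153 | 0 |
| 65 - 69 | 100 | 7.741 | 8 | 0.162 | 0 | 2.544 | 1 | 0.154 | 0 | 0.830 | 0 | 0.135 | 0 |
| 70 - 74 | 100 | 9.277 | 19 | 0.158 | 0 | 3.257 | 2 | 0.159 | 0 | 1.036 | 0 | 0.139 | 0 |
| 75 - 79 | 100 | 12.421 | 38 | 0.162 | 0 | 5.143 | 6 | 0.162 | 0 | 1.376 | 0 | 0.139 | 0 |
| 80 - 84 | 100 | 12.707 | 39 | 0.167 | 0 | 5.788 | 7 | 0.169 | 0 | 1.644 | 0 | 0.142 | 0 |
| 85 - 89 | 100 | 14.817 | 46 | 0.170 | 0 | 7.136 | 3 | 0.176 | 0 | 2.129 | 0 | 0.144 | 0 |
| 90 - 94 | 100 | 13.267 | 65 | 0.175 | 0 | 8.479 | 8 | 0.179 | 0 | 2.361 | 0 | 0.147 | 0 |
| 95 - 99 | 100 | 16.752 | 65 | 0.181 | 0 | 8.776 | 16 | 0.184 | 0 | 2.881 | 0 | 0.148 | 0 |

Table 6: Experimental results with SUBLOG49 for segmental distance, bottom-up segmental distance and bottom-up distance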

We can observe that the proposed methods (DpIP_Edit, DpIP_Sg, DpIP_BotSg, and DpIP_Bot) remarkably improve on the previous methods (IP_Edit, IP_Sg, IP_BotSg, and IP_Bot): for the three variants, most instances are computed within 0.2 seconds, and the average time of DpIP_Edit stays below 2.2 seconds in every size range. In order to measure the scalability of the proposed methods, we used a wider range of the dataset. We selected input tree pairs so that the total number of nodes ranges from around 0 to around 850. The results are shown in Fig. 2. For segmental distance and bottom-up segmental distance, the smallest instance which exceeds our time limit of 30 seconds appears when the total number of nodes is in the range 450 - 500, whereas for tree edit distance it appears when the total number of nodes is in the range 150 - 200. For bottom-up distance, all instances selected in this experiment are solved within 7 seconds.

Figure 2: The crosses, triangles, circles and squares represent the instances of the edit distance, segmental distance, bottom-up segmental distance, and bottom-up distance problem, respectively.

6 Conclusion and Discussion

We have proposed improved methods for computing the tree edit distance and its variants. While the naive IP formulation proposed by Kondo et al. [11] has $O(n^2m^2)$ constraints, our efficient IP formulation, though it consists of $O(nm)$ subproblems, has only $O(n + m)$ constraints per subproblem. In the case of segmental distance, bottom-up segmental distance and bottom-up distance, each subproblem, except for the problem of combining the solutions of subproblems, can be reduced to the maximum weighted matching problem in a bipartite graph, which can be solved in polynomial time.

We performed experiments using real tree-structured datasets. While the previous method only works for small trees, our methods are still effective for large trees. In particular, for segmental distance and bottom-up segmental distance, our methods are applicable to trees whose total size is up to 450, and for bottom-up distance, every instance is solved within 7 seconds.

An advantage of the IP-based method is that we can easily give an IP formulation for another distance by adding some constraints to the IP formulation for edit distance. Therefore, extending our method to other important distance measures between unordered trees, such as the tree alignment distance [8], is left as future work. It would also be interesting to develop practical algorithms for computing those distances without using general-purpose solvers such as IP solvers or SAT solvers.

References

  1. Akutsu, T., Fukagawa, D., Halldorsson, M.M., Takasu, A., Tanaka, K.: Approximation and parameterized algorithms for common subtrees and edit distance between unordered trees. Theoretical Computer Science 470, 10–22 (2013)
  2. Akutsu, T., Fukagawa, D., Takasu, A., Tamura, T.: Exact algorithms for computing the tree edit distance between unordered trees. Theoretical Computer Science 412(4-5), 352–364 (2011)
  3. Akutsu, T., Tamura, T., Fukagawa, D., Takasu, A.: Efficient exponential-time algorithms for edit distance between unordered trees. Journal of Discrete Algorithms 25, 79–93 (2014)
  4. Demaine, E.D., Mozes, S., Rossman, B., Weimann, O.: An optimal decomposition algorithm for tree edit distance. ACM Transactions on Algorithms 6(1), 1–19 (2009)
  5. Fukagawa, D., Tamura, T., Takasu, A., Tomita, E., Akutsu, T.: A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures. BMC Bioinformatics 12(Suppl 1), S13 (2011)
  6. Higuchi, S., Kan, T., Yamamoto, Y., Hirata, K.: An A* Algorithm for Computing Edit Distance between Rooted Labeled Unordered Trees. In: New Frontiers in Artificial Intelligence, pp. 186–196. Springer Berlin Heidelberg (2012)
  7. Horesh, Y., Mehr, R., Unger, R.: Designing an A* Algorithm for Calculating Edit Distance between Rooted-Unordered Trees. Journal of Computational Biology 13(6), 1165–1176 (2006)
  8. Jiang, T., Wang, L., Zhang, K.: Alignment of trees — an alternative to tree edit. Theoretical Computer Science 143(1), 137–148 (1995)
  9. Kan, T., Higuchi, S., Hirata, K.: Segmental Mapping and Distance for Rooted Labeled Ordered Trees. In: Algorithms and Computation, pp. 485–494. Springer Berlin Heidelberg (2012)
  10. Kanehisa, M., Goto, S.: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28(1), 27–30 (2000)
  11. Kondo, S., Otaki, K., Ikeda, M., Yamamoto, A.: Fast Computation of the Tree Edit Distance between Unordered Trees Using IP Solvers. In: Discovery Science, pp. 156–167. Springer International Publishing (2014)
  12. Kuboyama, T.: Matching and Learning in Trees. Ph.D. thesis, The University of Tokyo (2007)
  13. Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83–97 (1955)
  14. Mori, T., Tamura, T., Fukagawa, D., Takasu, A., Tomita, E., Akutsu, T.: A Clique-Based Method Using Dynamic Programming for Computing Edit Distance Between Unordered Trees. Journal of Computational Biology 19(10), 1089–1104 (2012)
  15. Nakamura, T., Tomita, E.: Efficient algorithms for finding a maximum clique with maximum vertex weight (in Japanese). Technical Report, the University of Electro-Communications (2005)
  16. Tai, K.C.: The Tree-to-Tree Correction Problem. Journal of the ACM 26(3), 422–433 (1979)
  17. Valiente, G.: An efficient bottom-up distance between trees. In: Proceedings Eighth Symposium on String Processing and Information Retrieval. IEEE (2001)
  18. Zaki, M.: Efficiently mining frequent trees in a forest: algorithms and applications. IEEE Transactions on Knowledge and Data Engineering 17(8), 1021–1035 (2005)
  19. Zhang, K., Jiang, T.: Some MAX SNP-hard results concerning unordered labeled trees. Information Processing Letters 49(5), 249–254 (1994)
  20. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unordered labeled trees. Information Processing Letters 42(3), 133–139 (1992)