# Error Tree: A Tree Structure for Hamming & Edit Distances & Wildcards Matching

###### Abstract

Error Tree is a novel tree structure mainly oriented to solving the approximate pattern matching problems, Hamming distance and edit distance, as well as the wildcards matching problem. The input is a text T of length n over a fixed alphabet of size σ, a pattern P of length m, and an integer k. The output is all positions in T at which a substring matches P with Hamming distance, edit distance, or wildcards matching of at most k. For Hamming distance and wildcards matching, the algorithm proposes a tree structure that needs O(n log^k n) words and takes O(m^k + occ) query time (less in the average case) for any online/offline pattern, where occ is the number of outputs. As well, a tree structure with moderately larger space and query time is proposed for edit distance, for any online/offline pattern.

## 1 Introduction

Amid the increasing growth of internet-based searching, information retrieval, data mining applications, and bioinformatics research, there is an increasing need to determine whether a given pattern occurs as an exact match, an approximate match, or a wildcards match in a given database. The pattern is usually of small size, such as words or sentences, while the database is of much larger size, such as web documents, genomes, and books.

The exact matching problem is the simplest form of the pattern matching problems, while approximate and wildcards matching are more complicated. For the exact matching problem, [15] proposed a tree structure that was improved by [12] and [14], leading to an optimal solution with a linear-size tree structure and linear query time.

Approximate matching comprises two problems: Hamming distance and edit distance (also known as Levenshtein distance). The Hamming distance between two strings is the minimal number of substitution operations required to transform the first string into the second string, while the edit distance is the minimal number of substitution, insertion, or deletion operations required to transform the first string into the second string. The wildcards matching problem, on the other hand, arises when the pattern contains a wildcard, also known as a don't-care character (written here as *), which can match any character of the alphabet. For these matching problems, no solution that is linear in both structure size and query time is yet known.
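As a concrete reference, the three matching notions can be sketched in a few lines of Python (the function names are ours, and * is used as the wildcard symbol):

```python
def hamming(a, b):
    # minimal number of substitutions turning a into b (equal lengths)
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def edit(a, b):
    # Levenshtein distance via the classic dynamic program
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def wildcards_match(p, t):
    # '*' in the pattern matches any single character of the text
    return len(p) == len(t) and all(x == '*' or x == y for x, y in zip(p, t))
```

For example, hamming("karolin", "kathrin") and edit("kitten", "sitting") are both 3, and wildcards_match("ba*ana", "banana") holds.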

In this paper we are considering the following problems with the following inputs and outputs:

Problem 1: Dictionary Matching

Inputs: A database D of d strings, each of length m symbols over a finite alphabet of size σ; a pattern P of length m; and an integer k.

Outputs: All strings in the database that match P with Hamming distance at most k (k-Hamming distance), edit distance at most k (k-edit distance), or wildcards matching with at most k wildcards (k-wildcards matching).

Problem 2: Text Matching

Inputs: A text T of n symbols over a finite alphabet of size σ, a pattern P of length m, and an integer k.

Outputs: All positions of the substrings in T that match P with k-Hamming distance, k-edit distance, or k-wildcards matching.

Note that problem 1 is a simpler case of problem 2. This paper first describes the data structure for problem 1, as it is easier to describe than problem 2; the design for problem 2 follows.

In this paper, an algorithm is introduced for a novel tree structure that leads to efficient bounds for the above problems.

Outline: Section 2 surveys the background and related work of the problems. Then, section 3 states the preliminaries. Next, section 4 shows the design of the error tree structure for the first problem. After that, the modified design for problem 2 will be shown in section 5. Finally, section 6 states the conclusions.

## 2 Background and Related Work

For the text searching problem, the naive algorithm takes O(nm) time. A faster algorithm, the Kangaroo method, was proposed by [10]: it builds a lowest-common-ancestor structure [7] that allows each alignment to be verified in O(k) time, hence the total time cost is O(nk). A better algorithm was proposed by [1], which computes all alignments in O(n√(m log m)) time. A more recent algorithm, proposed by [2], takes O(n√(k log k)) time. Note that all the bounds above are at least linear in n.

Recently, [6] proposed a data structure that takes O(n (c₁ log n)^k / k!) words and finds the k-Hamming distance matches in O(m + (c₂ log n)^k / k! · log log n + occ) time, as well as a structure of similar size that finds the k-edit distance matches within the same bounds up to constant factors, where c₁ and c₂ are constants. The data structure of [5] can also output the k-edit distance matches; the construction of their index needs superlinear space in the average case and takes O(t) time, where t is the number of nodes in the index. An index with larger space was proposed by [13], where the space complexity is O(n^(1+ε)) for any constant ε > 0, but with a fast query time for both k-Hamming distance and k-edit distance.

Many algorithms solve both distances using less structure space at the price of higher query time. Among these algorithms, [4] presented a linear-space index of O(n) words, but with a query time that grows quickly with k. In [9], a succinct data structure was proposed whose space is counted in bits rather than words; with even fewer bits of space, the query time grows further. Using more space, [8] showed an index structure, also measured in bits, with a lower query time; they also reduced the index space at the price of increasing the query time by a multiplicative factor.

For the k-wildcards matching problem, the time and space costs are somewhat lower than for the distance problems. The design of the error tree structure in this paper solves k-wildcards matching within the same bounds as k-Hamming distance. Several structures have been proposed for this problem, such as [6], [3], and [11]. [6] proposed a structure of O(n log^k n) words that solves the problem in O(m + 2^k log log n + occ) time, while [3] generalized the structure of [6] and reduced the space, with a corresponding increase in query time. A smaller structure was presented in [11], where the needed space is counted in bits, with a slight increase in query time.

Error Tree is a novel tree structure mainly oriented to solving the aforementioned problems: first, a tree structure that costs O(n log^k n) words for Hamming distance and wildcards matching and takes O(m^k + occ) query time (less in the average case) for any online/offline pattern; as well, a tree structure with moderately larger space and query time for edit distance of any online/offline pattern.

## 3 Preliminaries

For a string s, len(s) is the length of s, s[i : j] is the substring from position i to position j, and suff(s, i) is the suffix of s starting at position i. For a list A, A[i] is the item at index i, A[-1] is the item at the last index, and similarly A[-2] is the item before the last item in A.
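In Python-style notation (used for the illustrative sketches in this paper; suff is our helper name), these conventions read:

```python
def suff(s, i):
    # suffix of s starting at (0-indexed) position i
    return s[i:]

s = "mississippi"
assert len(s) == 11          # the length of s
assert s[1:5] == "issi"      # substring from position 1 up to position 5
assert suff(s, 7) == "ippi"  # suffix starting at position 7

A = [10, 20, 30]
assert A[-1] == 30           # item at the last index
assert A[-2] == 20           # item before the last item
```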

## 4 Dictionary Matching Algorithm

In order to explain the steps of the algorithm, the paper first shows the steps for k = 1 and then explains the general design for k ≥ 2.

### 4.1 Case k = 1

The algorithm involves two stages: construction of a tree, then searching for the strings that have Hamming distance exactly 1 with P, a.k.a. strings with 1-Hamming distance or 1-mismatch.

#### 4.1.1 Construction stage

1. First, a generalized suffix tree [14] needs to be built for the strings in the database. By the definition of the suffix tree, we will have a leaf node for each suffix of each string in the database. Cost: O(N) words, where N is the total length of the strings.

2. All leaves and internal nodes in the suffix tree are assigned a unique key. Cost: O(N) words.

Definition 1: A suffix tree in which every leaf node and every internal node is identified by a unique key is called a keyed suffix tree (KST).

Definition 2: For a string s and a given KST, the function All Visited Nodes, denoted AVN(s), returns the ordered list of node keys and edge lengths that results from walking s in the KST.

Corollary 1: For the case k = 1, since in problem 1 we already have leaves for all suffixes of the strings in the database, we can find AVN(s)[-1] for any suffix s of any string in the database in constant time, because AVN(s)[-1] is the key of the last node visited after traversing the suffix in the KST, which must be a leaf node. This can be done by hashing all the leaves and returning AVN(s)[-1] in constant time without walking the KST.

Corollary 2: Given a KST and two strings s1 and s2, where len(s1) = len(s2) and s1 is in the KST: if Hamming distance(s1, s2) = 1 and the mismatch occurs at position x, where x is not the last position, then AVN(suff(s1, x+1)) = AVN(suff(s2, x+1)); in particular, AVN(suff(s1, x+1))[-1] = AVN(suff(s2, x+1))[-1].

3. Construct a compact trie of all the strings in the database: O(N) time and space.

4. For each internal node v, a hash table I_1 is initialized. Then, with L the set of leaves of the subtree rooted at v, and assuming v is at level (symbol depth) i, for each l in L we first pick any string s labeled at l, then add to I_1 the tuple (AVN(suff(s, i+1))[-1], l). So:

```
For each internal node v in the tree:
    initialize a hash table I_1
    i = get_level(v)          // the level (symbol depth) of v
    L = get_desc_leaves(v)    // the descendant leaves of v
    For l in L:
        pick a string s labeled at l
        v.I_1.add((AVN(suff(s, i+1))[-1], l))
```

Definition 3: We call such a tree structure, without loss of generality, a 1-error tree, as it was constructed to find one mismatch.

Regarding the space cost: we perform step 4 for all internal nodes, and at each node the work is bounded by the number of descendant leaves. If the 1-error tree is unbalanced, we do not perform step 4 on the leaves under a branch that lies on a heavy path (the path with the most descendant leaves); namely, at each node with several branches, all descendant leaves of the branch on the heavy path are excluded from step 4 and are treated in the query stage as an edge. This means that a balanced trie is the worst-case scenario. So, the bound will be O(N log N) words of space.

#### 4.1.2 Query stage

First, given the pattern P, all its suffixes need to be added to the KST, and AVN(·)[-1] computed for each suffix. The result is the following list A:

{AVN(suff(P, 1))[-1], AVN(suff(P, 2))[-1], …, AVN(suff(P, m))[-1]}.

Second, we walk P in the 1-error tree as follows. If the walking is on an edge and the next symbol in P matches the next symbol on the edge, then continue as an exact match. If the next symbol in P does not match the next symbol at level i, this means that we have reached one mismatch; we can jump over the next symbol (since the walking is on an edge) and continue as an exact match until we reach a leaf, if any, and output the strings labeled at that leaf as having 1-Hamming distance with the mismatch at position i.

Now, if the walking reaches a node v at level i, then look up whether the corresponding key from A is in the table I_1 of v (constant time, as I_1 is a hash table). If yes, all strings labeled at the leaves associated with that key have 1-Hamming distance with P at position i; then continue as an exact match while still searching for the k = 1 mismatch. If the next symbol in P does not match any child of v, then stop searching.
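The construction and query stages for k = 1 can be illustrated end to end with a small self-contained Python sketch. It uses an uncompacted trie, simulates KST leaf keys with a hash table of suffix strings, and stores alongside each table entry the symbol skipped at the node so that exact matches (distance 0) can be filtered out; all names are ours, and the sketch makes no attempt to reproduce the paper's space bounds:

```python
class Node:
    def __init__(self):
        self.children = {}
        self.I1 = {}  # suffix key -> list of (label, symbol skipped here)

def build_1_error_tree(strings):
    keys = {}  # simulates the unique leaf keys of the KST
    suffix_key = lambda s: keys.setdefault(s, len(keys))
    root = Node()
    for label, s in enumerate(strings):
        node = root
        for d, ch in enumerate(s):
            # if position d is the single mismatch, the rest must match exactly
            node.I1.setdefault(suffix_key(s[d + 1:]), []).append((label, ch))
            node = node.children.setdefault(ch, Node())
    return root, keys

def query_1_mismatch(root, keys, p):
    hits, node = [], root
    for d, ch in enumerate(p):
        k = keys.get(p[d + 1:])
        if k is not None:
            for label, skipped in node.I1.get(k, []):
                if skipped != ch:  # exclude distance-0 (exact) matches
                    hits.append((label, d))
        if ch not in node.children:
            break  # exact walk stops; earlier table hits are already recorded
        node = node.children[ch]
    return sorted(hits)
```

For the dictionary ["abcde", "abfde", "xbcde", "abcdz"] and pattern "abcde", the query reports labels 1, 2, and 3 with mismatch positions 2, 0, and 4, respectively.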

#### 4.1.3 Extension for indels

The design can be extended to handle the operations of insertion and deletion, which means we can output all strings that have edit distance k = 1 with P, instead of only the Hamming distance.

First of all, because insertions and deletions cause shifts in the suffixes, such shifts must be tracked and handled by the design of the algorithm, mainly by the AVN function.

If two strings s1 and s2 have edit distance 1 caused by a deletion operation at position x, then the suffixes of the two strings align with a shift of one position from x onward. Now, as the AVN function starts at the root node and ends at a node that should have a unique key, the design should guarantee that. The shifted suffix ends one (= k) level above the leaf of the corresponding unshifted suffix, and that position may cause a conflict because it may not be a leaf node. Therefore, this position must be guaranteed to be a leaf node with a unique key. Thus, the following preprocessing step must be performed.

1. For each internal node v in the 1-error tree, and for each leaf l of the subtree rooted at v, assuming v is at level i: first pick any string s labeled at l, then walk up one (= k) level from the parent of the leaf node of suff(s, i+1). If we reach a node, say u, check whether u has a leaf node as a child; if not, create a new leaf node with a unique key. If there is no node at that level, a new node with a unique key is created, and as a child of this new node a leaf node with a unique key is created. The cost will be O(N log N) space and time. This will help to track the effects of shifting the suffixes caused by deletions and insertions.

Insertions and deletions can occur in the pattern or in the strings. Before explaining the four cases, we introduce the following corollaries, which use the fact that step 1 was already performed:

Corollary 3: Given a KST and two strings s1 and s2, where len(s1) = len(s2) + 1 = m and s1 is in the KST. If edit distance(s1, s2) = 1 and the edit operation is a deletion at position x, where x is not the last position, then AVN(suff(s1, x))[1 : m−x−1] = AVN(suff(s2, x)); similarly, AVN(suff(s1, x))[-2] = AVN(suff(s2, x))[-1]. Note that the walk of suff(s1, x) truncated at [1 : m−x−1] will always end at a leaf node (which must have a unique key) because of step 1.

Corollary 4: Given a KST and two strings s1 and s2, where len(s1) + 1 = len(s2) = m and s1 is in the KST. If edit distance(s1, s2) = 1 and the edit operation is an insertion at position x, where x is not the last position, then AVN(suff(s2, x+1)) = AVN(suff(s1, x)[1 : m−x−1]). Note also that suff(s1, x)[1 : m−x−1] will always end at a leaf node (which must have a unique key) because of step 1, so similarly AVN(suff(s2, x+1))[-1] = AVN(suff(s1, x))[-2].
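The identities behind corollaries 3 and 4 rest on simple string alignments, which can be checked directly on plain strings (a minimal sketch; position x is 0-indexed here):

```python
s1 = "pattern"
x = 3

# deletion: s2 is s1 with the symbol at position x removed
s2 = s1[:x] + s1[x + 1:]                  # "patern"
# the suffix of s2 starting at x equals the suffix of s1 starting at x + 1,
# which is the alignment that corollary 3 exploits
assert s2[x:] == s1[x + 1:]

# insertion: s3 is s2 with an extra symbol inserted at position x
s3 = s2[:x] + "X" + s2[x:]                # "patXern"
# the suffix of s3 starting at x + 1 equals the suffix of s2 starting at x,
# the symmetric alignment behind corollary 4
assert s3[x + 1:] == s2[x:]
```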

2. For the edit distance, the following four cases are possible:

Case 1: Deletion in the strings. Using the table I_1 we can handle this case. Based on corollary 3, we check whether AVN(suff(P, i))[-2] is in I_1 or not. If yes, then all strings labeled at leaves associated with the key AVN(suff(P, i))[-2] have edit distance 1 with P, with a deletion in them at position i.

Case 2: Insertion in the strings. For this case, another hash table, call it I_1', needs to be initialized. Then step 4 of section 4.1.1 is performed, but instead of adding (AVN(suff(s, i+1))[-1], l) into I_1, the tuple (AVN(suff(s, i+1))[-2], l) is added to I_1'. Note that because of step 1, AVN(suff(s, i+1))[-2] will always be a leaf node with a unique key. This allows checking, based on corollary 4, whether AVN(suff(P, i))[-1] is in I_1' or not. If yes, then all strings labeled at leaves associated with that key have edit distance 1 with P, with an insertion in them at position i.

Proceeding to the next two cases, note that a deletion in the strings is similar to an insertion in the pattern. Likewise, an insertion in the strings is similar to a deletion in the pattern.

Case 3: Deletion in the pattern. There is no need to modify the construction of the 1-error tree; this case can be computed by searching for AVN(suff(P, i))[-1] in I_1'.

Case 4: Insertion in the pattern. This case can be computed by searching for AVN(suff(P, i))[-2] in I_1.

In conclusion, at each internal node we will have the hash tables covering, among them, the operations of mismatch, insertion, and deletion. The cost of step 1 will be O(N log N) words of space and time. The cost of step 2, which is computing I_1', will be the same as computing I_1 in step 4 of section 4.1.1, namely O(N log N) words of space.
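Taken together, the four cases amount to the standard one-edit test between two strings. A direct, structure-free check looks like the sketch below; the error-tree tables answer the same questions without scanning every string:

```python
def one_edit_away(s, p):
    # True iff the edit distance between s and p is exactly 1
    if abs(len(s) - len(p)) > 1 or s == p:
        return False
    i = 0
    while i < min(len(s), len(p)) and s[i] == p[i]:
        i += 1  # i is the first position where s and p differ
    if len(s) == len(p):
        return s[i + 1:] == p[i + 1:]   # one substitution (mismatch)
    if len(s) == len(p) + 1:
        return s[i + 1:] == p[i:]       # deletion in the pattern
    return s[i:] == p[i + 1:]           # insertion in the pattern
```

For example, one_edit_away("abc", "abd"), one_edit_away("abc", "ab"), and one_edit_away("ab", "abc") all hold, while equal strings and strings at distance 2 are rejected.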

### 4.2 Case k ≥ 2

In the case k = 1, the main step in the design is that we associate the key of the last node (a leaf), AVN(·)[-1], of each suffix with the leaf labels. In the case k ≥ 2, we will associate the keys of all the nodes returned by AVN(·) for a suffix with all the tuples that contain the label l, taken from the tables of the nodes on the path of l in the (k−1)-error tree. Before describing the steps of the design, we state the following corollary:

Corollary 5: Given a KST and two strings s1 and s2, where len(s1) = len(s2) and s1 is in the KST. If Hamming distance(s1, s2) = k, the mismatches occur at positions x1 < x2 < … < xk, and at each level of these positions on the path to s1 there is a node, then for every j ≤ k the walks of the two strings agree between consecutive mismatch positions: AVN(suff(s1, xj + 1)) and AVN(suff(s2, xj + 1)) visit the same nodes up to the next mismatch position. Equivalently, the corresponding node keys in the two lists are equal; in particular, AVN(suff(s1, xk + 1))[-1] = AVN(suff(s2, xk + 1))[-1].

#### 4.2.1 Construction stage

1. The first step is to collect the node keys in the KST. We do the following for each internal node v:

1.1 At node v, assuming v is at level i, we know the descendant leaves L under that node. We first initialize a hash table I_k; then for each leaf l in L we pick any string s labeled at l and compute AVN(suff(s, i+1)). Note that in the computation of AVN(·), if the walking is on an edge, AVN(·) returns the length of that edge and a tag indicating that we are on an edge; if we walk through a node, it returns the key of that node and a tag indicating that we are at a node. Note also that the total number of items collected stays bounded, since the balanced trie is the worst-case scenario for the design.

1.2 After computing AVN(suff(s, i+1)), walk in the k-error tree toward the leaf l with a skip of one level, and do the following:

```
if the next node u in AVN(suff(s, i+1)) is aligned in the middle of an edge:
    v.I_k.add(((u.key(), edge), l))
if the next node u1 in AVN(suff(s, i+1)) is aligned with a node u2:
    for each tuple p in u2.I_{k-1} that has l:
        v.I_k.add(((u1.key(), p[1], ..., p[k]), l))
```

Note that we will not walk explicitly to leaf l. We check the alignment between a visited node on the path and a node in AVN(suff(s, i+1)), or the length of an edge we visit on the path against the edge lengths in AVN(suff(s, i+1)), which is a simple convolution. So, we will have the following cases while walking to leaf l in the k-error tree:

Case 1: The next node u in AVN(suff(s, i+1)) is aligned in the middle of an edge in the k-error tree. Then add to I_k of v a tuple containing the key of u, a tag indicating that the alignment was on an edge, and l.

Case 2: The next node u1 in AVN(suff(s, i+1)) is aligned with a node u2 while walking in the k-error tree. Then, for each tuple p that contains l in I_{k−1} of u2, we associate, as a tuple, the key of u1 with the items of p, in their order, and add the tuple into I_k of v.

At the end of the walking, note that each node key may get associated with (multiplied into) the tuples that are in I_{k−1} during the walk to leaf l. So eventually, each additional error level multiplies the cost by a logarithmic factor, and the bound sums up to O(N log^k N) words of space.

1.3 Steps 1.1 and 1.2 were performed for suff(s, i+1), on the tables I_{k−1}, with a skip of one level. Similarly, we do the same steps for suff(s, i+2), …, suff(s, i+k−1) on the tables I_{k−2}, …, I_1, with skips of 2, …, k−1 levels, respectively.

2. So far, we have covered the cases where the errors at each internal node are spread along the path. But this does not cover the case where all of the first k symbols after the internal node are errors. For this case, we need to perform step 4 of section 4.1.1 for the suffixes suff(s, i+k) instead of suff(s, i+1), as in the case k = 1. The cost for this case will be, as in the case k = 1, O(N log N) words of space.

Summing up, the cost of constructing the k-error tree, for any k, is O(N log^k N) words.

#### 4.2.2 Query stage

When k = 1, the number of possible error positions is m, as m is the length of P. When k ≥ 2, the number of possible combinations of error positions is C(m, k), which is bounded by O(m^k).

Before stating the steps, we need to describe the following cases:

Case 1: Walking a suffix in the KST diverges at an internal node. The design of the algorithm already covers this case, as all internal nodes of the KST were already marked back in the error tree.

Case 2: Walking a suffix in the KST diverges in the middle of an edge. In this case, we allow jumping (skipping the next symbol on the edge) up to k times during the walk on that edge (and/or any subsequent edge). If after the jumps the walking ends at a leaf, then deduct the number of jumps performed from the mismatch budget during the searching process. If the walking after the jumps ends at an internal node, this case is similar to case 1, again deducting the number of jumps performed from the mismatch budget. If after exactly k jumps we did not reach a node (internal or leaf), then P has no outputs at all within k mismatches against any string in the database, as one of its suffixes could not reach a leaf or an internal node even after allowing k jumps (where jumps represent mismatches). Note that counting the jumps happens only on edges and never at internal nodes, as the design already marks the internal nodes back in the error tree, and the jumps (assumed errors) after these internal nodes are already accounted for.

As a result of these cases, we define the following function:

Definition 4: For a string s, an integer k, and a given KST, the function All Visited Nodes with k jumps, denoted AVNJ(s, k), returns the ordered list of node keys and edge lengths that results from walking s in the KST while allowing up to k jumps (in case of mismatch) on edges only, together with the positions of the jumps, if any occurred.
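The jump mechanism of Definition 4 can be sketched with a naive trie search that spends a budget of k jumps on mismatching symbols and records their positions (the names and the uncompacted-trie representation are ours; the real structure performs this walk on the edges of the error tree):

```python
def build_trie(strings):
    root = {}
    for label, s in enumerate(strings):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(label)
    return root

def search_with_jumps(node, p, d=0, budget=1, jumps=()):
    # returns (label, jump positions) for strings within `budget` mismatches of p
    out = []
    if d == len(p):
        out += [(label, jumps) for label in node.get("$", [])]
        return out
    for ch, child in node.items():
        if ch == "$":
            continue
        if ch == p[d]:
            out += search_with_jumps(child, p, d + 1, budget, jumps)
        elif budget > 0:  # spend one jump on this mismatching symbol
            out += search_with_jumps(child, p, d + 1, budget - 1, jumps + (d,))
    return out
```

With the dictionary ["abcd", "abxd", "zbcd"] and pattern "abcd", a budget of one jump reports the exact match with no jumps and the two 1-mismatch strings with their jump positions.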

1. We collect AVNJ(s, k) for each suffix s of the pattern. After that, we will have m lists; let us call this collection A.

2. Walk the pattern in the k-error tree; at each internal node v, assume v is at level i. We search whether I_k contains any of the combinations of keys that can be extracted from the collection A. If so, report the leaf labels associated with that key combination as output. If we walk on an edge, we skip (jump) over mismatches that we may encounter, in a simple convolution.

The bound for querying k mismatches is therefore O(m^k) time. Note that in the average case, computing AVN(·) for all suffixes traverses far fewer nodes, which yields a correspondingly lower average-case bound.
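The combination count above is the usual binomial bound; for instance (values chosen arbitrarily):

```python
from math import comb

m, k = 12, 3
positions = comb(m, k)       # ways to choose k error positions among m
assert positions == 220
assert positions <= m ** k   # C(m, k) is bounded by m^k
```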

#### 4.2.3 Extension for indels

In order to handle the insertions and the deletions for any k, we will need to consider the following modifications:

1. Guarantee that we have leaf nodes at the k levels above every leaf in the KST. For this, we visit the k levels above each leaf and perform step 1 of section 4.1.3. Note that during the construction of the 1-error tree through the (k−1)-error tree, we must already have created leaf nodes for each of the cases 1 to k−1.

2. For insertions: at each internal node, where i is the level of the node, we need to perform the steps of section 4.2.1 for suff(s, i)[1 : m−i−k−1] instead of suff(s, i)[1 : m−i−1], then add the results into the corresponding table.

3. Likewise, we need to perform step 2 of section 4.2.1, but for suff(s, i)[1 : m−i−k].

4. Note that the edit distance can be any combination of substitutions, deletions, and insertions. For this, we perform step 1.2 of section 4.2.1 for all the tables at a node, not only I_{k−1}, then add the results into a hash table. This adds extra space bounded by a factor depending on k, so the total cost of building a k-error tree that handles edit distance is of the same order as for Hamming distance, up to that factor.

5. The number of key combinations to search for increases by a factor exponential in k, since each error position can now be a substitution, an insertion, or a deletion; hence the query time for edit distance grows by the same factor (and is again lower in the average case).

## 5 Algorithm Design for Text Indexing

The design and construction of the error tree for problem 2 are similar to those for problem 1, with some differences. Here we describe these differences and the preprocessing steps needed to resolve them, in order to apply the same design as in problem 1.

1. The depths of the paths in the suffix tree are not all equal to m. Portions of paths deeper than m are useless when we search for a pattern of length m; moreover, they add cost during backward traversal of the tree and during the creation of new nodes.

2. In problem 1, we already have leaf nodes for each suffix of the strings. In problem 2 this is not the case, because we have just one text string, unlike problem 1; the leaves for the suffixes of each suffix (or, specifically, of each m-mer) are not explicitly constructed.

Now, in order to resolve these issues, we perform the following:

1. All paths in the suffix tree must have depth m. For this, we traverse all paths in the suffix tree and count the depth of each path by summing the lengths of the edges on it. When depth m is reached: if that point is already a node, trim all edges/nodes below that node and store explicitly the labels of the descendant leaves; if that point is on an edge, create a leaf node at that point, then copy and store explicitly the labels of the descendant leaves of its sink node, and lastly trim the edge below that point. The cost of this step is linear time and space, since we do not need to read the edge symbols (reading the length of each edge takes constant time), and the number of newly created nodes is linear. We call such a suffix tree a Trimmed Suffix Tree of depth m, denoted TST_m.
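A naive stand-in for TST_m (the names are ours; the real construction trims an O(n)-size suffix tree rather than inserting every window) indexes each length-m substring of the text and stores the descendant position labels:

```python
def trimmed_suffix_trie(text, m):
    root = {}
    for i in range(len(text) - m + 1):
        node = root
        for ch in text[i:i + m]:
            node = node.setdefault(ch, {})
        node.setdefault("$", []).append(i)  # positions labeled at this leaf
    return root

def positions_of(root, pattern):
    # exact-match positions: walk the pattern, then collect labels below
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        out += n.get("$", [])
        stack += [child for key, child in n.items() if key != "$"]
    return sorted(out)
```

For text "bananas" and m = 3, positions_of returns [1, 3] for "ana" and [2, 4] for "na".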

2. Starting from TST_m, we need to mark/tag the suffixes of these suffixes similarly to the design of problem 1. Note that in problem 1 not all suffixes of the strings were considered in the design, since we only computed AVN(·) for the suffixes of the descendant leaves under the internal nodes; thus only those considered suffixes, not all of them, are involved.

Now, in order to resolve this, note that after performing step 1, the 6th suffix of the suffix at position 1020, for instance, is the prefix of the suffix at position 1026 up to depth m − 6. By this, the cost to guarantee/create a leaf node for the 6th suffix of suffix 1020 is to start from the leaf of suffix 1026 and walk backward up to depth m − 6, then make sure we have a leaf node there or create a new one. Again, there is no need to walk along the edges explicitly, as reading the lengths of the edges is enough. In conclusion, guaranteeing/creating a leaf node for each of the considered suffixes at the internal nodes needs only an extra near-linear cost in time.

3. There is no need to build another compressed trie for the text in this problem, as we may consider the TST_m a sufficient representation for all the error trees. So, all operations and all the k error trees can be constructed within the TST_m, or using k independent trees.

After making these modifications, we can build the k-error trees using the same steps as in problem 1, and the cost will be O(n log^k n) words.

There is an exception for the case k = 1, because this case needs more time but less space than the general scheme suggests: we need to find AVN(·)[-1] for the considered suffixes at each internal node, but we only store the AVN(·)[-1] value, which takes constant space per entry.

## 6 Conclusion

In this paper, we introduced a tree structure that solves non-trivial problems within efficient bounds. For the problems of k-Hamming distance and k-wildcards matching, we proposed a structure that needs O(n log^k n) words and takes O(m^k + occ) query time (less in the average case) for any online/offline pattern, where occ is the number of outputs. A tree structure with moderately larger space and query time handles edit distance for any online/offline pattern.

## References

- [1] Karl Abrahamson. Generalized string matching. SIAM Journal on Computing, 16(6):1039–1051, 1987.
- [2] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’00, pages 794–803, Philadelphia, PA, USA, 2000. Society for Industrial and Applied Mathematics.
- [3] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards. Theory of Computing Systems, 55(1):41–60, 2014.
- [4] Ho-Leung Chan, Tak-Wah Lam, Wing-Kin Sung, Siu-Lung Tam, and Swee-Seong Wong. A linear size index for approximate pattern matching. In Combinatorial Pattern Matching, pages 49–59. Springer, 2006.
- [5] Luís Pedro Coelho and Arlindo L. Oliveira. Dotted suffix trees: a structure for approximate text indexing. In String Processing and Information Retrieval, pages 329–336. Springer, 2006.
- [6] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 91–100. ACM, 2004.
- [7] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984.
- [8] Trinh ND Huynh, Wing-Kai Hon, Tak-Wah Lam, and Wing-Kin Sung. Approximate string matching using compressed suffix arrays. Theoretical Computer Science, 352(1):240–249, 2006.
- [9] Tak-Wah Lam, Wing-Kin Sung, and Swee-Seong Wong. Improved approximate string matching using compressed suffix data structures. In Algorithms and Computation, pages 339–348. Springer, 2005.
- [10] Gad M Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239–249, 1986.
- [11] Moshe Lewenstein, J Ian Munro, Venkatesh Raman, and Sharma V Thankachan. Less space: Indexing for queries with wildcards. Theoretical Computer Science, 557:120–127, 2014.
- [12] Edward M McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM (JACM), 23(2):262–272, 1976.
- [13] Dekel Tsur. Fast index for approximate string matching. Journal of Discrete Algorithms, 8(4):339–345, 2010.
- [14] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.
- [15] Peter Weiner. Linear pattern matching algorithms. In IEEE Conference Record of the 14th Annual Symposium on Switching and Automata Theory, pages 1–11. IEEE, 1973.