String Indexing for Patterns with Wildcards

A preliminary version appeared in Proceedings of the 13th Scandinavian Symposium and Workshops on Algorithm Theory, Lecture Notes in Computer Science, vol. 7357, pp. 283–294, Springer 2012.


Abstract

We consider the problem of indexing a string t of length n to report the occurrences of a query pattern p containing m characters and j wildcards. Let occ be the number of occurrences of p in t, and σ the size of the alphabet. We obtain the following results.

  • A linear space index with query time O(m + σ^j log log n + occ). This significantly improves the previously best known linear space index by Lam et al. [ISAAC 2007], which requires query time Θ(jn) in the worst case.

  • An index with query time O(m + j + occ) using space O(σ^{k^2} n log^k log n), where k is the maximum number of wildcards allowed in the pattern. This is the first non-trivial bound with this query time.

  • A time-space trade-off, generalizing the index by Cole et al. [STOC 2004].

We also show that these indexes can be generalized to allow variable length gaps in the pattern. Our results are obtained using a novel combination of well-known and new techniques, which could be of independent interest.

1 Introduction

The string indexing problem is to build an index for a string t such that the occurrences of a query pattern p can be reported. The classic suffix tree data structure [38] combined with perfect hashing [15] gives a linear space solution for string indexing with optimal query time, i.e., an O(n) space data structure that supports queries in O(m + occ) time, where occ is the number of occurrences of p in t.

Recently, various extensions of the classic string indexing problem that allow errors or wildcards (also known as gaps or don't cares) have been studied [11]. In this paper, we focus on one of the most basic of these extensions, namely, string indexing for patterns with wildcards. In this problem, only the pattern contains wildcards, and the goal is to report all occurrences of p in t, where a wildcard is allowed to match any character in t.

String indexing for patterns with wildcards finds several natural applications in large-scale data processing areas such as information retrieval, bioinformatics, data mining, and internet traffic analysis. For instance in bioinformatics, the PROSITE database [21] supports searching for protein patterns containing wildcards.

Despite significant interest in the problem and its many variations, most of the basic questions remain unsolved. We introduce three new indexes and obtain several new bounds for string indexing with wildcards in the pattern. If the index can handle patterns containing an unbounded number of wildcards, we call it an unbounded wildcard index; otherwise we refer to the index as a k-bounded wildcard index, where k is the maximum number of wildcards allowed in p. Let n be the length of the indexed string t, and σ be the size of the alphabet. We define m and j to be the number of characters and wildcards in p, respectively. Consequently, the length of p is m + j. We show that:

  • There is an unbounded wildcard index with query time O(m + σ^j log log n + occ) using linear space. This significantly improves the previously best known linear space index by Lam et al. [24], which requires query time Θ(jn) in the worst case. Compared to the index by Cole et al. [11] having the same query time, we improve the space usage by a factor log n.

  • There is a k-bounded wildcard index with query time O(m + j + occ) using space O(σ^{k^2} n log^k log n). This is the first non-trivial space bound with this query time.

  • There is a time-space trade-off for k-bounded wildcard indexes. This trade-off generalizes the index described by Cole et al. [11].

Furthermore, we generalize these indexes to support variable length gaps in the pattern.

1.1 Previous Work

Exact string matching has been generalized with error bounds in many different ways. In particular, allowing matches within a bounded Hamming or edit distance, known as approximate string matching, has been the subject of much research [25]. Another generalization was suggested by Fischer and Paterson [14], allowing wildcards in the text or pattern.

Work on the wildcard problem has mostly focused on the non-indexing variant, where the string is not preprocessed in advance [14]. Some solutions to the indexing problem consider the case where wildcards appear only in the indexed string [36] or in both the string and the pattern [11].

In the following, we summarize the known indexes that support wildcards in the pattern only. We focus on the case where j ≥ 1, since for j = 0 the problem is classic string indexing. For k = 1, Cole et al. [11] describe a selection of specialized solutions. However, these solutions do not generalize to larger k.

Several simple solutions to the problem exist. Using a suffix tree for t [38], we can find all occurrences of p in a top-down traversal starting from the root. When we reach a wildcard character in p in a location ℓ, the search branches out, consuming the first character on all outgoing edges from ℓ. This gives an unbounded wildcard index using O(n) space with query time O(σ^j m + occ), where occ is the total number of occurrences of p in t. Alternatively, we can build a compressed trie storing all possible modifications of all suffixes of t containing at most k wildcards. This gives a k-bounded wildcard index using O(n^{k+1}) space with query time O(m + j + occ).
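The branching behaviour of the first simple solution can be made concrete with a small sketch. This is a toy Python implementation over an uncompressed suffix trie (so it uses O(n^2) space rather than the O(n) of a suffix tree); '*' denotes a wildcard and the leaf marker '$' is assumed not to occur in t.

```python
def build_suffix_trie(t):
    """Uncompressed trie over all suffixes of t; leaves record start positions."""
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
        node.setdefault('$', i)  # leaf marker holds the suffix start position
    return root

def collect(node, out):
    """Gather the start positions of all suffixes below node."""
    for key, child in node.items():
        if key == '$':
            out.append(child)
        else:
            collect(child, out)

def search(root, p):
    """Start positions of occurrences of p in t; '*' matches any character."""
    frontier = [root]
    for c in p:
        nxt = []
        for node in frontier:
            if c == '*':  # wildcard: branch out on all outgoing edges
                nxt.extend(child for key, child in node.items() if key != '$')
            elif c in node:
                nxt.append(node[c])
        frontier = nxt
    occ = []
    for node in frontier:
        collect(node, occ)
    return sorted(occ)
```

Each wildcard multiplies the number of active locations by at most σ, which is exactly the σ^j factor in the query time.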

In 2004, Cole et al. [11] gave an elegant k-bounded wildcard index using O(n log^k n / k!) space with O(m + 2^j log log n + occ) query time. For sufficiently small values of k and j this significantly improves the previous bounds. The key components in this solution are a new data structure for longest common prefix (LCP) queries and a heavy path decomposition [20] of the suffix tree for the text t. Given a pattern p, the LCP data structure supports efficient insertion of all suffixes of p into the suffix tree for t, such that subsequent longest common prefix queries between any pair of suffixes from p and t can be answered in O(log log n) time. This is where the log log n term in the query time comes from. The heavy path decomposition partitions the suffix tree into disjoint heavy paths such that any root-to-leaf path contains at most a logarithmic number of heavy paths. Cole et al. [11] show how to reduce the size of the index by only creating additional wildcard tries for the off-path subtries. This leads to the O(n log^k n / k!) space bound. Secondly, using the new tries, the top-down search branches at most twice for each wildcard, leading to the 2^j term in the query time. Though Cole et al. [11] did not consider unbounded wildcard indexes, the technique can be extended to this case by using only the LCP data structure and omitting the additional wildcard tries. This leads to an unbounded wildcard index with query time O(m + σ^j log log n + occ) using space O(n log n).

The solutions described by Cole et al. [11] all have bounds which are exponential in the number of wildcards in the pattern. Very recently, Lewenstein [27] used similar techniques to improve the bounds to be exponential in the number of gaps in the pattern (a gap is a maximal substring of consecutive wildcards). Assuming that the pattern contains at most gaps each of size at most , Lewenstein obtains a bounded index with query time using space , where is the number of gaps in the pattern.

A different approach was taken by Iliopoulos and Rahman [22], who describe an unbounded wildcard index using linear space. For a pattern consisting of strings (subpatterns) interleaved by wildcards, the query time of the index is O(m + α), where α denotes the number of matches of the subpatterns in t. This was later improved by Lam et al. [24] with an index that determines complete matches by first identifying potential matches of the subpatterns in t and subsequently verifying each possible match for validity using interval stabbing on the subpatterns. Their solution is an unbounded wildcard index using linear space. However, both of these solutions have a worst case query time of Θ(jn), since there may be Θ(n) matches for a subpattern, but no matches of p. Table 1 summarizes the existing solutions for the problem in relation to our results.

Table 1: Overview of the known and new solutions. The term α denotes the number of matches of the subpatterns of p in t and is Θ(n) in the worst case.

Type        Solution
---------   ------------------------------------------
Unbounded   Iliopoulos and Rahman [22]
            Lam et al. [24]
            Simple suffix tree index
            ART decomposition (this paper)
            Cole et al. [11]
k-bounded   Heavy β-tree decomposition (this paper)
            Cole et al. [11]
            Special index for k = 1
            Simple linear time index

The unbounded wildcard index by Iliopoulos and Rahman [22] was the first index to achieve query time linear in m while using O(n) space. Recently, Chan et al. [6] considered the related problem of obtaining a k-mismatch index supporting queries in time linear in the length of the pattern. They describe such an index; however, their bound assumes a constant-size alphabet and a constant number of errors. In this paper we make no assumptions on the size of these parameters.

1.2 Our Results

Our main contribution is three new wildcard indexes.

Compared to the solution by Cole et al. [11], we obtain the same query time while reducing the space usage by a factor log n. We also significantly improve upon the previously best known linear space index by Lam et al. [24], as we match the linear space usage while improving the worst-case query time from Θ(jn) to O(m + σ^j log log n + occ). Our solution is faster than the simple suffix tree index for m = Ω(log log n). Thus, for sufficiently small j we improve upon the previously known unbounded wildcard indexes.

The main idea of the solution is to combine an ART decomposition [1] of the suffix tree for t with the LCP data structure. The suffix tree is decomposed into a number of logarithmic-sized bottom trees and a single top tree. We introduce a new variant of the LCP data structure for use on the bottom trees, which supports queries in logarithmic time and linear space. The logarithmic size of the bottom trees leads to LCP queries in time O(log log n). On the top tree we use the LCP data structure by Cole et al. [11], which also answers queries in time O(log log n). The number of LCP queries performed during a search for p is O(σ^j), yielding the σ^j log log n term in the query time. The reduced size of the top tree causes the index to be linear in size.

The theorem provides a time-space trade-off for k-bounded wildcard indexes. Compared to the index by Cole et al. [11], we reduce the space usage by increasing the branching factor from 2 to β. For β = 2 the index is identical to the index by Cole et al. The result is obtained by generalizing the wildcard index described by Cole et al. We use a heavy β-tree decomposition, which is a new technique generalizing the classic heavy path decomposition by Harel and Tarjan [20]. This decomposition could be of independent interest. We also show that for β = 1 the same technique yields an index with query time O(m + j + occ) using space O(n h^k), where h is the height of the suffix tree for t.

To our knowledge this is the first linear time index with a non-trivial space bound. The result improves upon the space usage of the simple linear time index. To achieve this result, we use the O(n h^k) space index to obtain a black-box reduction that can produce a linear time index from an existing index. The idea is to build the O(n h^k) space index with support for short patterns only, and query another index if the pattern is long. This technique is closely related to the concept of filtering search introduced by Chazelle [7] and has previously been applied for indexing problems [3]. The theorem follows from applying the black-box reduction to the index of our first theorem.

Variable Length Gaps

We also show that the three indexes support searching for query patterns with variable length gaps, i.e., patterns of the form p = p_1 {a_1, b_1} p_2 {a_2, b_2} ... {a_{l-1}, b_{l-1}} p_l, where {a_i, b_i} denotes a variable length gap that matches an arbitrary substring of length between a_i and b_i, both inclusive.

String indexing for patterns with variable length gaps has applications in information retrieval, data mining and computational biology [16]. In particular, the PROSITE database [21] uses patterns with variable length gaps to identify and classify protein sequences. The problem is a generalization of string indexing for patterns with wildcards, since a wildcard is equivalent to the variable length gap { 1, 1 }. Variable length gaps are also known as bounded wildcards, as a variable length gap { a_i, b_i } can be regarded as a bounded sequence of wildcards.

String indexing for patterns with variable length gaps is equivalent to string indexing for patterns with wildcards, with the addition of allowing optional wildcards in the pattern. An optional wildcard matches any character from the alphabet or the empty string, i.e., an optional wildcard is equivalent to the variable length gap { 0, 1 }. Conversely, we may also consider a variable length gap { a_i, b_i } as a_i consecutive wildcards followed by b_i - a_i consecutive optional wildcards.

Lam et al. [24] introduced optional wildcards in the pattern and presented a variant of their solution for the string indexing for patterns with wildcards problem. The idea is to determine potential matches and verify complete matches using interval stabbing on the possible positions for the subpatterns. This leads to an unbounded optional wildcard index with query time and space usage . Here and denotes the number of matches of in , and since in the worst case, the worst case query time is . Recently, Lewenstein [27] considered the special case where the pattern contains at most gaps and for all , i.e., the gaps are non-variable and of length at most . Using techniques similar to those by Cole et al. [11], he gave a bounded index with query time using space , where is the number of gaps in the pattern.

The related string matching with variable length gaps problem, where the text may not be preprocessed in advance, has received some research attention recently [4]. However, none of the results and techniques developed for this problem appear to lead to non-trivial bounds for the indexing problem.

Our Results for Variable Length Gaps. To introduce our results we let A and B denote the sum of the lower and upper bounds on the variable length gaps in p, respectively. Hence A and B - A denote the number of normal and optional wildcards in p, respectively. A wildcard index with support for optional wildcards is called an optional wildcard index. As for wildcard indexes, we distinguish between bounded and unbounded optional wildcard indexes. A bounded optional wildcard index supports patterns with a bounded number of normal and optional wildcards. An unbounded optional wildcard index supports patterns with no restriction on the number of normal and optional wildcards.

To accommodate variable length gaps in the pattern, we only need to modify the way in which the wildcard indexes are searched, leading to the following new theorems.

These results completely generalize our previous solutions: if the query pattern only contains variable length gaps of the form { 1, 1 }, the problem reduces to string indexing for patterns with wildcards. In that case A = B = j, and we recover exactly the bounds of our wildcard indexes.

Compared to the only known index for the problem by Lam et al. [24], we give an unbounded optional wildcard index that matches the space usage, but improves the worst-case query time, provided that the total gap sizes are sufficiently small.

2 Preliminaries

We introduce the following notation. Let be a pattern consisting of strings (subpatterns) interleaved by wildcards. The substring starting at position in is an occurrence of if and only if each subpattern matches the corresponding substring in . That is,

where t[i..j] denotes the substring of t between indices i and j, both inclusive. We define t[i..j] to be the empty string for i > j, t[i..j] = t[1..j] for i < 1, and t[i..j] = t[i..n] for j > n. Furthermore, m is the total number of characters in the subpatterns of p, and we assume without loss of generality that m > 0 and j > 0.

Let pre_i(s) and suf_i(s) denote the prefix and suffix of s of length i, respectively. Omitting the subscripts, we let pre(s) and suf(s) denote the set of all non-empty prefixes and suffixes of s, respectively. We extend the definitions of prefix and suffix to sets of strings as follows.

A set of strings is prefix-free if no string in is a prefix of another string in . Any string set can be made prefix-free by appending the same unique character to each string in .
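This transformation can be sketched in a couple of lines, assuming the sentinel character does not occur in any string of the set:

```python
def is_prefix_free(strings):
    """Check that no string in the set is a proper prefix of another.
    Sorting places any prefix immediately before a string extending it."""
    ss = sorted(strings)
    return all(not b.startswith(a) for a, b in zip(ss, ss[1:]))

def make_prefix_free(strings, sentinel='$'):
    """Append a character outside the alphabet to every string."""
    return [s + sentinel for s in strings]
```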

2.1 Trees and Tries

For a tree T, the root is denoted root(T), while height(T) is the number of edges on a longest path from root(T) to a leaf of T. A compressed trie T(S) is a tree storing a prefix-free set of strings S. The edges are labeled with substrings of the strings in S, such that a path from the root to a leaf corresponds to a unique string in S. All internal vertices (except the root) have at least two children, and all labels on the outgoing edges of a vertex have different initial characters.

A location ℓ may refer to either a vertex or a position on an edge in T(S). Formally, ℓ = (v, s), where v is a vertex in T(S) and s is a prefix of the label on an outgoing edge of v. If s is the empty string, we also refer to ℓ as an explicit vertex; otherwise ℓ is called an implicit vertex. There is a one-to-one mapping between locations in T(S) and unique prefixes in pre(S). The prefix corresponding to a location ℓ is obtained by concatenating the edge labels on the path from root(T(S)) to ℓ. Consequently, we use ℓ and the corresponding prefix interchangeably, and we let |ℓ| denote the length of that prefix. Since S is assumed prefix-free, each leaf of T(S) is a string in S, and conversely. The suffix tree for t denotes the compressed trie over all suffixes of t, i.e., T(suf(t)). We define T_ℓ(S) as the subtrie of T(S) rooted at ℓ. That is, T_ℓ(S) contains the suffixes of strings in S starting from ℓ. Formally, T_ℓ(S) = T(S_ℓ), where S_ℓ is the set of suffixes obtained by removing the prefix ℓ from the strings in S that have ℓ as a prefix.
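As a concrete illustration, a compressed trie over a prefix-free set can be sketched as nested dicts keyed by edge labels. This is a toy construction; the actual data structure additionally stores labels at the leaves and supports constant-time child lookup via perfect hashing.

```python
def compressed_trie(strings):
    """Build a compressed trie of a prefix-free string set as nested dicts
    mapping edge labels (substrings) to subtries; leaves map to {}."""
    def build(suffixes):
        node, groups = {}, {}
        for s in suffixes:  # group by first character
            groups.setdefault(s[0], []).append(s)
        for c, group in groups.items():
            # the edge label is the longest common prefix of the group
            lcp = group[0]
            for s in group[1:]:
                while not s.startswith(lcp):
                    lcp = lcp[:-1]
            rest = [s[len(lcp):] for s in group if len(s) > len(lcp)]
            node[lcp] = build(rest) if rest else {}
        return node
    return build([s for s in strings if s])
```

Because the input is prefix-free, a group shrinks to a single string exactly when its common prefix equals that string, so every internal vertex ends up with at least two outgoing edges starting with distinct characters.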

2.2 Heavy Path Decomposition

For a vertex v in a rooted tree T, we define weight(v) to be the number of leaves in T(v), where T(v) denotes the subtree rooted at v. We define weight(T) = weight(root(T)). The heavy path decomposition of T, introduced by Harel and Tarjan [20], classifies each edge as either light or heavy. For each vertex v, we classify the edge going from v to its child of maximum weight (breaking ties arbitrarily) as heavy. The remaining edges are light. This construction has the property that on a path from the root to any vertex, at most O(log(weight(T))) heavy paths are traversed. For a heavy path decomposition of a compressed trie T(S), we assume that the heavy paths are extended such that the label on each light edge contains exactly one character.

3 The LCP Data Structure

Cole et al. [11] introduced the Longest Common Prefix (LCP) data structure, which provides a way to traverse a compressed trie without tracing the query string one character at a time. In this section we give a brief, self-contained description of the data structure and show a new property that is essential for obtaining our results.

The LCP data structure stores a collection of compressed tries T(C_1), T(C_2), ..., T(C_q) over the string sets C_1, C_2, ..., C_q. Each C_i is a set of substrings of the indexed string t. The purpose of the LCP data structure is to support LCP queries of the form LCP(x, i, ℓ): return the location in T(C_i) where the search for the string x stops when started at location ℓ.

If ℓ is the root of T(C_i), we refer to the above LCP query as a rooted LCP query. Otherwise the query is called an unrooted LCP query. In addition to the compressed tries T(C_1), ..., T(C_q), the LCP data structure also stores the suffix tree for t, denoted T(C), where C = suf(t). The following lemma is implicit in the paper by Cole et al. [11].

We extend the LCP data structure by showing that support for slower unrooted LCP queries on a compressed trie T(C_i) can be added using linear additional space.

We initially create a heavy path decomposition for all compressed tries T(C_1), ..., T(C_q). The search path for a string x starting in a location ℓ traverses a number of heavy paths in T(C_i). Intuitively, an unrooted LCP query can be answered by following the heavy paths that the search path passes through. For each heavy path, the next heavy path can be identified in constant time. On the final heavy path, a predecessor query is needed to determine the exact location where the search path stops.

For a heavy path H, we let d(x, H) denote the distance that the search path for x follows H. Cole et al. [11] showed that d(x, H) can be determined in constant time by performing nearest common ancestor queries on T(C). To answer an unrooted LCP query we identify the heavy path H of T(C_i) that ℓ is part of and compute d(x, H) as described by Cole et al. If the search path leaves H on a light edge, indexing distance d(x, H) into H from ℓ yields an explicit vertex v. At v, a constant time lookup for the next character of x determines the light edge on which the search path leaves H. Since the light edge has a label of length one, the next location on that edge is the root of the next heavy path. We continue the search for the remaining suffix of x from that root recursively by a new unrooted LCP query. If H' is the heavy path on which the search for x stops, the location at distance d(x, H') (i.e., the answer to the original LCP query) is not necessarily an explicit vertex, and may not be found by indexing into H'. In that case a predecessor query for d(x, H') is performed on H' to determine the preceding explicit vertex and thereby the location. Answering an unrooted LCP query entails at most O(log |C_i|) recursive steps, each taking constant time. The final recursive step may require a predecessor query, answered in O(log |C_i|) time by binary search over the explicit vertices of the heavy path. Consequently, an unrooted LCP query can be answered in O(log |C_i|) time using O(|C_i|) additional space to store the predecessor data structures for each heavy path.

4 An Unbounded Wildcard Index Using Linear Space

In this section we obtain our linear space index by applying an ART decomposition on the suffix tree for t and storing the top and bottom trees in the LCP data structure.

4.1 ART Decomposition

The ART decomposition introduced by Alstrup et al. [1] decomposes a tree into a single top tree and a number of bottom trees. The construction is defined by two rules:

  1. A bottom tree is a subtree rooted in a vertex of minimal depth such that the subtree contains no more than x leaves, where x is a parameter of the decomposition.

  2. Vertices that are not in any bottom tree make up the top tree.

The decomposition has the following key property: since every vertex of the top tree has more than x leaves below it, the top tree has at most n/(x+1) leaves.
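The two rules can be sketched as follows for a tree given as an adjacency map; the vertex names and the parameter x are illustrative.

```python
def art_decomposition(children, root, x):
    """Sketch of an ART decomposition: bottom trees are subtrees with at
    most x leaves rooted at vertices of minimal depth; all remaining
    vertices form the top tree."""
    leaves = {}
    def count(v):  # number of leaves below v
        cs = children.get(v, [])
        leaves[v] = 1 if not cs else sum(count(c) for c in cs)
        return leaves[v]
    count(root)
    top, bottom_roots = [], []
    stack = [root]
    while stack:
        v = stack.pop()
        if leaves[v] <= x:
            bottom_roots.append(v)   # v roots a bottom tree
        else:
            top.append(v)            # v belongs to the top tree
            stack.extend(children.get(v, []))
    return top, bottom_roots
```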

4.2 Obtaining the Index

Applying an ART decomposition on the suffix tree T(C) with x = log n, we obtain a top tree T' and a number of bottom trees, each containing at most log n leaves. By the key property of the decomposition, T' has at most n/(x+1) = O(n/log n) leaves and hence O(n/log n) vertices, since T' is a compressed trie.

To facilitate the search, the top and bottom trees are stored in an LCP data structure, noting that these compressed tries only contain substrings of t. Using the extension from the previous section, we add support for unrooted LCP queries in time O(log log n) on the bottom trees, using O(n) additional space in total. For the top tree we apply the construction of Cole et al. to add support for unrooted LCP queries in time O(log log n), using O((n/log n) log n) = O(n) additional space. Since the branching factor is not reduced, O(σ^j) LCP queries, each taking time O(log log n), are performed during the search for p. This concludes the proof.

5 A Time-Space Trade-Off for k-Bounded Wildcard Indexes

In this section we prove the time-space trade-off. We first introduce the necessary constructions.

5.1 Heavy β-Tree Decomposition

The heavy β-tree decomposition is a generalization of the well-known heavy path decomposition introduced by Harel and Tarjan [20]. The purpose is to decompose a rooted tree T into a number of heavy trees joined by light edges, such that a path to the root of T traverses at most a logarithmic number of heavy trees. For use in the construction, we define a proper weight function on the vertices of T to be a function weight satisfying weight(v) ≥ Σ_w weight(w), where the sum is over the children w of v. Observe that using the number of vertices or the number of leaves in the subtree rooted at v as the weight of v satisfies this property. The decomposition is then constructed by classifying edges in T as being heavy or light according to the following rule. For every vertex v, the edges to the β heaviest children of v (breaking ties arbitrarily) are heavy, and the remaining edges are light. For β = 1 this results in a heavy path decomposition. Given a heavy β-tree decomposition of T, we define lightdepth(v) to be the number of light edges on a path from the vertex v to the root of T. The key property of this construction is captured by the following lemma: for any vertex v, lightdepth(v) ≤ log_{β+1} weight(T).

Consider a light edge from a vertex v to its child w. We prove that weight(w) ≤ weight(v)/(β+1), implying that lightdepth(v) ≤ log_{β+1} weight(T). To obtain a contradiction, suppose that weight(w) > weight(v)/(β+1). In addition to w, v must have β heavy children, each of which has a weight greater than or equal to weight(w). Hence

weight(v) ≥ (β+1) weight(w) > weight(v),

which is a contradiction.

The lemma holds for any heavy β-tree decomposition obtained using a proper weight function on T. In the remaining part of the paper we will assume that the weight of a vertex is the number of leaves in the subtree rooted at it. See the figure for two different examples of heavy β-tree decompositions.

We define lightheight(T) to be the maximum light depth of a vertex in T, and remark that for β = 0, lightheight(T) = height(T). For a vertex v in a compressed trie T(S), we let C_light(v) denote the set of strings starting in one of the light edges leaving v. That is, C_light(v) is the union of the sets of strings in the subtries T_ℓ(S), where ℓ ranges over the first locations on the light outgoing edges of v.
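The decomposition can be sketched as follows, using leaf counts as the proper weight function; the function names are illustrative, not the paper's.

```python
def heavy_beta_decomposition(children, root, beta):
    """For each vertex, the edges to its beta heaviest children are heavy
    and the rest are light (beta = 1 gives a heavy path decomposition).
    Returns the set of light edges and the leaf-count weights."""
    weight, stack = {}, [(root, False)]
    while stack:                      # iterative post-order weight pass
        v, done = stack.pop()
        if done:
            cs = children.get(v, [])
            weight[v] = 1 if not cs else sum(weight[c] for c in cs)
        else:
            stack.append((v, True))
            stack.extend((c, False) for c in children.get(v, []))
    light = set()
    for v, cs in children.items():
        ranked = sorted(cs, key=weight.__getitem__, reverse=True)
        light.update((v, c) for c in ranked[beta:])
    return light, weight

def light_depths(children, root, light):
    """lightdepth(v) = number of light edges on the root-to-v path."""
    depth, stack = {root: 0}, [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, []):
            depth[c] = depth[v] + ((v, c) in light)
            stack.append(c)
    return depth
```

By the lemma, the maximum value returned by light_depths is at most log base beta+1 of the number of leaves.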

5.2 Wildcard Trees

We introduce the β-wildcard tree, denoted T_β^k(C), where β ≥ 1 is a chosen parameter. This data structure stores a collection of strings C in a compressed trie such that the search for a pattern p with at most k wildcards branches to at most β locations when consuming a single wildcard of p. In particular for β = 1, the search for p never branches and the search time becomes linear in the length of p. For a vertex v, we define the wildcard height of v to be the number of wildcards on the path from v to the root. Intuitively, given a wildcard tree that supports i wildcards, support for an extra wildcard is added by joining a new tree to each vertex v with wildcard height i by an edge labeled *. This tree is searched if a wildcard is consumed in v. Formally, T_β^k(C) is built recursively as follows.

Construction of T_β^k(C): Produce a heavy (β-1)-tree decomposition of T(C), then for each internal vertex v join the root of T_β^{k-1}(C_light(v)) to v by an edge labeled *. Let T_β^0(C) = T(C).

The construction is illustrated in the figure below. Since a leaf in a compressed trie T(S_ℓ) is obtained as the suffix of a string s in S, we assume that it inherits the label of s in case the strings in S are labeled. For example, when C denotes the suffixes of t, we will label each suffix in C with its start position in t. This immediately provides us with a k-bounded wildcard index. The figure shows some concrete examples of the construction of T_β^k(C) when C is a set of labeled suffixes.

Figure: Illustration of the recursive construction of the wildcard tree T_β^k(C'). The final tree consists of k layers of compressed tries joined by edges labeled *.

5.3 Wildcard Tree Index

Given a collection of strings C and a pattern p, we can identify the strings of C having a prefix matching p by constructing T_β^k(C). Searching is similar to the suffix tree search, except when consuming a wildcard character of p in an explicit vertex v with more than β - 1 children. In that case the search branches to the root of the wildcard tree joined to v and to the first location on each of the β - 1 heavy edges of v, effectively letting the wildcard match the first character on all edges from v. Consequently, the search for p branches to a total of at most β^j locations, each of which requires time linear in the length of p, resulting in a query time O(β^j (m + j) + occ). For β = 1 the query time is O(m + j + occ).

We prove that the total number of strings (leaves) in T_β^k(C), denoted |T_β^k(C)|, is at most |C|(h + 1)^k, where h is an upper bound on the light height of the compressed tries in the construction. The proof is by induction on k. The base case k = 0 holds, since T_β^0(C) = T(C) contains |C| strings. For the inductive step, assume that |T_β^{k-1}(C')| ≤ |C'|(h + 1)^{k-1} for any string set C'. Let C_light(v) be defined for a vertex v of T(C) as above. From the construction we have that the number of strings in T_β^k(C) is the number of strings in T(C) plus the number of strings in the wildcard trees joined to the vertices of T(C). That is,

|T_β^k(C)| = |C| + Σ_v |T_β^{k-1}(C_light(v))| ≤ |C| + Σ_v |C_light(v)| (h + 1)^{k-1}.

The string sets C_light(v) consist of suffixes of strings in C. Consider a string s in C, i.e., a leaf in T(C). The number of times a suffix of s appears in a set C_light(v) is equal to the light depth of the leaf s in T(C). Each C_light(v) is also a set of suffixes of strings in C, and hence h is an upper bound on the maximum light depth of all the tries. This establishes that Σ_v |C_light(v)| ≤ h|C|, thus showing that |T_β^k(C)| ≤ |C| + h|C|(h + 1)^{k-1} ≤ |C|(h + 1)^k.

Constructing the wildcard tree T_β^k(C), where C is the set of suffixes of t, we obtain a wildcard index with the following properties.

The query time follows from the search procedure described above. Since T_β^k(C) is a compressed trie, and because each edge label is a substring of t, the space needed to store T_β^k(C) is linear in the number of strings it contains, which by the bound above is at most n(h + 1)^k. It follows from the lemma that log_β n is an upper bound on the light height of all compressed tries in the construction, since they each contain at most n strings. Consequently, the space needed to store the index is O(n log_β^k n).

5.4 Wildcard Tree Index Using the LCP Data Structure

The wildcard index just described reduces the branching factor of the suffix tree search from σ to β, but still has the drawback that the search for a subpattern from a location takes time linear in the length of the subpattern. This can be addressed by combining the index with the LCP data structure as in Cole et al. [11]. In that way, the search for a subpattern can be done in time O(log log n). The index is obtained by modifying the construction of T_β^k(C) such that each compressed trie is added to the LCP data structure prior to joining the wildcard trees to its vertices. For all tries except those at the last level, support for unrooted LCP queries in time O(log log n) is added. For the tries at the last level, searched when all k wildcards have been matched, we only need support for rooted queries. Upon receiving the query pattern p, it is preprocessed in O(m + j) time to support LCP queries for any suffix of p. The search for p proceeds as described for the normal wildcard tree, except now rooted and unrooted LCP queries are used to search for suffixes of p.

In the search for p, a total of O(β^j) LCP queries, each taking time O(log log n), are performed. Preprocessing takes O(m + j) time, so the query time is O(m + β^j log log n + occ). The space needed to store the index is O(n log_β^k n) for T_β^k(C) plus the space needed to store the LCP data structure.

Adding support for rooted LCP queries requires linear space in the total size of the compressed tries, i.e., O(n log_β^k n). Let T(C_1), ..., T(C_q) denote the compressed tries with support for unrooted LCP queries, i.e., the tries at all levels but the last. Since each C_i contains at most n strings and Σ_i |C_i| = O(n log_β^{k-1} n), by the construction of Cole et al., the additional space required to support unrooted LCP queries is

O(n log n log_β^{k-1} n),

which is an upper bound on the total space required to store the wildcard index. This concludes the proof of the trade-off. The k-bounded wildcard index described by Cole et al. [11] is obtained as the special case β = 2.

6 A k-Bounded Wildcard Index with Linear Query Time

Consider the k-bounded wildcard index obtained by creating the wildcard tree T_1^k(C) for the set C of suffixes of t. This index has linear query time, and we can show that the space usage depends on the height of the suffix tree.

Since C is closed under the suffix operation, the height h of the suffix tree T(C) is an upper bound on the height of all compressed tries arising in the construction. For β = 1, the light height of a trie is equal to its height, so h can be used as an upper bound on the light height in the construction, and consequently the space needed to store T_1^k(C) is O(n h^k).

In the worst case the height of the suffix tree is Θ(n), but combining the index with another wildcard index yields a useful black-box reduction. The idea is to query the first index if the pattern is short, and the second index if the pattern is long.

The combined wildcard index consists of an index I_1 as well as a special wildcard index I_2, which is a wildcard tree with β = 1 over the set of all substrings of t of length g. The length g is an upper bound on the light height in the construction, so the space required to store I_2 is O(n g^k). A query results in a query on either I_1 or I_2. In case m + j ≤ g, we query I_2 and the query time will be O(m + j + occ). In case m + j > g, we query I_1. In any case the query time of the combined index is at most that of I_1 plus O(m + j + occ).

Applying this reduction with g = σ^k log log n to the unbounded wildcard index of our first theorem yields a new k-bounded wildcard index with linear query time O(m + j + occ) using space O(σ^{k^2} n log^k log n), since for long patterns the σ^j log log n term is dominated by the pattern length. This concludes the proof.
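The dispatch at the heart of the reduction can be sketched as follows. Both component indexes are stood in for by a brute-force matcher here (the real construction uses the dense wildcard tree for short patterns and any other wildcard index for long ones); the class names are illustrative.

```python
class BruteForceWildcardIndex:
    """Stand-in for a real wildcard index: scans t directly."""
    def __init__(self, t):
        self.t = t
    def query(self, p):
        t = self.t
        return [i for i in range(len(t) - len(p) + 1)
                if all(pc == '*' or pc == t[i + h]
                       for h, pc in enumerate(p))]

class CombinedIndex:
    """Black-box reduction: short patterns go to the index built over
    short substrings only; long patterns go to the other index."""
    def __init__(self, t, g):
        self.g = g                                      # length threshold
        self.short_index = BruteForceWildcardIndex(t)   # stand-in for T_1^k
        self.long_index = BruteForceWildcardIndex(t)    # stand-in for the
                                                        # unbounded index
    def query(self, p):
        idx = self.short_index if len(p) <= self.g else self.long_index
        return idx.query(p)
```

The point of the threshold g is that any super-linear term in the long index's query time is dominated by the pattern length once the pattern is longer than g.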

7 Variable Length Gaps

We now consider the string indexing for patterns with variable length gaps problem. By only changing the search procedure, this problem can be solved using the previously described bounded and unbounded wildcard indexes.

The string indexing for patterns with variable length gaps problem is to build an index for a string t that can efficiently report the occurrences of a query pattern p of the form

p = p_1 {a_1, b_1} p_2 {a_2, b_2} ... {a_{l-1}, b_{l-1}} p_l.

The query pattern consists of strings (subpatterns) p_1, ..., p_l interleaved by variable length gaps { a_i, b_i }, i = 1, ..., l-1, where a_i and b_i are integers such that 0 ≤ a_i ≤ b_i. Intuitively, a variable length gap { a_i, b_i } matches an arbitrary string over the alphabet of length between a_i and b_i, both inclusive.

Figure: The five occurrences of the query pattern p = b{ 0, 4 }cc{ 3, 5 }d in the string t = acbccbacccddabdaabcdccbccdaa.

As shown by the figure, different occurrences of the query pattern can start or end at the same position in t, and the same substring in t can contain multiple occurrences of p. Hence to completely characterize an occurrence of p in t, we need to report the positions of the individual subpatterns for each full occurrence of the pattern. However, in the following we will restrict our attention to reporting the start and end position of each occurrence of p in t. For the above example, the five occurrences would thus be reported as four distinct pairs of start and end positions.
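The matching semantics can be checked against the example above with a brute-force matcher (0-indexed positions; exponential in the number of gaps, so for illustration only). Note how the five occurrences collapse to four distinct (start, end) pairs.

```python
from itertools import product

def match_at(t, i, subpatterns, gap_lens):
    """Try to match at position i with every gap fixed to a chosen length;
    return the inclusive end position, or None on mismatch."""
    pos = i
    for idx, sub in enumerate(subpatterns):
        if t[pos:pos + len(sub)] != sub:
            return None
        pos += len(sub)
        if idx < len(gap_lens):
            pos += gap_lens[idx]
    return pos - 1

def occurrences(t, subpatterns, gaps):
    """All distinct (start, end) pairs, allowing overlapping occurrences.
    gaps is a list of (a_i, b_i) bounds between consecutive subpatterns."""
    out = set()
    for lens in product(*[range(a, b + 1) for a, b in gaps]):
        for i in range(len(t)):
            end = match_at(t, i, subpatterns, lens)
            if end is not None:
                out.add((i, end))
    return sorted(out)
```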

7.1Supporting Variable Length Gaps

Recall that a variable length gap {a_i, b_i} is equivalent to wildcards followed by optional wildcards. Hence, to support variable length gaps, we only need to describe how the search algorithms are modified to match an optional wildcard in . We simulate an optional wildcard as matching both a normal wildcard and the empty string. When matching a normal wildcard, the search can only branch at explicit vertices, but for optional wildcards the search always branches to at least two locations. This is the reason for the factor in the query times of – .
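Since a gap {a_i, b_i} equals a_i normal wildcards followed by b_i − a_i optional ones, resolving each optional wildcard to "present" or "absent" turns a pattern with variable length gaps into a set of fixed wildcard patterns, one per combination of gap lengths. The sketch below (function name illustrative) makes the resulting branching explicit:

```python
from itertools import product

def expand(subpatterns, gaps):
    """Expand p = s_0 {a_1,b_1} s_1 ... {a_k,b_k} s_k into fixed
    wildcard patterns over '*', one per combination of gap lengths."""
    choices = [range(a, b + 1) for a, b in gaps]
    patterns = []
    for lengths in product(*choices):
        parts = [subpatterns[0]]
        for g, s in zip(lengths, subpatterns[1:]):
            parts.append("*" * g + s)   # g = a_i normal + promoted optionals
        patterns.append("".join(parts))
    return patterns
```

The number of expanded patterns is the product of (b_i − a_i + 1) over all gaps, which is why naive expansion is exponential in the number of gaps, and why simulating an optional wildcard branches the search.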

To report the substrings of where the query pattern occurs, we assume that each leaf in has been labeled with the start position, , of the suffix of it corresponds to. The search for terminates in a set of locations , each corresponding to one or more substrings of where the query pattern occurs. We can report the start and end position of these substrings by traversing the subtrees rooted at the locations in . For a subtree rooted at we identify the leaves corresponding to suffixes of having as a prefix. The start and end positions of these substrings are then given by
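Given the leaf labels, reporting reduces to a subtree traversal. The sketch below assumes a hypothetical uncompressed trie node (a `children` map, with `label` set on leaves to the suffix start position); `depth` stands for the number of text characters matched at the search location, so each leaf below it contributes the inclusive pair (label, label + depth − 1):

```python
class Node:
    """Hypothetical trie node: `label` is the suffix start position on
    leaves and None on internal nodes."""
    def __init__(self, label=None):
        self.children = {}
        self.label = label

def report(location, depth):
    """Collect (start, end) for every leaf below a search location
    reached after matching `depth` characters of the pattern."""
    out, stack = [], [location]
    while stack:
        v = stack.pop()
        if v.label is not None:                    # leaf
            out.append((v.label, v.label + depth - 1))
        stack.extend(v.children.values())
    return sorted(out)
```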

7.2Analysis of the Modified Search

To analyse the query time, we bound the maximum number of LCP queries performed during a search for the query pattern

We define and . The number of normal and optional wildcards preceding the subpattern in is and , respectively. To bound the number of locations from which an LCP query for the subpattern can start, we choose and promote of the preceding optional wildcards to normal wildcards and discard the rest. For a specific choice there are exactly wildcards preceding , and thus the number of locations from which an LCP query for can start is at most . The term is an upper bound on the branching factor of the search when consuming a wildcard. For a suffix tree the branching factor is , but indexes based on wildcard trees can have a smaller branching factor. There are possibilities for choosing the optional wildcards, so the number of locations from which an LCP query for can start is at most

Summing over the subpatterns, we obtain a bound of on the number of LCP queries performed during a search for the query pattern . Since LCP queries are performed in time and we have to preprocess the pattern in time , the total query time becomes . This concludes the proof of and .

To show , we apply a black-box reduction very similar to , leading to a -bounded optional wildcard index, where and are the maximum number of normal and optional wildcards allowed in the pattern, respectively. This index consists of the following two optional wildcard indexes. A query is performed on one of these indexes depending on the length of the query pattern .

  1. The unbounded optional wildcard index given by . This index has query time and uses space .

  2. The -bounded optional wildcard index obtained by using the wildcard tree without the LCP data structure, where . For the search for the subpattern can start from at most locations. Searching for from each of these locations takes time , since the LCP data structure is not used and the tree must be traversed one character at a time. Summing over the subpatterns, we obtain the following query time for the index

    The index is a wildcard tree and by the same argument as for , it can be stored using space .

If the query pattern has length , we query the first index. It follows that , so the query time is . If has length , all occurrences of in can be found by querying the second index in time . The space of the index is

This concludes the proof of .

8Conclusion

We have presented several new indexes supporting patterns containing wildcards and variable length gaps. All previous wildcard indexes have query times that are either exponential in the number of wildcards or gaps in the pattern, or linear in the length of the indexed text. We showed that it is possible to obtain an index with linear query time while avoiding space usage exponential in the length of the indexed string. Moreover, we gave an index with linear space usage and a fast query time. For wildcard indexes with query time sublinear in the length of the indexed string, an interesting open problem is whether there is an index where neither the size nor the query time is exponential in the number of wildcards or gaps in the pattern.

References

  1. Marked ancestor problems.
    S. Alstrup, T. Husfeldt, and T. Rauhe. In Proc. 39th FOCS, pages 534–543, 1998.
  2. Faster algorithms for string matching with k mismatches.
    A. Amir, M. Lewenstein, and E. Porat. In Proc. 11th SODA, pages 794–803, 2000.
  3. Substring Range Reporting.
    P. Bille and I. L. Gørtz. In Proc. 22nd CPM, pages 299–308, 2011.
  4. String matching with variable length gaps.
    P. Bille, I. L. Gørtz, H. Vildhøj, and D. Wind. In Proc. 17th SPIRE, pages 385–394, 2010.
  5. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation.
    P. Bucher and A. Bairoch. In Proc. 2nd ISMB, pages 53–61, 1994.
  6. A linear size index for approximate pattern matching.
    H. L. Chan, T. W. Lam, W. K. Sung, S. L. Tam, and S. S. Wong. J. Disc. Algorithms, 9(4):358–364, 2011.
  7. Filtering search: A new approach to query-answering.
    B. Chazelle. SIAM J. Comput., 15(3):703–724, 1986.
  8. Efficient string matching with wildcards and length constraints.
    G. Chen, X. Wu, X. Zhu, A. Arslan, and Y. He. Knowl. Inf. Sys., 10(4):399–419, 2006.
  9. Simple deterministic wildcard matching.
    P. Clifford and R. Clifford. Inf. Process. Lett., 101(2):53–54, 2007.
  10. Dotted suffix trees: a structure for approximate text indexing.
    L. Coelho and A. Oliveira. In Proc. 13th SPIRE, pages 329–336, 2006.
  11. Dictionary matching and indexing with errors and don’t cares.
    R. Cole, L. Gottlieb, and M. Lewenstein. In Proc. 36th STOC, pages 91–100, 2004.
  12. Approximate string matching: A simpler faster algorithm.
    R. Cole and R. Hariharan. In Proc. 9th SODA, pages 463–472, 1998.
  13. Verifying candidate matches in sparse and wildcard matching.
    R. Cole and R. Hariharan. In Proc. 34th STOC, pages 592–601, 2002.
  14. String-Matching and Other Products.
    M. J. Fischer and M. S. Paterson. In Complexity of Computation, SIAM-AMS Proceedings, pages 113–125, 1974.
  15. Storing a Sparse Table with O(1) Worst Case Access Time.
    M. L. Fredman, J. Komlós, and E. Szemerédi. J. ACM, 31:538–544, 1984.
  16. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance.
    K. Fredriksson and S. Grabowski. Inf. Retr., 11(4):335–357, 2008.
  17. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance.
    K. Fredriksson and S. Grabowski. Inf. Retr., 11(4):335–357, 2008.
  18. Nested counters in bit-parallel string matching.
    K. Fredriksson and S. Grabowski. Proc. 3rd LATA, pages 338–349, 2009.
  19. Improved string matching with k mismatches.
    Z. Galil and R. Giancarlo. ACM SIGACT News, 17(4):52–54, 1986.
  20. Fast algorithms for finding nearest common ancestors.
    D. Harel and R. Tarjan. SIAM J. Comput., 13(2):338–355, 1984.
  21. The PROSITE database, its status in 1999.
    K. Hofmann, P. Bucher, L. Falquet, and A. Bairoch. Nucleic Acids Res., 27(1):215–219, 1999.
  22. Pattern matching algorithms with don’t cares.
    C. S. Iliopoulos and M. S. Rahman. In Proc. 33rd SOFSEM, pages 116–126, 2007.
  23. Efficient pattern-matching with don’t cares.
    A. Kalai. In Proc. 13th SODA, pages 655–656, 2002.
  24. Space efficient indexes for string matching with don’t cares.
    T. W. Lam, W. K. Sung, S. L. Tam, and S. M. Yiu. In Proc. 18th ISAAC, pages 846–857, 2007.
  25. Efficient string matching with k mismatches.
    G. Landau and U. Vishkin. Theoret. Comput. Sci., 43:239–249, 1986.
  26. Fast parallel and serial approximate string matching.
    G. Landau and U. Vishkin. J. Algorithms, 10(2):157–169, 1989.
  27. Indexing with gaps.
    M. Lewenstein. In Proc. 18th SPIRE, pages 135–143, 2011.
  28. Text indexing with errors.
    M. Maas and J. Nowak. J. Disc. Algorithms, 5(4):662–681, 2007.
  29. A system for pattern matching applications on biosequences.
    G. Mehldau and G. Myers. CABIOS, 9(3):299–314, 1993.
  30. Structured motifs search.
    M. Morgante, A. Policriti, N. Vitacolonna, and A. Zuccolo. J. Comput. Bio., 12(8):1065–1082, 2005.
  31. Approximate matching of network expressions with spacers.
    E. Myers. J. Comput. Bio., 3(1):33–51, 1996.
  32. Indexing methods for approximate string matching.
    G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. IEEE Data Eng. Bull., 24(4):19–27, 2001.
  33. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching.
    G. Navarro and M. Raffinot. J. Comput. Bio., 10(6):903–923, 2003.
  34. Finding patterns with variable length gaps or don’t cares.
    M. S. Rahman, C. S. Iliopoulos, I. Lee, M. Mohamed, and W. F. Smyth. In Proc. 12th COCOON, pages 146–155, 2006.
  35. Efficient approximate and dynamic matching of patterns using a labeling paradigm.
    S. Sahinalp and U. Vishkin. In Proc. 37th FOCS, pages 320–328, 1996.
  36. Succinct text indexing with wildcards.
    A. Tam, E. Wu, T. Lam, and S. Yiu. In Proc. 16th SPIRE, pages 39–50, 2009.
  37. Fast index for approximate string matching.
    D. Tsur. J. Disc. Algorithms, 8(4):339–345, 2010.
  38. Linear pattern matching algorithms.
    P. Weiner. In Proc. 14th SWAT, pages 1–11, 1973.