A Dichotomy for Regular Expression Membership Testing

Karl Bringmann Max Planck Institute for Informatics, Saarland Informatics Campus. Email: kbringma@mpi-inf.mpg.de    Allan Grønlund Aarhus University. Email: jallan@cs.au.dk. Supported by Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation, grant DNRF84.    Kasper Green Larsen Aarhus University. Email: larsen@cs.au.dk. Supported by Center for Massive Data Algorithmics, a Center of the Danish National Research Foundation, grant DNRF84, a Villum Young Investigator Grant and an AUFF Starting Grant.
Abstract

We study regular expression membership testing: Given a regular expression of size $m$ and a string of size $n$, decide whether the string is in the language described by the regular expression. Its classic $O(nm)$ algorithm is one of the big success stories of the 70s, which allowed pattern matching to develop into the standard tool that it is today.

Many special cases of pattern matching have been studied that can be solved faster than in quadratic time. However, a systematic study of tractable cases was made possible only recently, with the first conditional lower bounds reported by Backurs and Indyk [FOCS’16]. Restricted to any “type” of homogeneous regular expressions of depth 2 or 3, they either presented a near-linear time algorithm or a quadratic conditional lower bound, with one exception known as the Word Break problem.

In this paper we complete their work as follows:

  • We present two almost-linear time algorithms that generalize all known almost-linear time algorithms for special cases of regular expression membership testing.

  • We classify all types, except for the Word Break problem, into almost-linear time or quadratic time assuming the Strong Exponential Time Hypothesis. This extends the classification from depth 2 and 3 to any constant depth.

  • For the Word Break problem we give an improved $\tilde O(nm^{1/3} + m)$ algorithm. Surprisingly, we also prove a matching conditional lower bound for combinatorial algorithms. This establishes Word Break as the only intermediate problem.

In total, we prove matching upper and lower bounds for any type of bounded-depth homogeneous regular expressions, which yields a full dichotomy for regular expression membership testing.

1 Introduction

A regular expression is a term involving an alphabet $\Sigma$ and the operations concatenation $\circ$, union $\mid$, Kleene's star $\ast$, and Kleene's plus $+$, see Section 2. In regular expression membership testing, we are given a regular expression $R$ and a string $s$, and want to decide whether $s$ is in the language described by $R$. In regular expression pattern matching, we instead want to decide whether any substring of $s$ is in the language described by $R$. A big success story of the 70s was to show that both problems have $O(nm)$ time algorithms [17], where $n$ is the length of the string and $m$ is the size of the regular expression. This quite efficient running time, coupled with the great expressiveness of regular expressions, made pattern matching the standard tool that it is today.
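To make the classic bound concrete, here is a minimal sketch (ours, not a verbatim rendering of [17]) of the standard method: compile the expression into a Thompson-style NFA with $O(m)$ states, then advance a set of active states once per text symbol, for $O(nm)$ time overall. The AST encoding and all function names are hypothetical.

    def compile_nfa(node, eps, sym):
        # States are integers; eps[i] lists the epsilon-successors of state i,
        # sym[i] is (c, j) meaning "state i goes to j on symbol c", or None.
        def new():
            eps.append([]); sym.append(None)
            return len(eps) - 1
        kind = node[0]
        if kind == 'sym':                        # leaf: a single alphabet symbol
            s, t = new(), new()
            sym[s] = (node[1], t)
            return s, t
        if kind == 'cat':                        # concatenation of children
            s, t = compile_nfa(node[1][0], eps, sym)
            for child in node[1][1:]:
                s2, t2 = compile_nfa(child, eps, sym)
                eps[t].append(s2)
                t = t2
            return s, t
        if kind == 'union':                      # union of children
            s, t = new(), new()
            for child in node[1]:
                s2, t2 = compile_nfa(child, eps, sym)
                eps[s].append(s2); eps[t2].append(t)
            return s, t
        s2, t2 = compile_nfa(node[1], eps, sym)  # 'star' / 'plus': one child
        s, t = new(), new()
        eps[s].append(s2); eps[t2].append(t); eps[t2].append(s2)
        if kind == 'star':                       # star also accepts the empty string
            eps[s].append(t)
        return s, t

    def closure(states, eps):                    # epsilon-closure, O(m) per call
        stack, seen = list(states), set(states)
        while stack:
            for u in eps[stack.pop()]:
                if u not in seen:
                    seen.add(u); stack.append(u)
        return seen

    def matches(ast, text):                      # O(n * m) overall
        eps, sym = [], []
        start, accept = compile_nfa(ast, eps, sym)
        cur = closure({start}, eps)
        for c in text:
            step = {sym[v][1] for v in cur if sym[v] and sym[v][0] == c}
            cur = closure(step, eps)
        return accept in cur

    # (a|b)* a b b   matches "abababb" but not "abab"
    ast = ('cat', [('star', ('union', [('sym', 'a'), ('sym', 'b')])),
                   ('sym', 'a'), ('sym', 'b'), ('sym', 'b')])
    assert matches(ast, 'abababb') and not matches(ast, 'abab')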

Despite the efficient running time of $O(nm)$, it would be desirable to have even faster algorithms. A large body of work in the pattern matching community was devoted to this goal, improving the running time by logarithmic factors [15, 5] and even to near-linear for certain special cases [14, 7, 2].

A systematic study of the complexity of various special cases of pattern matching and membership testing was made possible by the recent advances in the field of conditional lower bounds, where tight running time lower bounds are obtained via fine-grained reductions from certain core problems like satisfiability, all-pairs shortest paths, or 3SUM (see, e.g., [9, 22, 1, 16]). Many of these conditional lower bounds are based on the Strong Exponential Time Hypothesis (SETH) [12], which asserts that $k$-satisfiability has no $O(2^{(1-\varepsilon)n})$ time algorithm for any $\varepsilon > 0$ and all $k \geq 3$.

The first conditional lower bounds for pattern matching problems were presented by Backurs and Indyk [4]. Viewing a regular expression as a tree where the inner nodes are labeled by $\circ$, $\mid$, $\ast$, and $+$, and the leafs are labeled by alphabet symbols, they call a regular expression homogeneous of type $t = (t_1, \ldots, t_d)$ if in each level $i$ of the tree all inner nodes have type $t_i$, and the depth of the tree is at most $d$. Note that leafs may appear in any level, and the degrees are unbounded. This gives rise to the natural restrictions $t$-pattern matching and $t$-membership, where we require the regular expression to be homogeneous of type $t$. The main result of Backurs and Indyk [4] is a characterization of $t$-pattern matching for all types $t$ of depth at most 3: For each such problem they either design a near-linear time algorithm or show a quadratic lower bound based on SETH. We observed that the results by Backurs and Indyk actually yield a classification for all types $t$, not only for depth at most 3. This is not explicitly stated in [4], so for completeness we prove it in this paper, see Appendix A. This closes the case for $t$-pattern matching.

For $t$-membership, Backurs and Indyk also prove a classification into near-linear time and "SETH-hard" for depth at most 3, with the only exception being $+\mid\circ$-membership. The latter problem is also known as the Word Break problem, since it can be rephrased as follows: Given a string $s$ and a dictionary $D$, can $s$ be split into words contained in $D$? Indeed, a regular expression of type $\circ$ represents a string, so a regular expression of type $\mid\circ$ represents a dictionary, and type $+\mid\circ$ then asks whether a given string can be split into dictionary words. A relatively easy algorithm solves the Word Break problem in randomized time $\tilde O(n \sqrt{m})$, which Backurs and Indyk improved to randomized time $\tilde O(n m^{1/2 - 1/18})$. Thus, the Word Break problem is the only studied special case of membership testing (or pattern matching) for which no near-linear time algorithm or quadratic-time hardness is known. In particular, no other special case is "intermediate", i.e., in between near-linear and quadratic running time. Besides the status of Word Break, Backurs and Indyk also leave open a classification for depth larger than 3.

1.1 Our Results

In this paper, we complete the classification started by Backurs and Indyk [4] to a full dichotomy for any constant depth $d$. In particular, we (conditionally) establish Word Break as the only intermediate problem for (bounded-depth homogeneous) regular expression membership testing. More precisely, our results are as follows.

Word Break Problem.

We carefully study the only depth-3 problem left unclassified by Backurs and Indyk. In particular, we improve Backurs and Indyk's randomized $\tilde O(n m^{1/2 - 1/18})$ algorithm to a deterministic $\tilde O(n m^{1/3} + m)$ algorithm.

Theorem 1.

The Word Break problem can be solved in time $\tilde O(n m^{1/3} + m)$.

We remark that often running times of the form $\tilde O(n \sqrt{m})$ stem from a tradeoff of two approaches to a problem. Analogously, our time $\tilde O(n m^{1/3})$ stems from trading off three approaches.

Very surprisingly, we also prove a matching conditional lower bound. Our result only holds for combinatorial algorithms, which is a notion without an agreed upon definition, intuitively meaning that we forbid impractical algorithms such as fast matrix multiplication. We use the following hypothesis; a slightly weaker version has also been used in [3] for context-free grammar parsing. Recall that the $k$-Clique problem has a trivial $O(n^k)$ time algorithm, an $O(n^k / \log^{k-1} n)$ combinatorial algorithm [18], and all known faster algorithms use fast matrix multiplication [8].

Conjecture 1.

For all $k \geq 3$, any combinatorial algorithm for $k$-Clique takes time $n^{k - o(1)}$.

We provide a (combinatorial) reduction from $k$-Clique to the Word Break problem, showing:

Theorem 2.

Assuming Conjecture 1, the Word Break problem has no combinatorial algorithm running in time $O(n m^{1/3 - \varepsilon})$ for any $\varepsilon > 0$.

This is a surprising result for multiple reasons. First, $n m^{1/3}$ is a very uncommon time complexity; specifically, we are not aware of any other problem where the fastest known algorithm has this running time. Second, it shows that the Word Break problem is an intermediate problem for $t$-membership, as it is neither solvable in almost-linear time nor does it need quadratic time. Our results below show that the Word Break problem is, in fact, the only intermediate problem for $t$-membership, which is quite fascinating.

We leave it as an open problem to prove a matching lower bound without the assumption of "combinatorial". Related to this question, note that the currently fastest algorithm for 4-Clique is based on fast rectangular matrix multiplication and runs in time $O(n^{3.257})$ [8, 10]. If this bound is close to optimal, then we can still establish Word Break as an intermediate problem (without any restriction to combinatorial algorithms).

Theorem 3.

For any $\varepsilon > 0$, if 4-Clique has no $O(n^{4 - \varepsilon})$ algorithm, then Word Break has no $O(n m^{1/3 - \varepsilon'})$ algorithm for $\varepsilon' = \varepsilon / 3$.

We remark that this situation of having matching conditional lower bounds only for combinatorial algorithms is not uncommon, see, e.g., Sliding Window Hamming Distance [6].

New Almost-Linear Time Algorithms.

We establish two more types for which the membership problem is in almost-linear time.

Theorem 4.

We design a deterministic almost-linear time algorithm for $\mid\circ\ast\mid$-membership and an almost-linear expected time algorithm for $\mid\circ+\mid$-membership. These algorithms also work for $t$-membership for any subsequence $t$ of $\mid\circ\ast\mid$ or $\mid\circ+\mid$, respectively.

This generalizes all previously known almost-linear time algorithms for any $t$-membership problem, as all such types are proper subsequences of $\mid\circ\ast\mid$ or $\mid\circ+\mid$. Moreover, no further generalization of our algorithms is possible, as shown below.

Dichotomy.

We enhance the classification of $t$-membership started by Backurs and Indyk for depth at most 3 to a complete dichotomy for all types $t$. To this end, we first establish the following simplification rules.

Lemma 1.

For any type $t$, applying any of the following rules yields a type $t'$ such that $t$-membership and $t'$-membership are equivalent under linear-time reductions:

  1. replace any substring $pp$, for any $p \in \{\circ, \mid, \ast, +\}$, by $p$,

  2. replace any substring $\ast+$ or $+\ast$ by $\ast$,

  3. replace a prefix $\ast$ by $+$, for any type starting with $\ast$.

We say that $t$-membership simplifies if one of these rules applies to $t$. Applying these rules in any order will eventually lead to an unsimplifiable type.
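For illustration, the rules can be applied mechanically until no rule fires anymore. The following toy helper (ours; it assumes the three rules exactly as stated above, with 'o' standing for $\circ$) computes an unsimplifiable type:

    def simplify(t: str) -> str:
        changed = True
        while changed:
            changed = False
            for p in "o|*+":                  # rule 1: pp -> p
                while p + p in t:
                    t = t.replace(p + p, p); changed = True
            for pat in ("*+", "+*"):          # rule 2: mixed star/plus -> *
                if pat in t:
                    t = t.replace(pat, "*"); changed = True
            if t.startswith("*"):             # rule 3: leading * behaves like +
                t = "+" + t[1:]; changed = True
        return t

    print(simplify("**|oo"))                  # prints "+|o", the Word Break type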

We show the following dichotomy. Note that we do not have to consider simplifying types, as they are equivalent to some unsimplifiable type.

Theorem 5.

For any type $t$ one of the following holds:

  • $t$-membership simplifies,

  • $t$ is a subsequence of $\mid\circ\ast\mid$ or $\mid\circ+\mid$, and thus $t$-membership is in almost-linear time (by Theorem 4),

  • $t = +\mid\circ$, and thus $t$-membership is the Word Break problem taking time $n m^{1/3 \pm o(1)}$ (by Theorems 1 and 2, assuming Conjecture 1), or

  • $t$-membership takes time $(nm)^{1 - o(1)}$, assuming SETH.

This yields a complete dichotomy for any constant depth $d$. We discussed the algorithmic results and the results for Word Break before. Regarding the hardness results, Backurs and Indyk [4] gave SETH-hardness proofs for $t$-membership on the types $\circ\mid\circ$, $\circ\mid\ast$, $\circ\mid+$, $\circ\ast\circ$, and $\circ+\circ$. We provide further SETH-hardness for the types $\mid\circ\mid\circ$, $+\mid\circ\mid$, and $\mid+\mid\circ$. To get from these (hard) core types to all remaining hard types, we would like to argue that all hard types contain one of the core types as a subsequence and thus are at least as hard. However, arguing about subsequences fails in general, since the definition of "homogeneous with type $t$" does not allow us to leave out layers. This makes it necessary to proceed in a more ad-hoc way.

In summary, we provide matching upper and lower bounds for any type of bounded-depth homogeneous regular expressions, which yields a full dichotomy for the membership problem.

1.2 Organization

The paper is organized as follows. We start with preliminaries in Section 2. For the Word Break problem, we prove the conditional lower bounds in Section 3, followed by the matching upper bound in Section 4. In Section 5 we present our new almost-linear time algorithms for two types, and in Section 6 we prove SETH-hardness for three types. Finally, in Section 7 we prove that our results yield a full dichotomy for homogeneous regular expression membership testing.

2 Preliminaries

A regular expression is a tree with leafs labelled by symbols in an alphabet $\Sigma$ and inner nodes labelled by $\circ$ (at least one child), $\mid$ (at least one child), $\ast$ (exactly one child), or $+$ (exactly one child). [Footnote 1: All our algorithms work in the general case where $\circ$ and $\mid$ may have degree 1. For the conditional lower bounds, it may be unnatural to allow degree 1 for these operations. If we restrict to degrees at least 2, it is possible to adapt our proofs to prove the same results, but this is tedious and we think that the required changes would be obscuring the overall point. We discuss this issue in more detail when it comes up, see footnote 2.] The language $L(R)$ described by the regular expression $R$ is recursively defined as follows. A leaf labelled by $c \in \Sigma$ describes the language $\{c\}$, consisting of one word of length 1. Consider an inner node $v$ with children $R_1, \ldots, R_\ell$. If $v$ is labelled by $\circ$ then it describes the language $L(R_1) \circ \cdots \circ L(R_\ell) = \{w_1 w_2 \cdots w_\ell \mid w_i \in L(R_i)\}$, i.e., all concatenations of strings in the children's languages. If $v$ is labelled by $\mid$ then it describes the language $L(R_1) \cup \cdots \cup L(R_\ell)$. If $v$ is labelled $\ast$, then its degree must be 1 and it describes the language $L(R_1)^* = \{w_1 \cdots w_r \mid r \geq 0,\ w_1, \ldots, w_r \in L(R_1)\}$, and if $v$ is labelled $+$ then the same statement holds with "$r \geq 0$" replaced by "$r \geq 1$". We say that a string $s$ matches a regular expression $R$ if $s$ is in the language described by $R$.

We use the following definition from [4]. We let $\{\circ, \mid, \ast, +\}^*$ be the set of all finite sequences over $\{\circ, \mid, \ast, +\}$; we also call this the set of types. For any type $t$ we denote its length by $|t|$ and its $i$-th entry by $t_i$. We say that a regular expression $R$ is homogeneous of type $t$ if it has depth at most $|t|$ (i.e., any inner node has level in $\{1, \ldots, |t|\}$), and for any $i$, any inner node in level $i$ is labelled by $t_i$. We also say that the type of any inner node at level $i$ is $t_i$. This does not restrict the appearance of leafs in any level.

Definition 1.

A linear-time reduction from $t$-membership to $t'$-membership is an algorithm that, given a regular expression $R$ of type $t$ and size $m$ and a string $s$ of length $n$, in total time $O(n + m)$ outputs a regular expression $R'$ of type $t'$ and size $O(n + m)$, and a string $s'$ of length $O(n + m)$, such that $s$ matches $R$ if and only if $s'$ matches $R'$.

The Strong Exponential Time Hypothesis (SETH) was introduced by Impagliazzo, Paturi, and Zane [13] and is defined as follows.

Conjecture 2.

For no $\varepsilon > 0$, $k$-SAT can be solved in time $O(2^{(1 - \varepsilon) n})$ for all $k$, where $n$ is the number of variables.

Very often it is easier to show SETH-hardness based on the intermediate problem Orthogonal Vectors (OV): Given two sets $A, B$ of $d$-dimensional 0/1-vectors with $|A| = |B| = n$, determine if there exist vectors $a \in A$ and $b \in B$ such that $\sum_{\ell=1}^{d} a_\ell \cdot b_\ell = 0$. The following OV conjecture follows from SETH [21].

Conjecture 3.

For any $\varepsilon > 0$ there is no algorithm for OV that runs in time $O(n^{2 - \varepsilon} \operatorname{poly}(d))$.
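For reference, the baseline that Conjecture 3 asserts to be essentially optimal is the obvious double loop (a sketch):

    def has_orthogonal_pair(A, B):
        # O(n^2 * d) time: test every pair a in A, b in B for orthogonality
        return any(all(x * y == 0 for x, y in zip(a, b))
                   for a in A for b in B)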

To start off the proof for the dichotomy, we have the following hardness results from [4].

Theorem 6.

For any type $t$ among $\circ\mid\circ$, $\circ\mid\ast$, $\circ\mid+$, $\circ\ast\circ$, and $\circ+\circ$, any algorithm for $t$-membership takes time $(nm)^{1 - o(1)}$ unless SETH fails.

3 Conditional Lower Bound for Word Break

In this section we prove our conditional lower bounds for the Word Break problem, Theorems 2 and 3. Both theorems follow from the following reduction.

Theorem 7.

For any $k \geq 3$, given a $k$-Clique instance on $n$ vertices, we can construct an equivalent Word Break instance on a string of length $O(k\, n^{k-1})$ and a dictionary of total size $O(n^3)$. The reduction is combinatorial and runs in linear time in the output size.

First let us see how this reduction implies Theorems 2 and 3.

Proof of Theorem 2.

Suppose for the sake of contradiction that Word Break can be solved combinatorially in time $O(n m^{1/3 - \varepsilon})$ for some $\varepsilon > 0$. Then our reduction yields a combinatorial algorithm for $k$-Clique in time $O(k\, n^{k-1} \cdot (n^3)^{1/3 - \varepsilon}) = O(k\, n^{k - 3\varepsilon})$, contradicting Conjecture 1 for any constant $k \geq 3$. This proves the theorem. ∎

Proof of Theorem 3.

Assuming that 4-Clique has no $O(n^{4 - \varepsilon})$ algorithm for some $\varepsilon > 0$, we want to show that Word Break has no $O(n m^{1/3 - \varepsilon/3})$ algorithm.

Setting $k = 4$ in the above reduction yields a string and a dictionary, both of size $O(n^3)$ (which can be padded to the same size). Thus, an algorithm for Word Break with running time $O(n m^{1/3 - \varepsilon/3})$ would yield an algorithm for 4-Clique in time $O(n^3 \cdot (n^3)^{1/3 - \varepsilon/3}) = O(n^{4 - \varepsilon})$, contradicting the assumption. ∎

It remains to prove Theorem 7. Let $G = (V, E)$ be an $n$-node graph in which we want to determine whether there is a $k$-clique. The main idea of our reduction is to construct a gadget that for any $(k-2)$-clique $C$ can determine whether there are two nodes $u, v$ such that $(u, v) \in E$ and both $u$ and $v$ are connected to all nodes in $C$, i.e., such that $C \cup \{u, v\}$ forms a $k$-clique in $G$. For intuition, we first present a simplified version of our gadgets and then show how to modify them to obtain the final reduction.

Simplified Neighborhood Gadget.

Given a $(k-2)$-clique $C = \{c_1, \ldots, c_{k-2}\}$, the purpose of our first gadget is to test whether there is a node $u$ that is connected to all nodes in $C$. Assume the nodes in $V$ are identified with the numbers $1, \ldots, n$. The alphabet $\Sigma$ over which we construct strings has a symbol $v$ for each node $v \in V$. Furthermore, we assume $\Sigma$ has special symbols $\#$ and $\$$. The simplified neighborhood gadget for $C$ has the text $T_C$ being

$$T_C = \#\,\$^n\, c_1\, \$^n\, c_2 \,\cdots\, \$^n\, c_{k-2}\, \$^n\,\#,$$

and the dictionary contains for every edge $(u, c) \in E$, the string

$$\$^u\, c\, \$^{n-u},$$

and for every node $u \in V$, the two strings

$$\#\,\$^{n-u}$$

and

$$\$^u\,\#.$$

The idea of the above construction is as follows: Assume we want to break $T_C$ into words. The crucial observation is that to match the prefix of $T_C$ using $D$, we have to start with $\#\,\$^{n-u}$ for some node $u$. The only way we can possibly match the following part $\$^u\, c_1$ is if $D$ has the string $\$^u\, c_1\, \$^{n-u}$. But this is the case if and only if $(u, c_1) \in E$, i.e., $u$ is a neighbor of $c_1$. If indeed this is the case, we have now matched the prefix $\$^{n-u}$ of the next block. This means that we can still only use strings starting with $\$^u$ from $D$. Repeating this argument for all of $c_2, \ldots, c_{k-2}$, we conclude that we can break $T_C$ into words from $D$ if and only if there is some node $u$ that is a neighbor of every node in $C$.
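The following sketch builds one concrete instantiation of this gadget. The exact strings are our reconstruction of the scheme described above (node symbols are rendered as "(v)" so that the alphabet stays textual); it is meant to illustrate the offset mechanism, not to reproduce the paper's strings verbatim.

    def neighborhood_gadget(C, n, adj):
        # adj: set of ordered pairs (u, c) such that {u, c} is an edge of G
        text = "#" + "".join("$" * n + f"({c})" for c in C) + "$" * n + "#"
        D = set()
        for u in range(1, n + 1):
            D.add("#" + "$" * (n - u))    # start: commit to candidate node u
            D.add("$" * u + "#")          # end: close the match at offset u
        for (u, c) in adj:
            D.add("$" * u + f"({c})" + "$" * (n - u))   # cross one $-block
        return text, D

Any break of the text into dictionary words is forced to keep a single offset $u$ throughout, so a break exists if and only if some node $u$ is adjacent to all of $C$.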

Simplified -Clique Gadget.

With our neighborhood gadget in mind, we now describe the main ideas of our gadget that for a given $(k-2)$-clique $C$ can test whether there are two nodes $u, v$ such that $(u, v) \in E$ and $u$ and $v$ are both connected to all nodes of $C$, i.e., such that $C \cup \{u, v\}$ forms a $k$-clique.

Let $T_C$ denote the text used in the neighborhood gadget for $C$, i.e.,

$$T_C = \#\,\$^n\, c_1\, \$^n\, c_2 \cdots \$^n\, c_{k-2}\, \$^n\,\#.$$

Our $k$-clique gadget for $C$ has the following text $G_C$:

$$G_C = T_C\; \&\; T_C,$$

where $\&$ is a special symbol in $\Sigma$. The dictionary has the strings mentioned in the neighborhood gadget, as well as the string

$$\$^u\,\#\,\&\,\#\,\$^{n-v}$$

for every edge $(u, v) \in E$. The idea of this gadget is as follows: Assume we want to break $G_C$ into words. We have to start using the dictionary string $\#\,\$^{n-u}$ for some node $u$. For such a candidate node $u$, we can match the prefix

$$\#\,\$^n\, c_1 \cdots \$^n\, c_{k-2}\,\$^n$$

of $G_C$ if and only if $u$ is a neighbor of every node in $C$. Furthermore, the only way to match this prefix (if we start with $\#\,\$^{n-u}$) covers precisely the part:

$$\#\,\$^n\, c_1 \cdots \$^n\, c_{k-2}\,\$^{n-u}.$$

Thus if we want to also match the $\&$, we can only use strings

$$\$^u\,\#\,\&\,\#\,\$^{n-v}$$

for an edge $(u, v) \in E$. Finally, by the second neighborhood gadget, we can match the whole string $G_C$ if and only if there are some nodes $u, v$ such that $u$ is a neighbor of every node in $C$ (we can match the first $T_C$), and $(u, v) \in E$ (we can match the $\&$), and $v$ is a neighbor of every node in $C$ (we can match the second $T_C$), i.e., $C \cup \{u, v\}$ forms a $k$-clique.

Combining it all.

The above gadget allows us to test for a given $(k-2)$-clique $C$ whether there are two nodes $u, v$ we can add to $C$ to get a $k$-clique. Thus, our next step is to find a way to combine such gadgets for all the $(k-2)$-cliques in the input graph. The challenge is to compute an OR over all of them, i.e., to test whether at least one $(k-2)$-clique can be extended to a $k$-clique. For this, our idea is to replace every symbol in the above constructions with 3 symbols and then carefully concatenate the gadgets. When we start matching the text against the dictionary, we are matching against the first symbol of the first $(k-2)$-clique gadget, i.e., we start at an offset of zero. We want to add strings to the dictionary that always allow us to match a clique gadget if we have an offset of zero. These strings will then leave us at offset zero in the next gadget. Next, we will add a string that allows us to change from offset zero to offset one. We will then ensure that if we have an offset of one when starting to match a $(k-2)$-clique gadget, we can only match it if that clique can be extended to a $k$-clique. If so, we ensure that we will start at an offset of two in the next gadget. Next, we will also add strings to the dictionary that allow us to match any gadget if we start at an offset of two, and these strings will ensure we continue to have an offset of two. Finally, we append symbols at the end of the text that can only be matched if we have an offset of two after matching the last gadget. To summarize: Any breaking of the text into words will start by using an offset of zero and simply skipping over $(k-2)$-cliques that cannot be extended to a $k$-clique. Then once a proper $(k-2)$-clique is found, a string of the dictionary is used to change the start offset from zero to one. Finally, the clique gadget is matched and leaves us at an offset of two, after which the remaining string is matched while maintaining the offset of two.

We now give the final details of the reduction. Let $G = (V, E)$ be the $n$-node input graph to $k$-Clique. We do as follows:

  1. Start by iterating over every set of $k-2$ nodes in $V$. For each such set of nodes, test whether they form a $(k-2)$-clique in $O(k^2)$ time. Add each found $(k-2)$-clique to a list $L$.

  2. Let $\#$ and $\$$ be special symbols in the alphabet. For a string $x = x_1 x_2 \cdots x_\ell$, let

    $$\operatorname{tr}(x) = x_1 x_1 x_1\; x_2 x_2 x_2 \cdots x_\ell x_\ell x_\ell$$

    denote the string obtained from $x$ by replacing every symbol with 3 copies of itself. For each node $u \in V$, add the following two strings to the dictionary $D$:

    $$\operatorname{tr}(\#\,\$^{n-u}) \quad\text{and}\quad \operatorname{tr}(\$^u\,\#).$$

  3. For each edge $(u, v) \in E$, add the following two strings to the dictionary:

    $$\operatorname{tr}(\$^u\, v\, \$^{n-u}) \quad\text{and}\quad \operatorname{tr}(\$^u\,\#\,\&\,\#\,\$^{n-v}).$$

  4. For each symbol $x$ amongst $\#$, $\$$, $\&$, and the node symbols, add the following two strings to $D$: the string $xxx$ and a copy of it shifted by two symbols.

    Intuitively, the first of these strings is used for skipping a gadget if we have an offset of zero, and the second is used for skipping a gadget if we have an offset of two.

  5. Also add three offset-changing strings to the dictionary. The first is intuitively used for changing from an offset of zero to an offset of one (to begin matching a clique gadget), the second is used for changing from an offset of one to an offset of two in case a clique gadget could be matched, and the last string is used for matching the end of $T$ if an offset of two has been achieved.

  6. We are finally ready to describe the text $T$. For a $(k-2)$-clique $C \in L$, let $T_C$ be the neighborhood gadget from above, i.e.,

    $$T_C = \#\,\$^n\, c_1\, \$^n\, c_2 \cdots \$^n\, c_{k-2}\, \$^n\,\#.$$

    For each $C \in L$ (in some arbitrary order), we append the string

    $$\operatorname{tr}(G_C) = \operatorname{tr}(T_C\,\&\,T_C)$$

    to the text $T$. Finally, once all these strings have been appended, append another two $\#$'s to $T$. That is, the text is:

    $$T = \operatorname{tr}(G_{C_1})\, \operatorname{tr}(G_{C_2}) \cdots \operatorname{tr}(G_{C_{|L|}})\,\#\#, \qquad \text{for } L = \{C_1, \ldots, C_{|L|}\}.$$

We want to show that the text $T$ can be broken into words from the dictionary $D$ iff there is a $k$-clique in the input graph. Assume first there is a $k$-clique $K$ in $G$. Let $C$ be an arbitrary subset of $k-2$ nodes from $K$. Since these form a $(k-2)$-clique, it follows that $T$ has the substring $\operatorname{tr}(G_C)$. To match $T$ using $D$, do as follows: For each gadget preceding $\operatorname{tr}(G_C)$ in $T$, keep using the offset-zero strings from step 4 above to match. This allows us to match everything preceding $\operatorname{tr}(G_C)$ in $T$. Then use the first offset-changing string from step 5 to match the beginning of $\operatorname{tr}(G_C)$. Now let $u$ and $v$ be the two nodes in $K \setminus C$. Use the string $\operatorname{tr}(\#\,\$^{n-u})$ to match the next part of $\operatorname{tr}(G_C)$. Then since $K$ is a $k$-clique, we have the string $\operatorname{tr}(\$^u\, c\, \$^{n-u})$ in the dictionary for every $c \in C$. Use these strings for each of $c_1, \ldots, c_{k-2}$. Again, since $K$ is a $k$-clique, we also have the edge $(u, v)$. Thus we can use the string

$$\operatorname{tr}(\$^u\,\#\,\&\,\#\,\$^{n-v})$$

to match across the $\&$ in $\operatorname{tr}(G_C)$. We then repeat the argument for $v$ and repeatedly use the strings $\operatorname{tr}(\$^v\, c\, \$^{n-v})$ to match the second $T_C$. We finish by using the string $\operatorname{tr}(\$^v\,\#)$ followed by the second offset-changing string from step 5. We are now at an offset where we can repeatedly use the offset-two strings from step 4 to match across all remaining gadgets. Finally, we can finish the match by using the last string from step 5 after the last gadget.

For the other direction, assume it is possible to break $T$ into words from $D$. By construction, the last word used has to be the final string from step 5. Now follow the matching backwards until a string not of the form of the offset-two strings from step 4 was used. This must happen eventually since $T$ starts with $\operatorname{tr}(\#)$. We are now at a position in $T$ where the suffix can be matched by repeatedly using offset-two strings, and then ending with the final string from step 5. By construction, $T$ has the end of some gadget $\operatorname{tr}(G_C)$ just before this suffix, for some $(k-2)$-clique $C \in L$. The only string in $D$ that could match this part without being an offset-two string is the second offset-changing string from step 5. It follows that we must be at the end of some substring $\operatorname{tr}(G_C)$ and that this string was used for matching the last $\#$. To match the preceding part in the last $T_C$, we must have used a string $\operatorname{tr}(\$^v\,\#)$ for some node $v$. The only strings that can be used preceding this are strings of the form $\operatorname{tr}(\$^v\, c\, \$^{n-v})$. Since we have matched the second $T_C$, it follows that $(v, c)$ is in $E$ for every $c \in C$. Having traced back the match across the last $T_C$ in $\operatorname{tr}(G_C)$, let $u$ be the node such that the string $\operatorname{tr}(\$^u\,\#\,\&\,\#\,\$^{n-v})$ was used to match the $\&$. It follows that we must have $(u, v) \in E$. Tracing the matching through the first $T_C$ in $\operatorname{tr}(G_C)$, we conclude that we must also have $(u, c) \in E$ for every $c \in C$. This establishes that $C \cup \{u, v\}$ forms a $k$-clique in $G$.

Finishing the proof.

From the input graph $G$, we constructed the Word Break instance in time $O(k^2 n^{k-2})$ plus the time needed to output the text and the dictionary. For every edge $(u, v) \in E$, we added two strings to $D$, both of length $O(n)$. Furthermore, $D$ has two length-$O(n)$ strings for each node and another $O(n)$ strings of constant length. Thus the total length of the strings in $D$ is $O(n^3)$. The text $T$ has the substring $\operatorname{tr}(G_C)$, of length $O(kn)$, for every $(k-2)$-clique $C$ of $G$. Thus $T$ has length $O(k\, n^{k-1}) = O(n^{k-1})$ (assuming $k$ is constant). The entire reduction takes time $O(n^{k-1} + n^3)$ for constant $k$. This finishes the reduction and proves Theorem 7.

4 Algorithm for Word Break

In this section we present an algorithm for the Word Break problem, proving Theorem 1. Our algorithm uses many ideas of the randomized algorithm by Backurs and Indyk [4]; in fact, it can be seen as a cleaner execution of their main ideas. Recall that in the Word Break problem we are given a set of strings $D$ (the dictionary) and a string $s$ (the text), and we want to decide whether $s$ can be ($D$-)partitioned, i.e., whether we can write $s = s_1 s_2 \cdots s_\ell$ such that $s_i \in D$ for all $i$. We denote the length of $s$ by $n$ and the total size of $D$ by $m$.

We say that we can ($D$-)jump from $j$ to $i$ if the substring $s[j+1..i]$ is in $D$. Note that if $s[1..j]$ can be partitioned and we can jump from $j$ to $i$, then also $s[1..i]$ can be partitioned. Moreover, $s[1..i]$ can be partitioned if and only if there exists $0 \leq j < i$ such that $s[1..j]$ can be partitioned and we can jump from $j$ to $i$. For any power of two $q$, we let $D_q := \{w \in D : q \leq |w| < 2q\}$.

In the algorithm we want to compute the set $S$ of all indices $i$ such that $s[1..i]$ can be partitioned (where $0 \in S$, since the empty string can be partitioned). The trivial algorithm computes $S$ one index at a time, by checking for each $i$ whether for some string $w$ in the dictionary we have $s[i-|w|+1..i] = w$ and $i - |w| \in S$, since then we can extend the existing partitioning of $s[1..i-|w|]$ by the string $w$ to a partitioning of $s[1..i]$.
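In code, the trivial algorithm is the following dynamic program (a sketch; the substring comparisons make it run in time $O(nm)$):

    def can_partition_naive(s, D):
        n = len(s)
        S = [False] * (n + 1)
        S[0] = True                        # the empty prefix is partitionable
        for i in range(1, n + 1):
            S[i] = any(len(w) <= i and S[i - len(w)]
                       and s[i - len(w):i] == w for w in D)
        return S[n]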

In our algorithm, when we have computed the set $S \cap \{0, \ldots, j\}$, we want to compute all possible "jumps" from a point at most $j$ to a point after $j$ using dictionary words with length in $[q, 2q)$ (for any power of two $q$). This gives rise to the following query problem.

Lemma 2.

On dictionary $D$ and string $s$, consider the following queries:

  • Jump-Query: Given a power of two $q$, an index $j$ in $s$, and a set $P \subseteq \{j - q + 1, \ldots, j\}$, compute the set of all $i$ such that we can $D_q$-jump from some $j' \in P$ to $i$.

We can preprocess $D$ and $s$ in time $\tilde O(n + m)$ such that queries of the above form can be answered in time $\tilde O(\min\{q^2,\; q + \sqrt{q \cdot m_q}\})$, where $m_q$ is the total size of $D_q$ and $\tilde O$ hides factors logarithmic in $n$ and $m$.

Before we prove that jump-queries can be answered in the claimed running time, let us show that this implies an $\tilde O(n m^{1/3} + m)$-time algorithm for the Word Break problem.

Proof of Theorem 1.

The algorithm works as follows. After initializing $S = \{0\}$, we iterate over $j = 0, 1, \ldots, n$. For any such $j$, and any power of two $q$ dividing $j$, define $P := S \cap \{j - q + 1, \ldots, j\}$. Solve a jump-query on $(q, j, P)$ to obtain a set $R$, and set $S := S \cup R$.

To show correctness of the resulting set $S$, we have to show that $i$ is in $S$ if and only if $s[1..i]$ can be partitioned. Note that whenever we add $i$ to $S$ then $s[1..i]$ can be partitioned, since this only happens when there is a jump to $i$ from some $j' \in S$, $j' < i$, which inductively yields a partitioning of $s[1..i]$. For the other direction, we have to show that whenever $s[1..i]$ can be partitioned then we eventually add $i$ to $S$. This is trivially true for the empty string ($i = 0$). For any $i > 0$ such that $s[1..i]$ can be partitioned, consider any $j'$ such that $s[1..j']$ can be partitioned and we can jump from $j'$ to $i$. Round down the jump length $i - j'$ to a power of two $q$, i.e., $q \leq i - j' < 2q$, and consider any multiple $j$ of $q$ with $j' \leq j < j' + q$. Inductively, we correctly have $j' \in S$. Moreover, this holds already in iteration $j$, since after this time we only add indices larger than $j$ to $S$. Consider the jump-query for $q$, $j$, and $P = S \cap \{j - q + 1, \ldots, j\}$ in the above algorithm. In this query, we have $j' \in P$ and we can $D_q$-jump from $j'$ to $i$, so by correctness of Lemma 2 the returned set contains $i$. Hence, we add $i$ to $S$, and correctness follows.

For the running time, since there are $n/q$ multiples of $q$ in $\{1, \ldots, n\}$, there are $O(n/q)$ invocations of the query algorithm with power of two $q$. Thus, the total time of all queries is up to constant factors bounded by

$$\sum_{q \text{ power of two}} \frac{n}{q} \cdot \tilde O\big(\min\{q^2,\; q + \sqrt{q \cdot m_q}\}\big).$$

We split the sum at the point $q^* = m^{1/3}$ where $q^2 = \sqrt{q \cdot m}$ and use the first term for smaller $q$ and the second for larger. Using $m_q \leq m$, we obtain the upper bound

$$\tilde O\Big( \sum_{q \leq q^*} n \cdot q \;+\; \sum_{q > q^*} \big( n + n \sqrt{m/q} \big) \Big) \;=\; \tilde O\big( n q^* + n \sqrt{m / q^*} \big) \;=\; \tilde O\big( n m^{1/3} \big),$$

since $n q^* = n \sqrt{m/q^*} = n m^{1/3}$ by choice of $q^*$. Together with the preprocessing time of Lemma 2, we obtain the desired running time $\tilde O(n m^{1/3} + m)$. ∎
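The resulting algorithm can be summarized by the following skeleton, with the jump-query left as a black box (any implementation satisfying Lemma 2 can be plugged in; bucketing words by their length rounded down to a power of two realizes the sets $D_q$). The function names are ours.

    def word_break(s, D, jump_query):
        n = len(s)
        Dq = {}                                 # D_q = words w with q <= |w| < 2q
        for w in D:
            q = 1 << (len(w).bit_length() - 1)
            Dq.setdefault(q, set()).add(w)
        S = {0}                                 # partitionable prefix lengths
        for j in range(n + 1):
            for q in Dq:
                if j % q == 0:
                    P = {p for p in S if j - q < p <= j}
                    S |= {i for i in jump_query(s, Dq[q], q, j, P) if i <= n}
        return n in S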

It remains to design an algorithm for jump-queries. We present two methods, one with query time $\tilde O(q^2)$ and one with query time $\tilde O(q + \sqrt{q \cdot m_q})$. The combined algorithm, where we first run the preprocessing of both methods, and then for each query run the method with the better guarantee on the query time, proves Lemma 2.

4.1 Jump-Queries in Time $\tilde O(q^2)$

The dictionary matching algorithm by Aho and Corasick [2] yields the following statement.

Lemma 3.

Given a set of strings $D'$ of total size $m'$, in time $O(m')$ one can build a data structure allowing the following queries: Given a string $t$ of length $\ell$, we compute the set of all substrings of $t$ that are contained in $D'$, in time $O(\ell^2)$.

With this lemma, we design an algorithm for jump-queries as follows. In the preprocessing, we simply build the data structure of the above lemma for each $D_q$, in total time $O(m)$.

For a jump-query $(q, j, P)$, we run the query of the above lemma on the substring $s[j - q + 1..j + 2q - 1]$ of $s$. This yields all pairs $(j', i)$, with $j - q < j' \leq j$, such that we can $D_q$-jump from $j'$ to $i$. Iterating over these pairs and checking whether $j' \in P$ gives a simple algorithm for solving the jump-query. The running time is $O(q^2)$, since the query of Lemma 3 runs in time quadratic in the length of the substring $s[j - q + 1..j + 2q - 1]$.
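A simple stand-in for this method, with a hash set of $D_q$ playing the role of the Aho–Corasick automaton, looks as follows (0-indexed: a jump from $p$ to $i$ uses the word $s[p:i]$). Each of the $O(q^2)$ candidate jumps costs one substring lookup, which rolling hashes would make constant-time, matching the $O(q^2)$ bound.

    def jump_query_naive(s, Dq, q, j, P):
        out = set()
        for p in P:                              # at most q jump sources
            for length in range(q, 2 * q):       # word lengths in [q, 2q)
                i = p + length
                if j < i <= len(s) and s[p:i] in Dq:
                    out.add(i)
        return out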

4.2 Jump-Queries in Time $\tilde O(q + \sqrt{q \cdot m_q})$

The second algorithm for jump-queries is more involved. Note that if $q > m$ then $D_q = \emptyset$ and the jump-query is trivial. Hence, we may assume $q \leq m$, in addition to $q \leq n$.

Preprocessing.

We denote the reverse of a string $w$ by $\operatorname{rev}(w)$, and let $\operatorname{rev}(D_q) := \{\operatorname{rev}(w) \mid w \in D_q\}$. We build a trie $\mathcal{T}_q$ for each $\operatorname{rev}(D_q)$. Recall that a trie on a set of strings is a rooted tree with each edge labeled by an alphabet symbol, such that if we orient edges away from the root then no node has two outgoing edges with the same label. We say that a node $v$ in the trie spells the word that is formed by concatenating all symbols on the path from the root to $v$. The set of strings spelled by the nodes in $\mathcal{T}_q$ is exactly the set of all prefixes of strings in $\operatorname{rev}(D_q)$. Finally, we say that the nodes spelling strings in $\operatorname{rev}(D_q)$ are marked. We further annotate the trie by storing for each node $v$ the lowest marked ancestor of $v$.
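Concretely, the trie with marked nodes and lowest-marked-ancestor pointers can be built as follows (a direct, unoptimized rendering; the data layout is ours):

    def build_marked_trie(reversed_words):
        children, marked, lma = [{}], [False], [None]   # node 0 is the root
        for w in reversed_words:
            v = 0
            for ch in w:
                if ch not in children[v]:
                    children.append({}); marked.append(False); lma.append(None)
                    children[v][ch] = len(children) - 1
                v = children[v][ch]
            marked[v] = True                    # v spells a word of rev(D_q)
        def annotate(v, anc):                   # anc = lowest marked strict ancestor
            for u in children[v].values():
                lma[u] = v if marked[v] else anc
                annotate(u, lma[u])
        annotate(0, None)
        return children, marked, lma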

In the preprocessing we also run the algorithm of the following lemma.

Lemma 4.

The following problem can be solved in total time $\tilde O(n + m)$. For each power of two $q$ and each index $i$ in string $s$, compute the minimal $j = j_q(i)$ such that $s[j..i]$ is a suffix of a string in $D_q$. Furthermore, compute the node $v_q(i)$ in $\mathcal{T}_q$ spelling the string $\operatorname{rev}(s[j..i])$.

Note that the second part of the problem is well-defined: $\mathcal{T}_q$ stores the reversed strings $\operatorname{rev}(D_q)$, so for each suffix $y$ of a string in $D_q$ there is a node in $\mathcal{T}_q$ spelling $\operatorname{rev}(y)$.

Proof.

First note that the problem decomposes over $q$. Indeed, if we solve the problem for each $q$ in time $\tilde O(n + m_q)$, then over all $q$ the total time is $\tilde O(n + m)$, as the sets $D_q$ partition $D$ and there are $O(\log m)$ powers of two $q$ with $D_q \neq \emptyset$.

Thus, fix a power of two $q$. It is natural to reverse all involved strings, i.e., we instead want to compute for each $i$ the maximal $j$ such that $\operatorname{rev}(s)[i..j]$ is a prefix of a string in $\operatorname{rev}(D_q)$.

Recall that a suffix tree is a compressed trie containing all suffixes of a given string $t$. In particular, "compressed" means that if the trie would contain a path of degree-1 nodes, labeled by the symbols of a substring $t[a..b]$, then this path is replaced by a single edge, which is succinctly labeled by the pair $(a, b)$. We call each node of the uncompressed trie a position in the compressed trie; in other words, a position in a compressed trie is either one of its nodes or a pair $(e, \ell)$, where $e$ is one of the edges, labeled by some substring $t[a..b]$, and $1 \leq \ell < b - a + 1$. A position $p$ is an ancestor of a position $p'$ if the corresponding nodes in the uncompressed trie have this relation, i.e., if we can reach $p$ from $p'$ by going up the compressed trie. It is well-known that suffix trees have linear size and can be computed in linear time [19]. In particular, iterating over all nodes of a suffix tree takes linear time, while iterating over all positions can take up to quadratic time (as each of the $n$ suffixes may give rise to $\Theta(n)$ positions on average).

We compute a suffix tree $\mathcal{S}$ of $\operatorname{rev}(s)$. Now we determine for each node $v$ in $\mathcal{T}_q$ the position in $\mathcal{S}$ spelling the same string as $v$, if it exists. This task is easily solved by simultaneously traversing $\mathcal{T}_q$ and $\mathcal{S}$, for each edge in $\mathcal{T}_q$ making a corresponding move in $\mathcal{S}$, if possible. During this procedure, we store for each node in $\mathcal{S}$ the corresponding node in $\mathcal{T}_q$, if it exists. Moreover, for each edge $e$ in $\mathcal{S}$ we store (if it exists) the pair $(p, v)$, where $p$ is the lowest position on $e$ corresponding to some node in $\mathcal{T}_q$, and $v$ is the corresponding node in $\mathcal{T}_q$. Note that this procedure runs in time $O(m_q)$, as we can charge all operations to nodes in $\mathcal{T}_q$.

Since $\mathcal{S}$ is a suffix tree of $\operatorname{rev}(s)$, each leaf of $\mathcal{S}$ corresponds to some suffix of $\operatorname{rev}(s)$. With the above annotations of $\mathcal{S}$, iterating over all nodes in $\mathcal{S}$ we can determine for each leaf $x$ the lowest ancestor position $p$ of $x$ corresponding to some node in $\mathcal{T}_q$. It is easy to see that the string spelled by $p$ is the longest prefix shared by the suffix corresponding to $x$ and any string in $\operatorname{rev}(D_q)$. In other words, denoting by $\lambda$ the length of the string spelled by $p$ (which is the depth of $p$ in $\mathcal{S}$), the index $i + \lambda - 1$ is maximal such that $\operatorname{rev}(s)[i..i+\lambda-1]$ is a prefix of a string in $\operatorname{rev}(D_q)$, where $i$ is the starting index of the suffix corresponding to $x$. Undoing the reversing yields the minimal $j$ such that $s[j..i']$ is a suffix of a string in $D_q$, for the corresponding index $i'$ in $s$. Hence, setting $v_q(i')$ to the node of $\mathcal{T}_q$ stored with $p$ solves the problem.

The second part of this algorithm performs one iteration over all nodes in $\mathcal{S}$, taking time $O(n)$, while we charged the first part to the nodes in $\mathcal{T}_q$, taking time linear in the size of $\mathcal{T}_q$. In total over all $q$, we thus obtain the desired running time $\tilde O(n + m)$. ∎

For each $q$, we also compute a maximal packing of paths with many marked nodes in $\mathcal{T}_q$, as is made precise in the following lemma. Recall that in the trie $\mathcal{T}_q$ for dictionary $D_q$ the marked nodes are the ones spelling the strings in $\operatorname{rev}(D_q)$.

Lemma 5.

Given any trie $\mathcal{T}$ and a parameter $B$, a $B$-packing is a family $\mathcal{P} = \{P_1, \ldots, P_r\}$ of pairwise disjoint subsets of the nodes of $\mathcal{T}$ such that (1) each $P_i$ is a directed path in $\mathcal{T}$, i.e., it is a path from some node $v$ to some descendant of $v$, (2) the first and the last node of each $P_i$ are marked, and (3) each $P_i$ contains exactly $B$ marked nodes.

In time $O(|\mathcal{T}|)$ we can compute a maximal (i.e., non-extendable) $B$-packing.

Proof.

We initialize $\mathcal{P} = \emptyset$. We perform a depth-first search on $\mathcal{T}$, remembering the number of marked nodes on the path from the root to the current node $v$. When $v$ is a leaf and this number is less than $B$, then $v$ is not contained in any directed path containing $B$ marked nodes, so we can backtrack. When we reach a node $v$ such that this number equals $B$, then from the path from the root to $v$ we delete the (possibly empty) prefix of unmarked nodes to obtain a new set $P$ that we add to $\mathcal{P}$. Then we restart the algorithm on all unvisited subtrees of the path from the root to $v$. Correctness is immediate. ∎
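An executable rendering of this procedure (recursive, hence suited to shallow tries; the 'dead' flag implements the restart on unvisited subtrees after a path has been cut out):

    def b_packing(children, marked, B):
        packing = []
        def dfs(v, path):
            path = path + [v]
            if sum(marked[u] for u in path) == B:
                while not marked[path[0]]:
                    path.pop(0)                 # delete the unmarked prefix
                packing.append(path)
                for c in children[v].values():
                    dfs(c, [])                  # fresh start below the cut
                return True                     # the old path is consumed
            dead = False
            for c in children[v].values():
                dead = dfs(c, [] if dead else path) or dead
            return dead
        dfs(0, [])
        return packing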

For any power of two $q$, we set $B = B_q := \lceil \sqrt{m_q / q}\, \rceil$ and compute a $B_q$-packing $\mathcal{P}_q$ of $\mathcal{T}_q$, in total time $O(m)$. In $\mathcal{T}_q$, we annotate the highest node of each path $P' \in \mathcal{P}_q$ as being the root of $P'$. This concludes the preprocessing.

Query Algorithm.

Consider a jump-query $(q, j, P)$ as in Lemma 2. For any $P' \in \mathcal{P}_q$ let $w_{P'}$ be the string spelled by the root of $P'$ in $\mathcal{T}_q$, and let $\pi_{P'}$ be the path from the root of $\mathcal{T}_q$ to the root of $P'$ (note that the labels of $\pi_{P'}$ form $w_{P'}$). We set $X_{P'} := \{|y| : y$ is a prefix of $w_{P'}$ with $y \in \operatorname{rev}(D_q)\}$, which is the set containing the length of any prefix of $w_{P'}$ that is contained in $\operatorname{rev}(D_q)$, as the marked nodes in $\mathcal{T}_q$ correspond to the strings in $\operatorname{rev}(D_q)$.

As the first part of the query algorithm, we compute the sumsets $P + X_{P'} := \{a + x \mid a \in P,\ x \in X_{P'}\}$ for all $P' \in \mathcal{P}_q$.

Now consider any index $i$ with $j < i < j + 2q$. By the preprocessing (Lemma 4), we know the minimal $j'$ such that $s[j'..i]$ is a suffix of a string in $D_q$, and we know the node $v = v_q(i)$ in $\mathcal{T}_q$ spelling $\operatorname{rev}(s[j'..i])$. Observe that the path from the root to $v$ in $\mathcal{T}_q$ spells the reverse of $s[j'..i]$. It follows that the strings $w \in D_q$ with which we can jump to $i$ correspond to the marked nodes on this path. To solve the jump-query (for $i$) it would thus be sufficient to check for each marked node on this path whether for its depth $\delta$ we have $i - \delta \in P$, as then we can $D_q$-jump from $i - \delta$ to $i$. Note that we can efficiently enumerate the marked nodes on the path, since each node in $\mathcal{T}_q$ is annotated with its lowest marked ancestor. However, there may be up to $m_q / q$ marked nodes on the path, so this method would again result in running time $O(m_q / q)$ for each $i$, or $O(m_q)$ in total.

Hence, we change this procedure as follows. Starting in $v$, we repeatedly go to the lowest marked ancestor and check whether it gives rise to a jump to $i$ from some element of $P$, until we reach the root of some $P' \in \mathcal{P}_q$. Note that by maximality of $\mathcal{P}_q$ we can visit less than $B_q$ marked ancestors before we meet any node of some $P' \in \mathcal{P}_q$, and it takes less than $B_q$ more steps to lowest marked ancestors to reach the root of $P'$. Thus, this part of the query algorithm takes time $O(B_q)$ per index $i$. Observe that the remainder of the path equals $\pi_{P'}$. We thus can make use of the sumset $P + X_{P'}$ as follows. The sumset contains $i$ if and only if for some $a \in P$ and $x \in X_{P'}$ we have $a + x = i$, i.e., we can $D_q$-jump from $a$ to $i$ using a word whose length is in $X_{P'}$. Hence, we simply need to check whether $i \in P + X_{P'}$ to finish the jump-query for $i$.

Running Time.

As argued above, the second part of the query algorithm takes time $O(B_q)$ for each $i$, which yields $O(q \cdot B_q)$ in total.

For the first part of computing the sumsets, first note that $D_q$ contains at most $m_q / q$ strings, since its total size is at most $m_q$ and each string has length at least $q$. Thus, the total number of marked nodes in $\mathcal{T}_q$ is at most $m_q / q$. As each $P' \in \mathcal{P}_q$ contains exactly $B_q$ marked nodes, we have

$$|\mathcal{P}_q| \;\leq\; \frac{m_q}{q \cdot B_q}. \qquad (1)$$

For each $P' \in \mathcal{P}_q$ we compute a sumset $P + X_{P'}$. Note that $P$ and $X_{P'}$ both live in universes of size $O(q)$, since $P \subseteq \{j - q + 1, \ldots, j\}$ by definition of jump-queries, and all strings in $D_q$ have length less than $2q$ and thus $X_{P'} \subseteq \{1, \ldots, 2q - 1\}$. After translation, we can even assume that both sets are subsets of $\{0, \ldots, O(q)\}$. It is well-known that computing the sumset of two such sets is equivalent to computing the Boolean convolution of their indicator vectors of length $O(q)$. The latter in turn can be reduced to multiplication of $O(q \log q)$-bit numbers, by padding every bit of an indicator vector with $O(\log q)$ zero bits and concatenating all padded bits. Since multiplication is in linear time on the Word RAM, this yields an $\tilde O(q)$ algorithm for sumset computation. Hence, a sumset computation can be performed in time $\tilde O(q)$. Over all $P' \in \mathcal{P}_q$, we obtain a running time of $\tilde O(|\mathcal{P}_q| \cdot q) = \tilde O(m_q / B_q)$, by the bound (1).
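The padding trick can be demonstrated directly with big integers, which here play the role of the $O(q \log q)$-bit numbers in the argument (a sketch, not an optimized routine):

    def sumset_via_multiplication(A, B):
        # pad each indicator bit to t bits so convolution sums cannot carry
        t = max(len(A), len(B)).bit_length() + 1
        a = sum(1 << (x * t) for x in A)
        b = sum(1 << (x * t) for x in B)
        c, out, i, mask = a * b, set(), 0, (1 << t) - 1
        while c:
            if c & mask:
                out.add(i)                     # coefficient of index i is nonzero
            c >>= t; i += 1
        return out

    assert sumset_via_multiplication({1, 2}, {3, 5}) == {4, 5, 6, 7}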

Summing up both parts of the query algorithm yields running time $\tilde O(q \cdot B_q + m_q / B_q)$. Note that our choice of $B_q = \lceil \sqrt{m_q / q}\, \rceil$ minimizes this time bound and yields the desired query time $\tilde O(q + \sqrt{q \cdot m_q})$. This finishes the proof of Lemma 2.

5 Almost-linear Time Algorithms

In this section we prove Theorem 4, i.e., we present an almost-linear time deterministic algorithm for $\mid\circ\ast\mid$-membership and an almost-linear expected time algorithm for