Sparse Suffix Tree Construction in Optimal Time and Space


Paweł Gawrychowski (University of Haifa, Israel). Work done while the author held a post-doctoral position at Warsaw Center of Mathematics and Computer Science.

Tomasz Kociumaka (Institute of Informatics, University of Warsaw, Poland). Supported by Polish budget funds for science in 2013–2017 as a research project under the 'Diamond Grant' program.
Abstract

The suffix tree (and the closely related suffix array) are fundamental structures capturing all substrings of a given text, essentially by storing all its suffixes in the lexicographic order. In some applications, such as sparse text indexing, we work with a subset of $b$ interesting suffixes, which are stored in the so-called sparse suffix tree. Because the size of this structure is $\Theta(b)$, it is natural to seek a construction algorithm using only $O(b)$ words of space, assuming read-only random access to the text. We design a linear-time Monte Carlo algorithm for this problem, hence resolving an open question explicitly stated by Bille et al. [TALG 2016]. The best previously known algorithm, by I et al. [STACS 2014], works in $O(n \log b)$ time. As opposed to previous solutions, which were based on the divide-and-conquer paradigm, ours proceeds in rounds: in the $k$-th round, we consider all suffixes starting at positions congruent to $k$ modulo $\lceil n/b \rceil$. By maintaining rolling hashes, we can lexicographically sort all interesting suffixes starting at such positions, and then we can merge them with the already processed suffixes. For efficient merging, we also need to answer LCE queries efficiently (and in small space). Plugging in the structure of Bille et al. [CPM 2015] yields $O(n \log b)$ time complexity. We improve this structure by a recursive application of the so-called difference covers, which then implies a linear-time sparse suffix tree construction algorithm.

We complement our Monte Carlo algorithm with a deterministic verification procedure. The verification takes $O(n\sqrt{\log b})$ time, which improves upon the bound of $O(n \log b)$ obtained by I et al. [STACS 2014]. This is achieved by first observing that the pruning done inside the previous solution has a rather clean description using the notion of graph spanners with small multiplicative stretch. Then, we are able to decrease the verification time by applying difference covers twice. Combined with the Monte Carlo algorithm, this gives us an $O(n\sqrt{\log b})$-time and $O(b)$-space Las Vegas algorithm.

1 Introduction

In many if not all algorithms operating on texts, one needs a compact representation of all substrings. A well-known data structure capturing all substrings of a given text is the suffix tree, which is a compacted trie storing all suffixes. The size of the suffix tree is linear in the length of the text, and it provides efficient indexing, that is, locating all occurrences of a given pattern. The first linear-time suffix tree construction algorithm was given by Weiner [26]. Later, McCreight [23] provided a simpler procedure, and Ukkonen showed a different approach that allows the text to be maintained under appending characters [25]. All these algorithms work in linear time assuming a constant-size alphabet. However, such an assumption is not always justified. Farach developed a different construction method, based on the divide-and-conquer paradigm, that takes only linear time as long as the alphabet is linear-time sortable [9]. In particular, his algorithm works in linear time for polynomially-bounded integer alphabets.

While the suffix tree provides a lot of information about the structure of the text and hence is a very convenient building block in more complicated algorithms, its large memory footprint is often prohibitive in practice. Hence, there has been a lot of interest in suffix arrays [22]. A suffix array is simply a lexicographically sorted list of all suffixes of the text, usually augmented with the so-called LCP table, which stores the length of the longest common prefix of every two adjacent suffixes. This leads to a very memory-efficient representation that is still capable of providing enough information about the text to replace suffix trees in all applications with no or very small penalty in the time complexity [1]. Furthermore, a suffix array can be constructed in linear time for any linear-time sortable alphabet with a simple and practical algorithm [18].

Even though the suffix array occupies linear space when measured in words, this might be much more than the space needed to encode the text itself. This observation started a long line of work on compressed suffix arrays [10, 14, 15], which take space proportional to the entropy of the text. However, in some applications even smaller space usage is desired. In particular, in the last few years there has been a lot of interest in the string algorithms community in designing sublinear-space solutions, where one assumes read-only random access to the input text and measures only the working space. Of course, the running time should still be linear or close to linear.

A natural idea for text indexing in sublinear space is to use some additional knowledge about the structure of the queries to consider only a (small) subset of all suffixes of the text, and provide indexing only for occurrences starting at the corresponding positions. This was first explored by Kärkkäinen and Ukkonen [19], who introduced the sparse suffix tree. They showed that the evenly spaced sparse suffix tree, which is the compacted trie storing every $k$-th suffix of the text, can be constructed in linear time and working space proportional to the number of suffixes. However, the question of constructing a general sparse suffix tree (or sparse suffix array, which is a lexicographically sorted list of the chosen subset of suffixes together with their LCP information) for an arbitrary subset of $b$ suffixes of a text of length $n$ using $O(b)$ working space remained open. Recently, Bille et al. [4] were able to make significant progress towards resolving this question by developing an $O(n \log^2 b)$-time Monte Carlo algorithm. They also provided a verification procedure implying an $O(n \log^2 b + b^2 \log b)$-time Las Vegas algorithm. Then, I et al. [17] improved the complexity to $O(n \log b)$ (both for Monte Carlo and Las Vegas randomization). They also gave an $O(n)$-time Monte Carlo solution using $O(b \log b)$ space. Very recently, Fischer et al. [12] gave a deterministic $O(b)$-space algorithm in a stronger model of a rewritable text, which needs to be restored before termination.

Another natural problem in the model of read-only random access to the text is that of LCE queries, where we are to preprocess a text $T$ subject to queries $\mathrm{LCE}(i, j)$ returning the length of the longest common prefix of the suffixes $T[i..]$ and $T[j..]$. Several trade-offs between query time, data structure size, construction time, and space usage have been obtained [6, 5, 24]. The queries are typically deterministic, but the construction algorithms range from Monte Carlo randomized through Las Vegas randomized to deterministic. State-of-the-art Monte Carlo data structures of size $O(n/\tau)$ admit an $O(n)$-time and $O(n/\tau)$-space construction with $O(\tau \log(\ell/\tau))$-time queries (where $\ell$ is the returned value), or an $O(n \log(n/\tau))$-time and $O(n/\tau)$-space construction with $O(\tau)$-time queries [5].

Our contribution.

We design an $O(n)$-time Monte Carlo algorithm for sorting an arbitrary subset of $b$ suffixes of a text of length $n$ using $O(b)$ working space. We also show how to verify the answer in $O(n\sqrt{\log b})$ time using $O(b)$ working space, which implies a Las Vegas algorithm of the same complexity. Hence, for Monte Carlo algorithms we close the problem, while for Las Vegas algorithms we make substantial progress towards the desired linear time complexity.

As an auxiliary result, we also develop an $O(n/\tau)$-size LCE data structure with $O(\tau)$-time queries and an $O(n)$-time, $O(n/\tau)$-space Monte Carlo construction algorithm.

Model.

We are given read-only random access to a text $T$ of length $n$ consisting of characters from $\{0, \ldots, \sigma - 1\}$, where $\sigma = n^{O(1)}$. We assume the standard word RAM model with word size $w = \Omega(\log n)$, where basic arithmetic and bit-wise operations on $w$-bit integers take constant time. Our randomized algorithms succeed with high probability, i.e., with probability at least $1 - n^{-c}$ for any user-specified constant $c$.

Previous and our techniques.

The Monte Carlo algorithm of I et al. [17] is based on the notion of $\ell$-strict sparse suffix trees. Intuitively, these are approximate variants of the sparse suffix tree operating on blocks of length $\ell$ rather than with single-character precision. The algorithm starts with a trivial $n$-strict sparse suffix tree and performs a logarithmic number of steps, each of which halves the block length.

Our algorithm, described in Section 3, performs just two steps. Its intermediate result, the coarse compacted trie, is basically the same as the $\ell$-strict sparse suffix tree. The second phase, building the sparse suffix tree from the coarse compacted trie, is relatively easy. For the more challenging first step, we employ Karp–Rabin fingerprints stored in a rolling fashion. More precisely, we proceed in rounds; in the $r$-th round, we store the fingerprints of all suffixes starting at positions congruent to $r$ modulo $\lceil n/b \rceil$. This way, while inserting such a suffix into the coarse compacted trie, we can guarantee that almost all the required fingerprints can be accessed in constant time. The last step of every insertion reduces to a longest common extension (LCE) query. If we apply a recent data structure by Bille et al. [5], the total running time becomes $O(n \log b)$.

To obtain $O(n)$ time, we improve the LCE query time to $O(\tau)$. Bille et al. [5] already applied difference covers for that purpose, but this was at the expense of superlinear preprocessing time, since a certain sparse suffix tree had to be constructed. Our main insight is that one can build a sequence of larger and larger sparse suffix trees, each of them providing faster LCE queries used to construct the next sparse suffix tree.

Our complementary result, an $O(n\sqrt{\log b})$-time Las Vegas algorithm, is provided in Section 4. In short, our solution is based on an algorithm by I et al. [17] for verifying substring equations. We provide a few improvements by exploiting a slightly cleaner (though more complex) formalization of the underlying ideas. First, we avoid eagerly checking some constraints, which lets us reduce the running time to $O(n \log b)$. Next, we use difference covers to restrict the set of starting positions of fragments involved in equations; this technique independently speeds up two steps of the algorithm.

2 Preliminaries

We consider finite strings over an integer alphabet $\Sigma = \{0, \ldots, \sigma - 1\}$. For a string $T = T[1] \cdots T[n]$, its length is $|T| = n$. For $1 \le i \le j \le n$, a string $T[i] \cdots T[j]$ is called a substring of $T$. By $T[i..j]$ we denote its occurrence at position $i$, called a fragment of $T$. A fragment with $i = 1$ is called a prefix (also denoted $T[..j]$), and a fragment with $j = n$ is called a suffix (denoted $T[i..]$).

Tries and compacted tries.

Recall that a trie is a rooted tree whose nodes correspond to prefixes of strings in a given set of strings $S$. The prefix corresponding to a node $v$ is denoted $\bar{v}$, and the node $v$ is called the locus of $\bar{v}$. We extend this notion as follows: the locus of an arbitrary string $x$ is the node $v$ such that $\bar{v}$ is a prefix of $x$ and $|\bar{v}|$ is maximized.

The parent-child relation in the trie is defined so that the root is the locus of the empty string $\varepsilon$, while the parent $u$ of a node $v$ is the locus of $\bar{v}$ without the last character. This character is the label of the edge from $u$ to $v$. The order on the alphabet naturally yields an order on the edges outgoing from any node of the trie, so tries are often assumed to be ordered rooted trees.

A node $v$ is branching if it has at least two children and terminal if $\bar{v} \in S$. A compacted trie is obtained from the underlying trie by dissolving all nodes except the root, branching nodes, and terminal nodes. The dissolved nodes are called implicit, while the preserved nodes are called explicit. The compacted trie takes $O(|S|)$ space, provided that edge labels are stored as pointers to fragments of strings in $S$. In some applications, the first character of each edge label is kept explicitly, however.

Note that the suffix tree of a string $T$ is precisely the compacted trie of the set of all suffixes of $T$. Similarly, the sparse suffix tree of an arbitrary set $B$ of suffixes of $T$ is the compacted trie of $B$. Given the lexicographic order on $B$ along with the lengths of the longest common prefixes between any two consecutive (in this order) elements of $B$, one can easily compute the compacted trie in $O(|B|)$ time; see e.g. [8]. Thus, the problem of constructing the sparse suffix tree is equivalent to that of building the sparse suffix array along with the LCP values.
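For concreteness, here is a minimal sketch (in Python, with 0-indexed positions; not part of the original paper) of the classic stack-based procedure referenced above, which builds the compacted trie from the lexicographically sorted suffixes and their consecutive LCP values in linear time.

```python
def sparse_suffix_tree(n, sorted_starts, lcp):
    """Build a compacted trie of the suffixes starting at sorted_starts
    (given in lexicographic order); lcp[i] is the LCP of the suffixes at
    sorted_starts[i-1] and sorted_starts[i] (lcp[0] is ignored).
    Nodes are triples [string_depth, parent, suffix_start_or_None]."""
    nodes = [[0, -1, None]]              # node 0 is the root
    stack = [0]                          # path from the root to the last leaf
    for i, start in enumerate(sorted_starts):
        h = 0 if i == 0 else lcp[i]      # branching depth for this suffix
        last = None
        while nodes[stack[-1]][0] > h:   # pop nodes deeper than the LCP
            last = stack.pop()
        if nodes[stack[-1]][0] < h:      # split an edge with a new node
            nodes.append([h, stack[-1], None])
            nodes[last][1] = len(nodes) - 1   # re-parent the popped child
            stack.append(len(nodes) - 1)
        nodes.append([n - start, stack[-1], start])   # leaf for the suffix
        stack.append(len(nodes) - 1)
    return nodes

# Example: all suffixes of "banana" in lexicographic order.
tree = sparse_suffix_tree(6, [5, 3, 1, 0, 4, 2], [0, 1, 3, 0, 0, 2])
assert len(tree) == 7                    # root + 6 leaves, no extra splits
```

Each suffix is pushed and popped at most once, so the construction runs in time linear in the number of suffixes.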

LCA queries.

For two nodes $u, v$ of a trie, we denote their lowest common ancestor by $\mathrm{LCA}(u, v)$. Since $\overline{\mathrm{LCA}(u, v)}$ is the longest common prefix of $\bar{u}$ and $\bar{v}$, data structures for LCA queries are often applied to efficiently determine longest common prefixes.

Lemma 2.1 ([16, 3]).

The compacted trie of a set $S$ of strings can be preprocessed in $O(|S|)$ time so that the length of the longest common prefix of any two strings in $S$ can be computed in constant time.

Karp–Rabin fingerprints.

For a prime number $p$ and an integer $r \in \{0, \ldots, p-1\}$, the Karp–Rabin fingerprint [20] of a string $S$ is $\Phi(S) = \left(\sum_{t=1}^{|S|} S[t] \cdot r^{t-1}\right) \bmod p$. For efficiency, we augment it with $r^{|S|} \bmod p$ and $r^{-|S|} \bmod p$ as follows:

Observation 2.2.

Let $S_1$, $S_2$, and $S_3$ be strings such that $S_3 = S_1 S_2$. Given two out of the three fingerprints $\Phi(S_1)$, $\Phi(S_2)$, $\Phi(S_3)$, the third one can be computed in constant time.

For a text $T$, we say that the fingerprints are collision-free if, for any two equal-length fragments of $T$, the equality of their fingerprints implies the equality of the underlying substrings. Randomization lets us construct such fingerprints with high probability:

Fact 2.3.

Let $T$ be a text of length $n$ over the alphabet $\{0, \ldots, \sigma - 1\}$, and let $p$ be a prime number such that $p \ge \max(\sigma, n^{4+c})$ for a constant $c > 0$. If $r \in \{0, \ldots, p-1\}$ is chosen uniformly at random, then the fingerprints are collision-free with probability at least $1 - n^{-c}$.

This is the only source of randomization in this paper. The original argument by Karp and Rabin [20] used a random prime $p$ and a fixed $r$, but we use the more modern approach; see e.g. [4, 5, 17].
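The following sketch illustrates the augmented fingerprints and the constant-time composition of Observation 2.2. The concrete Mersenne prime and the random base are illustrative assumptions, not the paper's parameters.

```python
# A minimal sketch of augmented Karp-Rabin fingerprints (assumed parameters).
import random

P = (1 << 61) - 1                     # a prime p; any p = n^{Theta(1)} works
R = random.randrange(1, P)            # uniformly random base r

class Fingerprint:
    """Stores (phi(S), r^{|S|} mod p, |S|) for a string S."""
    def __init__(self, value=0, power=1, length=0):
        self.value, self.power, self.length = value, power, length

    @staticmethod
    def of(s):
        f = Fingerprint()
        for c in s:                   # phi(S) = sum S[t] * r^{t-1} mod p
            f.value = (f.value + ord(c) * f.power) % P
            f.power = f.power * R % P
            f.length += 1
        return f

    def concat(self, other):
        # phi(S1 S2) = phi(S1) + r^{|S1|} * phi(S2)   (Observation 2.2)
        return Fingerprint((self.value + self.power * other.value) % P,
                           self.power * other.power % P,
                           self.length + other.length)

    def remove_prefix(self, prefix):
        # phi(S2) = (phi(S1 S2) - phi(S1)) * r^{-|S1|}; constant time once
        # the inverse power is stored alongside the fingerprint.
        inv = pow(prefix.power, -1, P)
        return Fingerprint((self.value - prefix.value) * inv % P,
                           self.power * inv % P,
                           self.length - prefix.length)

# Example: phi("abc") composed from phi("ab") and phi("c").
ab, c, abc = Fingerprint.of("ab"), Fingerprint.of("c"), Fingerprint.of("abc")
assert ab.concat(c).value == abc.value
assert abc.remove_prefix(ab).value == c.value
```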

LCE queries.

For a text $T$ of length $n$ and two positions $i, j$ ($1 \le i, j \le n$), we define $\mathrm{LCE}(i, j)$ as the length of the longest common prefix of $T[i..]$ and $T[j..]$. We use the following recent result as a building block in our algorithms:

Lemma 2.4 (Bille et al. [5]).

Given read-only random access to a text $T$ of length $n$ and a parameter $\tau$, $1 \le \tau \le n$, it is possible to construct in $O(n)$ time and $O(n/\tau)$ space a structure of size $O(n/\tau)$, which answers LCE queries in $O(\tau \log(\ell/\tau))$ time, where $\ell$ is the result of the query. The data structure assumes collision-free Karp–Rabin fingerprints.

Difference covers.

We say that a set $D \subseteq \{0, \ldots, \tau-1\}$ is a $\tau$-difference-cover if for each pair of integers $i, j$ there is a value $d(i, j)$, $0 \le d(i, j) < \tau$, such that $(i + d(i,j)) \bmod \tau \in D$ and $(j + d(i,j)) \bmod \tau \in D$. We say that $D$ can be indexed efficiently if there is a bijection between $D$ and $\{1, \ldots, |D|\}$ such that the bijection and its inverse are computable in constant time.

Lemma 2.5 (Maekawa [21], Burkhard and Kärkkäinen [7]).

For all positive integers $\tau$ and $n$ with $\tau \le n$, there exists a $\tau$-difference-cover $D$ of size $O(\sqrt{\tau})$ such that $d(i, j)$ can be evaluated in constant time and $D$ can be efficiently indexed.
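A Maekawa-style construction achieving these bounds (up to constant factors) can be sketched as follows; the function names are ours, and the cover consists of an interval of residues together with the multiples of roughly $\sqrt{\tau}$.

```python
# A minimal sketch of a tau-difference-cover of size O(sqrt(tau)).
import math

def difference_cover(tau):
    """Return a tau-difference-cover D of size O(sqrt(tau))."""
    t = math.isqrt(tau - 1) + 1 if tau > 1 else 1   # t = ceil(sqrt(tau))
    interval = set(range(min(t, tau)))              # residues 0 .. t-1
    multiples = set(range(0, tau, t))               # multiples of t below tau
    return interval | multiples

def d(i, j, tau, t=None):
    """A shift d in [0, tau) with (i+d) mod tau and (j+d) mod tau in D."""
    t = t or (math.isqrt(tau - 1) + 1 if tau > 1 else 1)
    delta = (j - i) % tau
    x = (t - delta % t) % t          # lands i+d on a residue in [0, t)
    return (x - i) % tau

# Sanity check: every pair of residues is covered.
tau = 16
D = difference_cover(tau)
for i in range(tau):
    for j in range(tau):
        s = d(i, j, tau)
        assert (i + s) % tau in D and (j + s) % tau in D
```

The key point, used repeatedly later, is that $d(i, j)$ depends only on $i$ and $j$ modulo $\tau$ and is computable in constant time.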

3 Monte Carlo Algorithm

In this section, we provide an $O(n)$-time and $O(b)$-space Monte Carlo algorithm for computing the sparse suffix tree. In Section 3.1, we introduce the coarse compacted trie, which is an intermediate byproduct of our algorithm, and we show how to use it to build the sought sparse suffix tree. Then we concentrate on computing the coarse compacted trie. Section 3.2 provides an $O(n \log b)$-time algorithm whose bottleneck is answering LCE queries. We overcome this in Section 3.3 by providing an improved data structure for LCE queries. We conclude the exposition in Section 3.4.

3.1 Coarse Compacted Tries

The coarse compacted trie of a collection of strings is defined as follows. We conceptually partition each string into blocks consisting of $\tau$ characters each (the last block might be shorter); we will eventually fix $\tau = \lceil n/b \rceil$. Then we consider each block to be a single supercharacter, and we form the compacted trie of the resulting set of strings. We use fingerprints to represent the supercharacters; hence, the order of the edges outgoing from the same node in a coarse compacted trie corresponds to the order of the fingerprints of the first block on each edge, not to the lexicographic order of the blocks.

Below, we show that the sought sparse suffix tree can be quite easily derived from the corresponding coarse compacted trie. The key building block is sorting $O(b)$ strings of length $\tau$. For $b \le \sqrt{n}$, a simple comparison-based algorithm achieves the required time and space complexity.

Fact 3.1.

Given random access to $m$ strings of length $\ell$, the strings can be sorted in $O(m\ell + m^2)$ time and $O(m)$ space.

Proof.

We shall prove that a single string $s$ can be inserted into a sorted array of at most $m$ strings in $O(\ell + m)$ time. We scan the consecutive letters of $s$. Having read $s[1..t]$, we maintain a partition of the strings in the array into three classes depending on whether their length-$t$ prefixes are smaller than, equal to, or greater than $s[1..t]$. Note that these classes form consecutive ranges. Moreover, in order to update the partition after reading the next letter, it suffices to scan the 'equal' class from both ends and remove all leading strings that are smaller and all trailing strings that are greater with respect to the extended prefix. The running time is proportional to the number of strings removed plus $O(1)$ per step, which gives $O(\ell + m)$ in total over all steps. After we scan the whole $s$, the partition determines the position where it should be inserted. ∎
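The insertion step of this proof can be sketched as follows (a simplified illustration; the 'smaller'/'equal'/'greater' classes are represented only by the boundaries lo and hi of the 'equal' range).

```python
# A minimal sketch of the insertion step from Fact 3.1.
def insert_sorted(arr, s):
    """Insert s into the sorted list arr; arr[lo:hi] is the 'equal' class,
    i.e., the strings whose scanned prefix matches the prefix of s."""
    lo, hi = 0, len(arr)
    for t, c in enumerate(s):
        # Drop leading strings that end here or whose t-th letter is smaller.
        while lo < hi and (len(arr[lo]) <= t or arr[lo][t] < c):
            lo += 1
        # Drop trailing strings whose t-th letter is greater.
        while lo < hi and len(arr[hi - 1]) > t and arr[hi - 1][t] > c:
            hi -= 1
        if lo == hi:
            break
    arr.insert(lo, s)

def sort_strings(strings):
    arr = []
    for s in strings:        # m insertions, each O(length + strings dropped)
        insert_sorted(arr, s)
    return arr

assert sort_strings(["ba", "ab", "aa", "b"]) == ["aa", "ab", "b", "ba"]
```

Each string leaves the 'equal' window at most once per insertion, which gives the claimed $O(\ell + m)$ bound per insertion.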

For $b > \sqrt{n}$, we have enough space to use an algorithm based on counting sort.

Fact 3.2.

Assume we are given random access to $m$ strings of length $\ell$ over the alphabet $\{0, \ldots, \sigma - 1\}$. For any positive integer $k$, the strings can be sorted in $O(\ell k (m + \sigma^{1/k}))$ time and $O(m + \sigma^{1/k})$ space.

Proof.

We treat every character as a $\lceil \log \sigma \rceil$-bit integer and partition it into $k$ chunks of $\lceil \frac{1}{k} \log \sigma \rceil$ bits each. We radix sort the resulting collection of strings of length $\ell k$ over a smaller alphabet of size $O(\sigma^{1/k})$. With $\ell k$ iterations of counting sort, we obtain the claimed bounds. ∎
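A sketch of this chunk-based radix sort follows; strings are represented as tuples of integers, and the parameter k trades the counting-sort range for the number of passes.

```python
# A minimal sketch of Fact 3.2: each character is split into k bit-chunks,
# and the strings are radix sorted with ell * k stable passes over the
# smaller alphabet of size about sigma^(1/k).
def radix_sort_strings(strings, ell, sigma, k):
    """Sort equal-length strings (tuples of ints < sigma) of length ell."""
    bits = max(1, (sigma - 1).bit_length())     # bits per character
    chunk_bits = -(-bits // k)                  # ceil(bits / k)
    mask = (1 << chunk_bits) - 1
    order = list(range(len(strings)))
    for pos in reversed(range(ell)):            # least significant position first
        for c in range(k):                      # least significant chunk first
            shift = c * chunk_bits
            buckets = [[] for _ in range(mask + 1)]
            for idx in order:                   # stable counting-sort pass
                buckets[(strings[idx][pos] >> shift) & mask].append(idx)
            order = [idx for b in buckets for idx in b]
    return [strings[i] for i in order]

data = [(3, 1), (0, 2), (3, 0), (1, 7)]
assert radix_sort_strings(data, ell=2, sigma=8, k=3) == sorted(data)
```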

Now, we can give the procedure building the compacted trie from the coarse compacted trie.

Lemma 3.3.

Given the coarse compacted trie (with block length $\tau = \lceil n/b \rceil$) of a set of $b$ suffixes, we can construct their compacted trie in $O(n)$ time and $O(b)$ space.

Proof.

Consider a branching node $v$ of the coarse compacted trie and let $w_1, \ldots, w_h$ be its children, where the edge from $v$ to $w_c$ starts with a single supercharacter representing a string $x_c$ of length $\tau$. To obtain the compacted trie, we remove the edges and paste the compacted trie of the strings $x_1, \ldots, x_h$ to connect $v$ with $w_1, \ldots, w_h$. Such a trie is easy to construct in $O(h\tau)$ time by inserting the strings in the lexicographic order: we follow the rightmost path while its label matches the inserted string, and we create a new branch as soon as it does not match. Hence, we only need to show how to sort the strings $x_1, \ldots, x_h$.

We gather such collections of strings from all branching nodes, sort the disjoint union of all these collections, and then recover the sorted collections with a single scan. Note that the total number of strings to be sorted is bounded by the number of edges in the coarse compacted trie, which is $O(b)$. If $b \le \sqrt{n}$, we sort using Fact 3.1, while for $b > \sqrt{n}$ we apply Fact 3.2 with $k = \lceil \log \sigma / \log b \rceil = O(1)$, so that $\sigma^{1/k} \le b$. In both cases, sorting is performed in $O(n)$ time and $O(b)$ space. This is also the overall time and space necessary to construct the compacted tries based on the sorted collections. ∎

3.2 Construction of Coarse Compacted Tries

Again, we provide slightly different construction algorithms for small and large $b$. Both algorithms use Lemma 2.4 and rolling Karp–Rabin fingerprints, which we describe below. Moreover, while constructing the coarse compacted trie, at each explicit node $v$ we always store the fingerprint of the corresponding string $\bar{v}$. Similarly, each edge stores the fingerprint of the first block of its label, as well as a reference to a fragment of $T$ representing the whole label. For each node, the outgoing edges are kept in a doubly-linked list (ordered by the fingerprints of their first blocks).

For an integer $r$, $0 \le r < \tau$, we say that a position $i$ is $r$-aligned if $i \equiv r \pmod{\tau}$. A fragment $T[i..j]$ is called an $r$-aligned fragment of $T$ if $i$ is an $r$-aligned position and either $j = n$ or $j + 1$ is also $r$-aligned. The key idea of our algorithm is to proceed in rounds so that in the $r$-th round, while inserting $r$-aligned suffixes, we can compute the fingerprint of any $r$-aligned fragment in constant time. More formally, we denote the set of all $r$-aligned suffixes by $S_r$, and we define a component $\Phi_r$ consisting of the fingerprint of every suffix in $S_r$, as well as the fingerprint of the whole text $T$. The following fact, stating all the necessary properties of $\Phi_r$, easily follows from Observation 2.2.

Fact 3.4.

The component $\Phi_0$ can be constructed in $O(n)$ time and $O(n/\tau)$ space. Moreover, given $\Phi_r$, the fingerprint of any $r$-aligned fragment can be computed in constant time, and $\Phi_{r+1}$ can be constructed in $O(n/\tau)$ time and space.
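The rolling update behind Fact 3.4 can be sketched as follows, reusing the Fingerprint class from the earlier sketch; the dictionary-based representation is a simplification of the paper's arrays.

```python
# A minimal sketch of the component Phi_r and its rolling update.
def phi(text, tau, r):
    """Fingerprints of the suffixes text[i:] for all i = r (mod tau)."""
    suffix = Fingerprint()                    # fingerprint of text[i:]
    table = {}
    for i in reversed(range(len(text))):      # build right to left in O(n)
        suffix = Fingerprint.of(text[i]).concat(suffix)
        if i % tau == r:
            table[i] = suffix
    return table

def advance(table, text):
    """Turn Phi_r into Phi_{r+1}: one O(1) update per stored suffix,
    i.e., O(n / tau) time in total, by stripping a single character."""
    return {i + 1: fp.remove_prefix(Fingerprint.of(text[i]))
            for i, fp in table.items() if i + 1 < len(text)}

text = "abracadabra"
tab0 = phi(text, 4, 0)                        # suffixes at positions 0, 4, 8
tab1 = advance(tab0, text)                    # suffixes at positions 1, 5, 9
assert tab1[1].value == Fingerprint.of(text[1:]).value
```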

3.2.1 Small $b$

First, we show how to construct the coarse compacted trie for a set of $b \le \sqrt{n}$ suffixes. Our procedure actually works for an arbitrary number of suffixes, but it is too slow for $b = \omega(\sqrt{n})$ suffixes.

Theorem 3.5.

The coarse compacted trie of a set of $b$ suffixes of a text of length $n$ can be computed in $O(n \log b + b^2)$ time and $O(b)$ space.

Proof.

We proceed in rounds corresponding to $r = 0, 1, \ldots, \tau - 1$. In the $r$-th round, we insert the $r$-aligned suffixes one by one. For this, we use $\Phi_r$ and the LCE-structure from Lemma 2.4.

We locate the locus of the inserted suffix $s$ in the current trie by a traversal starting at the root. If we are currently at an explicit node $v$, we use $\Phi_r$ to obtain the fingerprint of the next supercharacter to be followed. Next, we scan the edges going out of $v$, comparing that fingerprint with the ones stored with the edges. If none of them matches, then $v$ is the locus, and we insert a new leaf with $v$ as its parent. Otherwise, we have selected an edge leading to a child $w$ of $v$. We check whether the locus is in the subtree of $w$ by comparing the fingerprint of $\bar{w}$ with the fingerprint of the corresponding ($r$-aligned) prefix of $s$. If so, we continue the traversal at $w$. Otherwise, we know that the locus is an implicit node on the edge from $v$ to $w$. In this case, we use one LCE query (rounded down to a multiple of $\tau$) to calculate the exact position of the locus on the edge, and we attach a new leaf there. Then, we also need to spend $O(\tau)$ time to compute the fingerprint of the edge created by subdividing the original edge at the locus. All other fingerprints stored in the new nodes and edges are computed in constant time since the corresponding fragments are $r$-aligned.

The trie is of size $O(b)$, so in a single traversal we visit at most $O(b)$ explicit nodes and scan $O(b)$ edges in total, spending constant time at each of them. The last step requires $O(\tau \log b)$ time for an LCE query and $O(\tau)$ time for the fingerprint computation, so the overall insertion time is $O(b + \tau \log b)$. Summing over all the rounds, this is $O(b^2 + n \log b)$ (including the $O(n)$ time needed to maintain $\Phi_r$ and to build the LCE-structure). The space consumption remains $O(b)$ throughout the algorithm. ∎

3.2.2 Large $b$

For larger $b$, instead of processing suffixes one by one, we insert all suffixes from $B \cap S_r$ in bulk. Our insertion algorithm requires the coarse compacted trie of $B \cap S_r$, which is obtained from the coarse compacted trie of $S_r$.

Lemma 3.6.

For $b \ge \sqrt{n}$, given $\Phi_r$, the coarse compacted trie of $S_r$ can be computed in $O(n/\tau)$ time and space.

Proof.

For each $r$-aligned position $i$, we define its block as $T[i..\min(i + \tau - 1, n)]$. Then, we define a string $F$ where $F[j]$ is the fingerprint of the block corresponding to the $j$-th leftmost $r$-aligned position in $T$. In other words, $F$ is obtained from the longest $r$-aligned suffix of $T$ by partitioning it into blocks of size $\tau$ and replacing each block by its fingerprint. Observe that this partition coincides with the partitions of the $r$-aligned suffixes in the definition of the coarse compacted trie of $S_r$. Thus, the coarse compacted trie of $S_r$ is precisely the suffix tree of $F$.

The first step of our construction algorithm is to compute $F$. Using $\Phi_r$, this takes $O(n/\tau)$ time, since the blocks are $r$-aligned fragments. Then, we sort the letters of $F$ using Fact 3.2 with $m = n/\tau$, $\ell = 1$, and $k = \lceil \log p / \log(n/\tau) \rceil = O(1)$. This way, using $O(n/\tau)$ time and space, each character of $F$ can be replaced by its rank, which is at most $n/\tau$. After such normalization, we construct the suffix tree of $F$ in $O(n/\tau)$ time and space [9, 18]. Finally, we replace the normalized characters on the edges by the original $O(\log n)$-bit fingerprints. ∎
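The reduction at the heart of this proof can be sketched as follows, again reusing the Fingerprint class from the earlier sketch; for brevity, each block fingerprint is recomputed in $O(\tau)$ time here instead of being obtained in constant time from $\Phi_r$.

```python
# A minimal sketch of Lemma 3.6's reduction: the string F of block
# fingerprints (one supercharacter per r-aligned block) determines the
# coarse compacted trie of S_r as the suffix tree of F.
def block_string(text, tau, r):
    """F[j] = fingerprint of the j-th r-aligned block of the text."""
    return [Fingerprint.of(text[i:i + tau]).value
            for i in range(r, len(text), tau)]

# Example: with tau = 2 and r = 0, "abab" yields two equal supercharacters,
# so the two 0-aligned suffixes share a full block in the coarse trie.
F = block_string("abab", 2, 0)
assert F[0] == F[1]
```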

Next, we implement the bulk insertion procedure. Note that the coarse compacted trie of $B \cap S_r$ can be extracted in $O(n/\tau)$ time from the coarse compacted trie of $S_r$. We define $B_r = B \cap (S_0 \cup \cdots \cup S_r)$.

Lemma 3.7.

Assume that we have access to $\Phi_r$ and an LCE-structure with query time $q$. Given the coarse compacted trie for $B_{r-1}$ and the coarse compacted trie for $B \cap S_r$, we can construct the coarse compacted trie for $B_r$ in $O(b + |B \cap S_r| \cdot (q + \tau))$ time and $O(b)$ space.

Proof.

Intuitively, our aim is to traverse the Euler tour of the resulting coarse compacted trie of $B_r$. More precisely, we process the consecutive suffixes $s \in B \cap S_r$, and for each we find the locus of $s$ in the current trie and then add $s$ to the trie. The suffixes are processed according to the coarse lexicographic order, so we only need to move forward on the Euler tour to reach the next locus; actually, we can start from the new terminal node (representing the previously inserted suffix), possibly skipping some part of the Euler tour.

While visiting an edge $(v, w)$ whose label starts with a supercharacter $c$, we shall find out if the locus of $s$ is an implicit node on that edge. An equivalent condition is that $s$ starts with $\bar{v}$ followed by a block represented by $c$, but $\bar{w}$ is not a prefix of $s$. For this, we compute the fingerprints of the appropriate fragments of $s$. Since $s \in S_r$, these fragments are $r$-aligned in $T$, so the fingerprints are determined in constant time. If the locus turns out to be on the current edge, we make an LCE query to determine its exact depth; the result needs to be rounded down to full blocks. Next, we introduce a new explicit node on the edge and a terminal node representing $s$. Fingerprints stored at the new nodes and edges can be computed in constant time, except for the one on the subdivided edge, which takes $O(\tau)$ time as the corresponding block may not be $r$-aligned.

Similarly, while at an explicit node $v$, we first check if $\bar{v}$ is a prefix of $s$; otherwise, the subtree of $v$ can be ignored. Next, we compute the supercharacter $c$ representing the block of $s$ following $\bar{v}$ and compare it to the labels of the edges from $v$ to its children. Note that a vertex with $k$ children appears on the Euler tour $k + 1$ times, so during any visit to $v$ we need to consider at most two edges, whose labels provide the lower bound and the upper bound on $c$. If $c$ turns out to be within the range, we simply introduce the leaf representing $s$ as a child of $v$. The new edge (labeled with $c$) is inserted between the two considered edges.

Hence, in either case it takes constant time to make progress traversing the Euler tour, and at most $O(q + \tau)$ time to extend the trie. This sums up to $O(b + |B \cap S_r| \cdot (q + \tau))$ in total. ∎

Theorem 3.8.

Assume that we have access to an LCE-structure with query time $q$. The coarse compacted trie of any set of $b \ge \sqrt{n}$ suffixes can be computed using $O(n + b \cdot q)$ time and $O(b)$ space.

Proof.

We iterate over $r = 0, 1, \ldots, \tau - 1$ while maintaining $\Phi_r$. By Fact 3.4, this takes $O(n)$ time and $O(b)$ space in total. In the $r$-th round, we apply Lemma 3.6 to compute the coarse compacted trie of $S_r$, from which we extract the coarse compacted trie of $B \cap S_r$, and then we use Lemma 3.7 to merge it with the coarse compacted trie of $B_{r-1}$ to obtain the coarse compacted trie of $B_r$. The total cost of all applications of Lemma 3.7 is $\sum_{r=0}^{\tau-1} O(b + |B \cap S_r| \cdot (q + \tau)) = O(b\tau + b(q + \tau)) = O(n + b \cdot q)$. ∎

Corollary 3.9.

For any set of $b$ suffixes of a text of length $n$, the sparse suffix tree can be computed using $O(n \log b)$ time and $O(b)$ space. The resulting tree might be incorrect with probability $O(n^{-c})$ for a user-defined constant $c$.

Proof.

The algorithm builds the coarse compacted trie of the $b$ suffixes. If $b$ is small ($b \le \sqrt{n}$), we use Theorem 3.5. Otherwise, we apply Theorem 3.8 with Lemma 2.4 (for $\tau = \lceil n/b \rceil$) to answer LCE queries. This results in $q = O(\tau \log b)$, so the total running time is $O(n + b\tau \log b) = O(n \log b)$. Finally, we construct the sparse suffix tree using Lemma 3.3. ∎

3.3 More efficient LCE queries

In this section, we provide a faster data structure for LCE queries. Our solution takes $O(n/\tau)$ space and has $O(n)$ construction time and $O(\tau)$ query time. When used in Theorem 3.8 (with $\tau = \lceil n/b \rceil$), the running time immediately improves to $O(n)$.

The main idea is similar to that of Bille et al. [5]: we use a sparse suffix tree on the suffixes sampled by a difference cover to process LCE queries with large answers in constant time. Bille et al. used the algorithm by I et al. [17] to build this tree, which resulted in superlinear construction time. We devise a recursive approach: we construct sparse suffix arrays for larger and larger difference-cover samples, using each one to speed up the construction of the next. The largest of these samples consists of $O(n/\tau)$ suffixes, but we can still use $O(n/\tau)$ space. Thus, having relatively more space, we can apply a much simpler variant of the algorithm of Theorem 3.8.

Formally, at the $i$-th level of recursion, we have a set of suffixes whose starting positions are sampled using a $\tau_i$-difference cover $D_i$ obtained via Lemma 2.5, with $\tau_i$ increasing in $i$. The recursion terminates at the level $s$ at which $\tau_s = \Theta(\tau^2)$, so that the final sample consists of $O(n/\tau)$ suffixes.

Lemma 3.10 (see Section 3.6 in [5]).

After $O(n)$-time and $O(n/\tau)$-space preprocessing, for every $\tau'$ with $\tau \le \tau' \le n$, the sparse suffix tree of the suffixes starting at positions sampled with a $\tau'$-difference cover can be processed in time linear in its size so that LCE queries can be answered in $O(\tau \log(\tau'/\tau))$ time; queries whose result is at least $\tau'$ take only $O(\tau)$ time.

Proof.

We store the structure of Lemma 2.4 and its fingerprint component (to test the equality of fragments in $O(\tau)$ time). Given the sparse suffix tree of the sampled suffixes, we build the data structure of Lemma 2.1 for longest common prefix queries on them. Next, we exploit the fact that the set of sampled positions is based on a difference cover.

To find $\mathrm{LCE}(i, j)$, we first test whether the result is at least $d(i, j)$. This involves a single substring equality check. If the answer is positive, we compute the result in constant time as $d(i, j) + \mathrm{LCE}(i + d(i, j), j + d(i, j))$, with the second summand determined using the component of Lemma 2.1 (both arguments are sampled positions). Otherwise, we use the data structure of Lemma 2.4; in this case, the result, and thus also the query time, is bounded in terms of $d(i, j) < \tau'$. The overall running time is $O(\tau \log(\tau'/\tau))$. ∎
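The query procedure of Lemma 3.10 can be sketched as follows; the helpers injected as parameters (the shift function d from the difference-cover sketch above, and a naive LCP in place of the structures of Lemma 2.1 and Lemma 2.4) are illustrative stand-ins, not the paper's components.

```python
# A minimal sketch of the difference-cover LCE strategy.
def naive_lce(text, i, j):
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k

def cover_lce(text, i, j, shift, cover_lcp):
    """shift(i, j): a value d with both i+d and j+d sampled (Lemma 2.5);
    cover_lcp(p, q): O(1) LCP on sampled suffixes (Lemma 2.1 in the paper);
    here the fallback for small answers is a naive scan."""
    s = shift(i, j)
    # One substring equality check (fingerprints in the paper, direct here).
    if text[i:i + s] == text[j:j + s] and i + s < len(text) and j + s < len(text):
        return s + cover_lcp(i + s, j + s)
    return naive_lce(text, i, j)     # stands in for Lemma 2.4

text = "abaababaab"
tau = 4
result = cover_lce(text, 0, 5, lambda i, j: d(i, j, tau),
                   lambda p, q: naive_lce(text, p, q))
assert result == naive_lce(text, 0, 5)
```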

Next, we observe that the approach of Lemma 3.3 can also be used to process a coarse compacted trie storing more suffixes, provided that proportionally more working space is available.

Lemma 3.11.

Let $b'$ and $\tau'$ be positive integers such that $b'\tau' \le n$ and $b' \ge \sqrt{n}$. Given the coarse compacted trie (with block length $\tau'$) of an arbitrary set of $b'$ suffixes, we can construct their compacted trie in $O(b'\tau')$ time and $O(b')$ space.

Recall that for $0 \le r < \tau$, we defined $S_r$ as the set of $r$-aligned suffixes; for an arbitrary modulus $\tau_i$, we analogously write $S^i_r$ for the set of suffixes starting at positions congruent to $r$ modulo $\tau_i$. We denote $\mathcal{D}_i = \bigcup_{r \in D_i} S^i_r$, where $D_i$ is the $\tau_i$-difference cover of Lemma 2.5.

Lemma 3.12.

The sparse suffix arrays of $S^i_r$ (for $1 \le i \le s$ and $r \in D_i$) can be computed in $O(n)$ time and $O(n/\tau)$ space in total if $\tau \le \sqrt{n}$.

Proof.

First, we use Lemma 2.5 to generate $D_i$ for each $1 \le i \le s$. This takes $O(\sum_{i=1}^{s} \sqrt{\tau_i}) = O(n/\tau)$ time and space. Next, we lexicographically sort the pairs $(r, i)$ with $r \in D_i$ in $O(n/\tau)$ time and space using Fact 3.2 with suitable parameters. Finally, we iterate over the residues $r$ and construct the sparse suffix trees of $S^i_r$ based on the coarse compacted tries of $S^i_r$. Building the latter takes $O(n)$ time and $O(n/\tau)$ space across all iterations by Fact 3.4 and Lemma 3.6. We also extend each tree with the data structure of Lemma 2.1 for LCP queries. We use the previously constructed list of pairs to build an array, indexed by residues $r$, storing in each entry the list of integers $i$ such that $r \in D_i$. Next, we extract the coarse compacted tries of $S^i_r$ from the coarse compacted trie of $S_r$. Finally, we use Lemma 3.11 to obtain the sparse suffix arrays of $S^i_r$. ∎

Theorem 3.13.

There is a data structure of size $O(n/\tau)$ which can be constructed in $O(n)$ time and $O(n/\tau)$ space and answers LCE queries in $O(\tau)$ time. The data structure might be corrupted with probability $O(n^{-c})$ for a user-defined constant $c$.

Proof.

If $\tau = O(1)$, we use the standard $O(n)$-space data structure. Similarly, if $\tau = \Omega(n)$, then the query time of the data structure of Lemma 2.4 is always $O(\tau)$, so there is nothing to do as well. Thus, we assume $\tau \le \sqrt{n}$.

First, we use Lemma 3.12 to compute the sparse suffix arrays of $S^i_r$ for all $1 \le i \le s$ and $r \in D_i$. Then, we iterate over $i = 1, \ldots, s$ and, for each such $i$, merge the suffix arrays of all $S^i_r$ with $r \in D_i$ to obtain the sparse suffix array of $\mathcal{D}_i$. This is easily achieved using LCE queries. We answer these queries using the component of Lemma 2.4 for $i = 1$ and Lemma 3.10 (plugging in the sparse suffix tree constructed for $\mathcal{D}_{i-1}$) for $i > 1$. In either case, the cost of a single LCE query is bounded as in Lemma 3.10.

The space consumption is clearly $O(n/\tau)$ throughout the algorithm. The overall construction time is dominated by the cost of the LCE queries, leading to the total running time of $O(n)$. ∎

3.4 Summary

Theorem 3.14.

For any set of $b$ suffixes of a text of length $n$, the sparse suffix tree can be computed using $O(n)$ time and $O(b)$ space. The resulting tree might be incorrect with probability $O(n^{-c})$ for a user-defined constant $c$.

Proof.

First, we apply Theorem 3.13 with $\tau = \lceil n/b \rceil$ to construct an efficient data structure for LCE queries. Next, the algorithm builds the coarse compacted trie of the $b$ suffixes: we use Theorem 3.5 if $b \le \sqrt{n}$ and Theorem 3.8 otherwise. With the $O(\tau)$-time LCE queries, either construction now takes $O(n)$ time. Finally, we construct the sparse suffix tree using Lemma 3.3. ∎

4 Las Vegas Algorithm

A simple way to obtain a Las Vegas algorithm is to run a Monte Carlo procedure and then verify whether the result is correct. Thus, for each two adjacent suffixes $T[i..]$ and $T[j..]$ in the claimed lexicographic order, we need to check whether their longest common prefix $\ell$ has been computed correctly and whether $T[i..]$ is indeed lexicographically smaller than $T[j..]$. Equivalently, these conditions can be stated as $T[i..i+\ell-1] = T[j..j+\ell-1]$ and $T[i+\ell] < T[j+\ell]$. The latter constraint is trivial to verify, so the problem boils down to checking whether $T$ satisfies a system of substring equations.

We provide a deterministic $O(n\sqrt{\log b})$-time, $O(b)$-space algorithm for that problem, improving upon the $O(n \log b)$-time solution by I et al. [17]. Let us start by recalling a simpler $O(n \log n)$-time version of their algorithm. A straightforward reduction allows us to restrict the lengths of all equations to powers of two, i.e., to $2^q$ for integer values $q$. Now, the main idea is to relax the problem: a YES answer is required if all equations are satisfied, but a NO answer only if there is a mismatch within the first half of an equation. The exact problem easily reduces to two instances of the relaxed version: one on the original text and one on its reverse. Then, the algorithm of I et al. works in phases. Each phase is given equations of length $\ell$ and is responsible for making sure that there are no mismatches within the middle $\ell/2$ positions. If so, the equations can be shortened to length $\ell/2$. Once the maximum length is sufficiently small, all equations are verified naively.

In each phase, a graph is constructed, with nodes corresponding to blocks of the text. Each edge represents an equation and connects the nodes corresponding to the blocks containing the starting positions of the two involved fragments. Then, a carefully chosen subset of constraints is verified naively. These are original equations forming a spanner of the graph with logarithmic multiplicative stretch, as well as constraints stating that certain fragments of the text have a given period (such constraints can also be expressed as substring equations). This turns out to be sufficient to implement the phase.

We introduce a few important changes to the above algorithm. First, we avoid the eager verification of the equations. Instead, we split them into several equations of half the length and process them in further phases as if they were given in the input. Secondly, we observe that we can create these equations so that they have slack on both ends: just the middle positions need to be checked. Thus, we change the original relaxation to work with such equations only. Some technical details (e.g., the reduction from the exact problem) become more complex, but now each phase simply converts a system of (relaxed) equations of length $2^q$ into a system of (relaxed) equations of length $2^{q-1}$. This way, we achieve $O(n \log b)$ running time.

The notion of difference covers lets us further exploit the slack at both sides of the equations and consequently reduce the running time to $O(n\sqrt{\log b})$. We perturb each equation so that the lengths remain uniform, but the starting positions of both fragments involved belong to a certain restricted set. First, this approach lets us restrict the set of starting positions of long equations, and consequently remove most of them. (Detecting irrelevant equations also involves computing a maximum-weight spanning forest of a certain graph.) Secondly, we reduce the number of blocks where starting positions appear, i.e., the number of vertices of the graph for which the spanner is constructed. This results in fewer equations being created in each phase.

In the following sections, we formalize the above discussion. In Section 4.1, we introduce the notion of a substring equation and prove basic facts. In Sections 4.2 and 4.3, we formalize further tools already used by I et al., originating in combinatorics on words and graph spanners, respectively. Next, in Section 4.4 we provide our first solution, whose running time is comparable to the previous state of the art, and in Section 4.5 we improve the time complexity to $O(n\sqrt{\log b})$. Both algorithms require $O(b)$ space.

4.1 Systems of Substring Equations

Consider texts $T$ of length $n$. For any integers $i$, $j$, and $\ell$ such that $1 \le i \le n$, $1 \le j \le n$, and $\ell \le n + 1 - \max(i, j)$, we say that $T[i..i+\ell-1] = T[j..j+\ell-1]$ is a substring equation. The quantity $\ell$ is called the length of the equation and $j - i$ is called the shift. Note that the shift is oriented: its sign is reversed when we write the equation as $T[j..j+\ell-1] = T[i..i+\ell-1]$. A particular text satisfies the equation whenever $T[i..i+\ell-1]$ and $T[j..j+\ell-1]$ are occurrences of the same string. Equations of length 0 or less are assumed to be trivially satisfied. For a non-negative integer $\Delta$, we say that $T$ satisfies $T[i..i+\ell-1] = T[j..j+\ell-1]$ with shortage $\Delta$ whenever $T$ satisfies $T[i..i+\ell-1-\Delta] = T[j..j+\ell-1-\Delta]$.

Intuitively, our algorithm internally works with relaxed equations: a YES answer is required if they are satisfied (with no shortage), but it suffices to report NO if some equation fails even with a certain shortage $\Delta$, not necessarily uniform across all equations. This justifies the structure of most auxiliary results.
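The following sketch (0-indexed and illustrative, not the paper's verifier) makes the notions of a substring equation and of satisfaction with shortage concrete.

```python
# A minimal sketch of substring equations and shortage.
def satisfies(text, i, j, ell, shortage=0):
    """Check T[i..i+ell-1] = T[j..j+ell-1] with the given shortage,
    i.e., compare only the first ell - shortage characters."""
    m = max(0, ell - shortage)            # length <= 0 holds trivially
    return text[i:i + m] == text[j:j + m]

# Example: in "abab", the equation T[0..1] = T[2..3] of length 2 holds,
# so it also holds with any shortage.
assert satisfies("abab", 0, 2, 2) and satisfies("abab", 0, 2, 2, 1)
```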

The following fact describes how a single relaxed equation can be replaced by a system of shorter relaxed equations that share its shift and whose consecutive fragments overlap. We can think of this procedure as splitting the long equation into shorter ones.

Fact 4.1.

Let $E \colon T[i..i+\ell-1] = T[j..j+\ell-1]$ and $E_t \colon T[i_t..i_t+\ell_t-1] = T[j_t..j_t+\ell_t-1]$ (for $1 \le t \le m$) be substring equations with the same shift $j - i$ such that $i_1 = i$ and $i_t + \ell_t \le i + \ell$ for $1 \le t \le m$. Moreover, let $\Delta$ and $\Delta_1, \ldots, \Delta_m$ be non-negative integers such that $i_{t+1} \le i_t + \ell_t - \Delta_t$ for $1 \le t < m$ and $i + \ell - \Delta \le i_m + \ell_m - \Delta_m$.

  1. If $E$ is satisfied, then $E_1, \ldots, E_m$ are satisfied.

  2. If each $E_t$ is satisfied with shortage $\Delta_t$, then $E$ is satisfied with shortage $\Delta$.

Proof.

(a) For each $t$, $T[i_t..i_t+\ell_t-1]$ is clearly a subfragment of $T[i..i+\ell-1]$. Since $E_t$ has the same shift as $E$, $T[j_t..j_t+\ell_t-1]$ is the corresponding subfragment of $T[j..j+\ell-1]$. Thus, $E_t$ must be satisfied if $E$ is.

(b) We inductively prove that $E_1, \ldots, E_t$ cover $T[i..i_t+\ell_t-\Delta_t-1]$. This claim is trivial for $t = 1$, while the inductive step easily follows from the fact that $i_{t+1} \le i_t + \ell_t - \Delta_t$. Consequently, we obtain that $E_1, \ldots, E_m$ cover $T[i..i_m+\ell_m-\Delta_m-1]$. By the definition of $\Delta$, the latter fragment covers $T[i..i+\ell-\Delta-1]$. Since all equations have the same shift, this yields (b). ∎

A set of substring equations on the same text is called a system. A particular text satisfies the system (with shortage $\Delta$) if it satisfies all its member equations (with shortage $\Delta$, respectively). A system is called uniform if all equations in the system have the same length. Uniform systems can be obtained as follows by splitting longer equations using Fact 4.1:

Corollary 4.2.

Let $E$ be an equation of length $\ell$. For every positive integer $\ell' \le \ell$, there exists a uniform system $\mathcal{E}$ of $O(\ell/\ell')$ equations of length $\ell'$ such that if $E$ is satisfied, then $\mathcal{E}$ is satisfied, and if $\mathcal{E}$ is satisfied with shortage $\Delta$ ($0 \le \Delta \le \lfloor \ell'/2 \rfloor$), then $E$ is also satisfied with shortage $\Delta$. Moreover, $\mathcal{E}$ can be computed in $O(\ell/\ell')$ time.

Proof.

Let $E \colon T[i..i+\ell-1] = T[j..j+\ell-1]$. We include in $\mathcal{E}$ the equations of length $\ell'$ whose starting positions are $i + t\lfloor \ell'/2 \rfloor$ (paired with $j + t\lfloor \ell'/2 \rfloor$) for consecutive integers $t \ge 0$ with $t\lfloor \ell'/2 \rfloor \le \ell - \ell'$, as well as the final equation with starting positions $i + \ell - \ell'$ and $j + \ell - \ell'$. It is easy to see that these equations satisfy Fact 4.1 with $\Delta_t = \Delta$ provided that $\Delta \le \lceil \ell'/2 \rceil$ (e.g., if $\Delta \le \lfloor \ell'/2 \rfloor$). ∎
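The splitting of this proof can be sketched as follows; starting positions advance by half the target length, so that the overlap required by Fact 4.1 is preserved.

```python
# A minimal sketch of the splitting behind Corollary 4.2.
def split_equation(i, j, ell, ell2):
    """Split (i, j, ell) into a uniform system of equations (i', j', ell2)
    with the same shift j - i; requires 0 < ell2 <= ell."""
    step = max(1, ell2 // 2)
    starts = list(range(0, ell - ell2, step)) + [ell - ell2]
    return [(i + t, j + t, ell2) for t in starts]

# Example: an equation of length 10 split into overlapping equations of
# length 4, starting every 2 positions.
print(split_equation(0, 7, 10, 4))
# [(0, 7, 4), (2, 9, 4), (4, 11, 4), (6, 13, 4)]
```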

The following result, introduced in a simpler version by I et al. [17], relates a chain of substring equations to periods of certain fragments.

Lemma 4.3.

Let $\{E_1, \ldots, E_m\}$ be a uniform system of substring equations of length $\ell$ with $E_t \colon T[i_t..i_t+\ell-1] = T[i_{t+1}..i_{t+1}+\ell-1]$. Assuming $i_1 \ne i_{m+1}$, let us define $p = |i_{m+1} - i_1|$ and $X = T[\min(i_1, i_{m+1})..\max(i_1, i_{m+1})+\ell-1]$.

  1. If the system is satisfied, then $p$ is a period of $X$.

  2. For a positive integer $\Delta$, if $p$ is a period of $X$ and $\{E_1, \ldots, E_{m-1}\}$ is satisfied with shortage $\Delta$, then $E_m$ is satisfied with shortage $\Delta$.

Proof.

For $1 \le t \le m$, let us define $s_t = i_{t+1} - i_t$; we also have a common length $\ell$ of all the equations, since the system is uniform. Note that $\sum_{t=1}^{m} s_t = i_{m+1} - i_1$ and that $|X| = \ell + p$. Before we proceed, let us show the following auxiliary result:

Claim.

If $E_1, \ldots, E_{t-1}$ are satisfied with shortage $\Delta$, then $T[i_1..i_1+\ell-1-\Delta] = T[i_t..i_t+\ell-1-\Delta]$, for each $1 \le t \le m+1$.

Proof of the claim.

We provide an inductive proof. For $t = 1$, the two fragments coincide, which trivially yields the inductive base. Now, suppose that the claim is satisfied for $t$, i.e., $T[i_1..i_1+\ell-1-\Delta] = T[i_t..i_t+\ell-1-\Delta]$. The fact that $E_t$ is satisfied with shortage $\Delta$ implies $T[i_t..i_t+\ell-1-\Delta] = T[i_{t+1}..i_{t+1}+\ell-1-\Delta]$. Combining the two equalities further yields $T[i_1..i_1+\ell-1-\Delta] = T[i_{t+1}..i_{t+1}+\ell-1-\Delta]$, which concludes the proof of the claim. ∎

Let us proceed to the proofs of (a) and (b).

(a) The claim implies $T[i_1..i_1+\ell-1] = T[i_m..i_m+\ell-1]$, while $E_m$ further yields $T[i_1..i_1+\ell-1] = T[i_{m+1}..i_{m+1}+\ell-1]$. Consequently, $p$ must be a period of $X$.

(b) The claim implies $T[i_1..i_1+\ell-1-\Delta] = T[i_m..i_m+\ell-1-\Delta]$. The fact that $X$ has period $p$ yields $T[i_1..i_1+\ell-1] = T[i_{m+1}..i_{m+1}+\ell-1]$. Combining these two equalities, we conclude that $T[i_m..i_m+\ell-1-\Delta] = T[i_{m+1}..i_{m+1}+\ell-1-\Delta]$, i.e., that $E_m$ is satisfied with shortage $\Delta$. ∎

4.2 Period constraints

The fact that a fragment $T[i..j]$ has period $p$ can be expressed as a substring equation $T[i..j-p] = T[i+p..j]$. This condition is particularly important if $p$ does not exceed the length $j - i + 1 - p$ of the underlying equation. Thus, if for an equation the absolute value of the shift does not exceed the length, we say that the equation is a (substring) period constraint, which concerns $T[i..j]$ and enforces the period given by the absolute value of the shift. The Periodicity Lemma, a classic tool of combinatorics on words, lets us essentially replace several period constraints enforcing different periods with a single period constraint enforcing the greatest common divisor of these periods.

Lemma 4.4 (Periodicity Lemma, Fine and Wilf [11]).

If a string of length $n$ has periods $p$ and $q$ such that $p + q - \gcd(p, q) \le n$, then it also has $\gcd(p, q)$ as a period.

Lemma 4.5.

Let $\mathcal{P}$ be a system of period constraints concerning a fragment $X$ and enforcing periods $p_1, \ldots, p_m$, and let $C$ be a constraint enforcing $g = \gcd(p_1, \ldots, p_m)$ as a period of $X$.

  1. $\mathcal{P}$ is satisfied if and only if $C$ is satisfied.

  2. For every positive integer $\Delta$, if $C$ is satisfied with shortage $\Delta$, then $\mathcal{P}$ is satisfied with shortage $\Delta$.

Proof.

Note that $C$ is satisfied with shortage $\Delta$ whenever the prefix of $X$ of length $|X| - \Delta$ has period $g$ (not necessarily proper). Thus, the fact that $C$ is satisfied with shortage $\Delta$ means that this prefix has period $g$ and consequently all the periods $p_1, \ldots, p_m$ (as $g \mid p_t$). Consequently, $\mathcal{P}$ is satisfied with shortage $\Delta$, as claimed in (b). The 'if' part of (a) is a special case for $\Delta = 0$, so let us proceed to the proof of the 'only if' part. Suppose that $\mathcal{P}$ is satisfied. Iteratively applying the Periodicity Lemma (Lemma 4.4) to $X$ with periods $p_t$ and $\gcd(p_1, \ldots, p_{t-1})$, we conclude that $g$ is a period of $X$, i.e., that $C$ is satisfied. Note that the Periodicity Lemma is applicable due to the fact that $2p_t \le |X|$ by the definition of a period constraint. ∎
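The reduction of Lemma 4.5 is easy to make concrete; the sketch below (with a naive period check in place of substring equations) collapses several enforced periods into their gcd.

```python
# A minimal sketch of Lemma 4.5's gcd reduction for period constraints.
from math import gcd
from functools import reduce

def combined_period(periods):
    """The single period enforced in place of the whole system."""
    return reduce(gcd, periods)

def has_period(x, p):
    """Naive check that p is a period of the string x."""
    return all(x[t] == x[t + p] for t in range(len(x) - p))

x = "abababab"          # periods 4 and 6 together force period gcd(4,6) = 2
assert has_period(x, 4) and has_period(x, 6)
assert has_period(x, combined_period([4, 6]))
```

Note that the example satisfies the Periodicity Lemma's length requirement: $4 + 6 - \gcd(4, 6) = 8 \le |x|$.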

4.3 Graph Spanners

In this section, we recall a classic construction of a $(2k-1)$-multiplicative graph spanner of size at most $O(|V|^{1+1/k})$. The underlying original idea is by Awerbuch [2]. These spanners have already been implicitly applied in [17] to verify substring equations. Our setting requires some extra information concerning the paths witnessing that each edge has low enough stretch. Thus, we formulate the algorithm in a non-standard way, formalizing the contribution of [17].

Let $G = (V, E)$ be an undirected multigraph. We interpret each edge of $G$ as a pair of inverse arcs. For a subset $F \subseteq E$, we define $A(F)$ to be the set of arcs corresponding to the edges in $F$. For an arc $a$, we define $a^{-1}$ as the arc inverse to $a$ (this way, $(a^{-1})^{-1} = a$). We say that $w$ is an oriented weight function if