Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing

Footnote 1: A preliminary version of this paper appeared in the Proceedings of the 28th Annual Symposium on Combinatorial Pattern Matching, 2017.

Philip Bille
phbi@dtu.dk
Supported by the Danish Research Council (DFF – 4005-00267, DFF – 1323-00178)
   Mikko Berggren Ettienne
miet@dtu.dk
Supported by the Danish Research Council (DFF – 4005-00267)
   Inge Li Gørtz
inge@dtu.dk
   Hjalte Wedel Vildhøj
hwvi@dtu.dk
Abstract

Given a string , the compressed indexing problem is to preprocess into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let , and denote the size of the input string, and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string of length , we can solve the problem in

  1. time using space, or

  2. time using space, for any

In particular, (i) improves the leading term in the query time of the previous best solution from to at the cost of increasing the space by a factor . Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of . However, for any polynomial compression ratio, i.e., , for constant , this becomes . Our index also supports extraction of any substring of length in time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

1 Introduction

Given a string , the compressed indexing problem is to preprocess into a compressed representation that supports fast substring queries, that is, given a string , report all occurrences of substrings in that match . Here the compressed representation can be any compression scheme or measure (kth order entropy, smallest grammar, Lempel-Ziv, etc.). The goal is to use little space relative to the compressed size of while supporting fast queries. Compressed indexing is a key computational primitive for querying massive data sets and the area has received significant attention over the last decades with numerous theoretical and practical solutions, see e.g. [26, 14, 30, 24, 15, 16, 22, 23, 17, 35, 31, 11, 28, 20, 25, 5] and the surveys [35, 33, 34, 21].

The Lempel-Ziv 1977 compression scheme (LZ77) [39] is a classic compression scheme based on replacing repetitions by references in a greedy left-to-right order. Numerous variants of LZ77 have been developed and several widely used implementations are available (such as gzip [1]). Recently, LZ77 has been shown to be particularly effective at handling highly-repetitive data sets [31, 33, 28, 10, 4] and LZ77 compression is always at least as powerful as any grammar representation [38, 9].

In this paper, we consider compressed indexing based on LZ77 compression. Relatively few results are known for this version of the problem. Let , , and denote the size of the input string, the compressed LZ77 string, and the pattern string, respectively. Kärkkäinen and Ukkonen introduced the problem in 1996 [26] and gave an initial solution that required read-only access to the uncompressed text. Interestingly, this work is among the first results in compressed indexing [35]. More recently, Gagie et al. [19, 20] revisited the problem and gave a solution using space and query time , where is the number of occurrences of in . Note that these bounds assume a constant sized alphabet.

1.1 Our Results

We show the following main result.

Theorem 1.

Given a string of length from a constant sized alphabet compressed using LZ77 into a string of length we can build a compressed index supporting substring queries in:

  1. time using space, or

  2. time using space, for any

Compared to the previous bounds Thm. 1 obtains new interesting trade-offs. In particular, Thm. 1 (i) improves the leading term in the query time of the previous best solution from to at the cost of increasing the space by only a factor . Alternatively, Thm. 1 (ii) matches the previous best space bound, but has a leading term in the query time of . However, for any polynomial compression ratio, i.e., , for constant , this becomes .

Gagie et al. [20] also showed how to extract an arbitrary substring of of length in time . We show how to support the same extraction operation and slightly improve the time to .

Technically, our results are obtained by new variants and extensions of existing data structures in novel combinations. In particular, we consider a batched variant of the weak prefix search problem and give the first non-trivial solution to it. We also generalize the well-known bidirectional compact trie search technique [29] to reduce the number of queries at the cost of increasing space. Finally, we show how to combine this efficiently with range reporting and fast random-access in a balanced grammar leading to the result.

As mentioned, all of the above bounds hold for a constant size alphabet. However, Thm. 1 is an instance of a full time-space trade-off that also supports general alphabets. We discuss the details in Sec. 8.

2 Preliminaries

We assume a standard unit-cost RAM model with word size and that the input is from an integer alphabet and measure space complexity in words unless otherwise specified.

A string of length is a sequence of characters drawn from . The string denoted is called a substring of . is the empty string and while when . The substrings and are the prefix and the suffix of respectively. The reverse of the string is denoted . We use the results from Fredman et al. [18] when referring to perfect hashing allowing us to build a dictionary on constant sized keys in expected time supporting constant time lookups.

2.1 Compact Tries

A trie for a set of strings is a rooted tree where the vertices correspond to the prefixes of the strings in . denotes the prefix corresponding to the vertex . if is the root while is the parent of if is equal to without the last character. This character is then the label of the edge from to . The depth of vertex is the number of edges on the path from to the root.

We assume each string in is terminated by a special character such that each string in corresponds to a leaf. The children of each vertex are sorted from left to right in increasing lexicographical order, and therefore the left to right order of the leaves corresponds to the lexicographical order of the strings in . Let rank denote the rank of the string in this order.

A compact trie for denoted  is obtained from the trie by removing all vertices with exactly one child excluding the root and replacing the two edges incident to with a single edge from its parent to its child. This edge is then labeled with the concatenation of the edge labels it replaces, thus the edges of a compact trie may be labeled by strings. The skip interval of a vertex with parent is denoted and if is the root. The locus of a string in , denoted locus, is the minimum depth vertex such that is a prefix of str. If there is no such vertex, then .

In order to reduce the space used by  we only store the first character of every edge and in every vertex we store (This variation is also known as a PATRICIA tree [32]). We navigate by storing a dictionary in every internal vertex mapping the first character of the label of an edge to the respective child. The size of  is .

2.2 Karp-Rabin Fingerprints

A Karp-Rabin fingerprinting function [27] is a randomized hash function for strings. We use a variation of the original definition appearing in Porat and Porat [37]. The fingerprint for a string of length is defined as:

where is a prime and is a random integer in (the field of integers modulo ). Storing the values , and along with a fingerprint allows for efficient composition and subtraction of fingerprints:

Lemma 1 (Porat and Porat [37], Breslauer and Galil [7]).

Let be strings such that . Given two of the three fingerprints , the third can be computed in constant time.

It follows that we can compute and store the fingerprints of each of the prefixes of a string of length in time and space such that we afterwards can compute the fingerprint of any substring in constant time. We say that the fingerprints of the strings and collide when and . A fingerprinting function is collision-free for a set of strings if there are no fingerprint collisions between any of the strings.
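The prefix-fingerprint idea above and the composition rule of Lemma 1 can be sketched in Python as follows. This is a minimal illustration, not the construction used in the paper: it assumes the polynomial form in which a string of length k has fingerprint equal to the sum of S[i]·r^i for i = 1, ..., k, taken modulo p. The class name `PrefixFingerprints`, the helper `fingerprint`, and the fixed modulus 2^61 − 1 are choices made for this example only. Inverse powers of r are stored so that a substring fingerprint needs no modular inversion at query time.

```python
import random

# Assumed modulus for this sketch: a Mersenne prime. The paper chooses the
# prime as a function of the string length; this value is only illustrative.
P = (1 << 61) - 1


def fingerprint(s, r, p=P):
    """Directly evaluate sum of s[i] * r^i (1-based i) modulo p."""
    total, power = 0, 1
    for ch in s:
        power = power * r % p
        total = (total + ord(ch) * power) % p
    return total


class PrefixFingerprints:
    """Fingerprints of all prefixes of s, for constant-time substring queries."""

    def __init__(self, s, r=None, p=P):
        self.p = p
        self.r = r if r is not None else random.randrange(1, p)
        r_inv = pow(self.r, p - 2, p)      # inverse of r modulo the prime p
        n = len(s)
        self.inv_pow = [1] * (n + 1)       # inv_pow[k] = r^(-k) mod p
        self.pre = [0] * (n + 1)           # pre[k] = fingerprint of s[0:k]
        power = 1
        for k, ch in enumerate(s):
            power = power * self.r % p     # r^(k+1)
            self.inv_pow[k + 1] = self.inv_pow[k] * r_inv % p
            self.pre[k + 1] = (self.pre[k] + ord(ch) * power) % p

    def substring(self, i, j):
        """Fingerprint of s[i:j]: subtract the prefix, then shift by r^(-i)."""
        return (self.pre[j] - self.pre[i]) * self.inv_pow[i] % self.p


fp = PrefixFingerprints("abracadabra", r=12345)
# The two occurrences of "abra" receive the same fingerprint.
same = fp.substring(0, 4) == fp.substring(7, 11)
```

Equal substrings always receive equal fingerprints; distinct substrings may collide with small probability, which is what Lemma 2 bounds.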

Lemma 2 (Porat and Porat [37]).

Let and be different strings of length at most and let for some . The probability that is less than .

2.3 Range Reporting

Let be a set of points in a d-dimensional grid. The orthogonal range reporting problem in -dimensions is to compactly represent while supporting range reporting queries, that is, given a rectangle report all points in the set . We use the following results for 2-dimensional range reporting:

Lemma 3 (Chan et al. [8]).

For any set of points in and we can solve 2-d orthogonal range reporting with expected preprocessing time, space and query time where is the number of occurrences inside the rectangle.

2.4 LZ77

The Ziv-Lempel algorithm from 1977 [39] provides a simple and natural way to compress strings.

The LZ77 parse of a string of length is a sequence of consecutive substrings of called phrases such that . is constructed in a left-to-right pass of : Assume that we have found the sequence producing the string and let be the longest prefix of that is also a substring of . Then . The occurrence of in is called the source of the phrase . Thus a phrase is composed of the contents of its possibly empty source and a trailing character, which we call the phrase border. A phrase is typically represented as a triple where start is the starting position of the source, len is the length of the source, and is the border. For a phrase we denote the position of its border by and its source by . For example, the string of length has the LZ77 parse of length 4 which is represented as .
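The greedy left-to-right construction can be illustrated by a naive quadratic-time Python sketch that outputs the (start, len, border) triples. Two points are assumptions of this example rather than statements from the text: a source is allowed to overlap the phrase it defines, and the input ends with a unique terminator so that every phrase has a border character. The function names `lz77_parse` and `lz77_decode` and the test string are hypothetical.

```python
def lz77_parse(s):
    """Return the LZ77 parse of s as a list of (start, length, border) triples."""
    n = len(s)
    phrases = []
    j = 0                                   # start of the next phrase
    while j < n:
        best_start, best_len = 0, 0
        # Try every earlier starting position as a source candidate.
        for start in range(j):
            length = 0
            # Stop one character early so a border character always remains.
            while j + length < n - 1 and s[start + length] == s[j + length]:
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        phrases.append((best_start, best_len, s[j + best_len]))
        j += best_len + 1                   # skip the source copy and the border
    return phrases


def lz77_decode(phrases):
    """Rebuild the string; copying character by character handles overlaps."""
    out = []
    for start, length, border in phrases:
        for k in range(length):
            out.append(out[start + k])
        out.append(border)
    return "".join(out)


parse = lz77_parse("abcabcabc$")
# parse == [(0, 0, 'a'), (0, 0, 'b'), (0, 0, 'c'), (0, 6, '$')]
```

On this length-10 input the sketch produces four phrases, the last of which copies six characters from position 0 and ends with the border `$`.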

3 Prefix Search

The prefix search problem is to preprocess a set of strings such that later, we can find all the strings in the set that are prefixed by some query string. Belazzougui et al. [3] consider the weak prefix search problem, a relaxation of the prefix search problem where we are only requested to output the ranks (in lexicographic order) of the strings that are prefixed by the query pattern and we only require no false negatives. Thus we may answer arbitrarily when no strings are prefixed by the query pattern.
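To make the output of a prefix search query concrete, the following Python sketch returns the range of lexicographic ranks of the strings prefixed by a pattern, using plain binary search over a sorted list. This is an exact baseline, not the weak prefix search structure of Belazzougui et al.: it compares characters and therefore never returns a false positive. The function name `prefix_rank_range` and the example strings are hypothetical.

```python
def prefix_rank_range(sorted_strings, pattern):
    """Return (lo, hi), the inclusive rank range of strings prefixed by
    pattern, or None if no string has pattern as a prefix."""
    m = len(pattern)

    def first_index(pred):
        # Smallest index whose length-m prefix satisfies pred. Truncating
        # to m characters preserves lexicographic order, so pred is monotone.
        lo, hi = 0, len(sorted_strings)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(sorted_strings[mid][:m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    left = first_index(lambda t: t >= pattern)
    right = first_index(lambda t: t > pattern)
    return (left, right - 1) if left < right else None


strings = sorted(["banana$", "band$", "bandana$", "can$", "ban$"])
# strings == ['ban$', 'banana$', 'band$', 'bandana$', 'can$']
band_range = prefix_rank_range(strings, "band")   # (2, 3)
```

A weak prefix search structure returns the same range whenever some string is prefixed by the pattern, but may return an arbitrary answer otherwise.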

Lemma 4 (Belazzougui et al. [3], appendix H.3).

Given a set of strings with average length , from an alphabet of size , we can build a data structure using bits of space supporting weak prefix search for a pattern of length in time where is the word size.

The term stems from preprocessing with an incremental hash function such that the hash of any substring can be obtained in constant time afterwards. Therefore we can do weak prefix search for substrings of in time. We now describe a data structure that builds on the ideas from Lemma 4 but obtains the following:

Lemma 5.

Given a set of strings, we can build a data structure taking space supporting weak prefix search for substrings of a pattern of length in time where is a positive integer.

If we know when building our data structure, we set to and obtain a query time of with Lemma 5.

Before describing our data structure we need the following definition: The 2-fattest number in a nonempty interval of strictly positive integers is the number in the interval whose binary representation has the highest number of trailing zeroes.
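The 2-fattest number of a closed interval can be computed with a constant number of word operations. The Python sketch below uses a standard bit trick: find the highest bit in which l − 1 and r differ, and clear every lower bit of r. The function names `two_fattest` and `trailing_zeros` are chosen for this example, and the closed-interval convention [l, r] with 1 ≤ l ≤ r is an assumption of the sketch.

```python
def trailing_zeros(x):
    """Number of trailing zero bits of a positive integer."""
    return (x & -x).bit_length() - 1


def two_fattest(l, r):
    """Return the number in [l, r] with the most trailing zeros (1 <= l <= r)."""
    if not 1 <= l <= r:
        raise ValueError("expected 1 <= l <= r")
    # Highest bit position where l - 1 and r differ; r has a 1 there.
    top = ((l - 1) ^ r).bit_length() - 1
    # Clearing all lower bits of r yields the unique 2-fattest number.
    return r & (-1 << top)


examples = [two_fattest(5, 7), two_fattest(3, 9), two_fattest(6, 6)]
# examples == [6, 8, 6]
```

The answer is unique because between any two multiples of 2^k there is a multiple of 2^(k+1).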

3.1 Data Structure

Let  be the compact trie representing the set of strings and let be a positive integer. Denote by fat the 2-fattest number in the skip interval of a vertex . The fat prefix of is the length fat prefix of . Denote by  the set of fat prefixes induced by the vertices of . The -prefix of is the shortest prefix of whose length is a multiple of and is in the interval skip. If ’s skip interval does not span a multiple of , then has no -prefix. Let be the set of -prefixes induced by the vertices of . The data structure is the compact trie augmented with:

  • A fingerprinting function .

  • A dictionary mapping the fingerprints of the strings in  to their associated vertex.

  • A dictionary mapping the fingerprints of the strings in to their associated vertex.

  • For every vertex we store the rank in of the string represented by the leftmost and rightmost leaf in the subtree of , denoted and respectively.

The data structure is similar to the one by Belazzougui et al. [3] except for the dictionary , which we use in the first step of our search.

There are at most strings in each of  and thus the total space of the data structure is .

Let be the start of the skip interval of some vertex and define the pseudo-fat numbers of to be the set of 2-fattest numbers in the intervals where . We use Lemma 2 to find a fingerprinting function that is collision-free for the strings in , the strings in and all the length -prefixes of the strings in where is a pseudo-fat number in the skip interval of some vertex .

Observe that the range of strings in that are prefixed by some pattern of length is exactly where . Answering a weak prefix search query for comprises two independent steps. The first step is to find a vertex such that str is a prefix of and . We say that is in -range of . The next step is to apply a slightly modified version of the search technique from Belazzougui et al. [3] to find the exit vertex for , that is, the deepest vertex such that is a prefix of . Having found the exit vertex we can find the locus in constant time as it is either the exit vertex itself or one of its children.

3.2 Finding an -range Vertex

We now describe how to find a vertex in -range of . If we simply report that the root of is in -range of . Otherwise, let be the root of and for we check if and is in in which case we update to be the corresponding vertex. Finally, if we report that is and otherwise we report that is in -range of . In the former case, we report as the range of strings in prefixed by . In the latter case we pass on to the next step of the algorithm.

We now show that the algorithm is correct when prefixes a string in . It is easy to verify that the -prefix of prefixes at all times during the execution of the algorithm. Assume that by the end of the algorithm. We will show that in that case , i.e., that is the highest node prefixed by . Since prefixes a string in , the -prefix of prefixes , and , then prefixes . Since the -prefix of prefixes , does not prefix the parent of and thus is the highest node prefixed by .

Assume now that . We will show that is in -range of . Since prefixes a string in and the -prefix of prefixes , then prefixes . Let be the -prefix of . Since is returned, either or for all . If then is not a -prefix of any node in . Since prefixes a string in this implies that is in the skip interval of , i.e., . This means that for all . Therefore and it follows that . We already proved that prefixes and therefore is in -range of .

In case does not prefix any string in we either report that even though or report that is in -range of because even though is not a prefix of due to fingerprint collisions. This may lead to a false positive. However, false positives are allowed in the weak prefix search problem.

Given that we can compute the fingerprint of substrings of in constant time the algorithm uses time.

3.3 From -range to Exit Vertex

We now consider how to find the exit vertex of hereafter denoted . The algorithm is similar to the one presented in Belazzougui et al. [3] except that we support starting the search from not only the root, but from any ancestor of .

Let be any ancestor of , let be the smallest power of two greater than and let be the largest multiple of no greater than . The search progresses by iteratively halving the search interval while using to maintain a candidate for the exit vertex and to decide in which of the two halves to continue the search.

Let be the candidate for the exit vertex and let and be the left and right boundary for our search interval. Initially , and . When , the search terminates and reports . In each iteration, we consider the mid of the interval and update the interval to either or . There are three cases:

  1. is out of bounds

    1. If set to .

    2. If set to .

  2. , let be the corresponding vertex, i.e. .

    1. If , set to and to .

    2. If , report and terminate.

  3. and thus is not in , set to .

Observe that we are guaranteed that all fingerprint comparisons are collision-free in case prefixes a string in . This is because the lengths of the prefixes whose fingerprints we consider are all either 2-fattest or pseudo-fat in the skip interval of or one of its ancestors, and we use a fingerprinting function that is collision-free for these strings.

3.3.1 Correctness

We now show that the invariant is satisfied and that is a prefix of before and after each iteration. After iterations and thus and therefore . Initially is an ancestor of and thus is a prefix of , and so the invariant is true. Now assume that the invariant is true at the beginning of some iteration and consider the possible cases:

  1. is out of bounds

    1. then because , setting to preserves the invariant.

    2. then setting to preserves the invariant.

  2. , let .

    1. then is a prefix of and thus so setting to and to preserves the invariant.

    2. yet . Then is the locus of .

  3. , and thus is not in . As we are not in any of the out of bounds cases we have . Thus, either and setting to preserves the invariant. Otherwise and thus must be in the skip interval of some vertex on the path from to excluding . But is entirely included in and because is 2-fattest in (footnote 2: if , and is a multiple of then the mid of the interval is 2-fattest in ) it is also 2-fattest in . It follows that which contradicts and thus the invariant is preserved.

Thus if prefixes a string in we find either the exit vertex or the locus of . In the former case the locus of is the child of identified by the character . Having found the vertex we report as the range of strings in prefixed by . In case does not prefix any strings in , the fact that the fingerprint of a prefix of matches the fingerprint of some fat prefix in does not guarantee equality of the strings. There are two possible consequences of this. Either the search successfully finds what it believes to be the locus of even though in which case we report a false positive. Otherwise, there is no child identified by in which case we can correctly report that no strings in are prefixed by , a true negative. Recall that false positives are allowed as we are considering the weak prefix search problem.

3.3.2 Complexity

The size of the interval is halved in each iteration, thus we do at most iterations, where is the vertex from which we start the search. If we use the technique from the previous section to find a starting vertex in -range of , we do iterations. Each iteration takes constant time. Note that if does not prefix a string in we may have fingerprint collisions and we may be given a starting vertex such that does not prefix . This can lead to a false positive, but we still have and therefore the time complexity remains .

3.4 Multiple Substrings

In order to answer weak prefix search queries for substrings of a pattern of length , we first preprocess in time such that we can compute the fingerprint of any substring of in constant time using Lemma 1. We can then answer a weak prefix search query for any substring of in total time using the techniques described in the previous sections. The total time is therefore .

4 Distinguishing Occurrences

The following sections describe our compressed index, which consists of three independent data structures: one that finds long primary occurrences, one that finds short primary occurrences, and one that finds secondary occurrences.

Let be the LZ77 parse of length representing the string of length . If is a phrase of then any substring of is a secondary substring of . These are the substrings of that do not contain any phrase borders. On the other hand, a substring is a primary substring of when there is some phrase where ; these are the substrings that contain one or more phrase borders. Any substring of is either primary or secondary. A primary substring that matches a query pattern is a primary occurrence of while a secondary substring that matches is a secondary occurrence [26].
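The distinction between primary and secondary substrings reduces to asking whether an interval of positions contains a phrase border. The following Python sketch derives the border positions from (start, len, border) triples and classifies an interval with one binary search. The triple layout, the 0-based inclusive intervals, and the function names `border_positions` and `is_primary` are assumptions made for this illustration.

```python
from bisect import bisect_left


def border_positions(phrases):
    """Positions of the phrase borders, given (start, length, border) triples."""
    pos, borders = -1, []
    for _start, length, _border in phrases:
        pos += length + 1          # each phrase is its source copy plus a border
        borders.append(pos)
    return borders


def is_primary(borders, i, j):
    """True if the substring at positions i..j (inclusive) contains a border."""
    k = bisect_left(borders, i)    # first border at position >= i
    return k < len(borders) and borders[k] <= j


# Hypothetical parse with phrases a, b, c, abcabc$ (borders at 0, 1, 2, 9).
borders = border_positions([(0, 0, "a"), (0, 0, "b"), (0, 0, "c"), (0, 6, "$")])
first_bc = is_primary(borders, 1, 2)    # contains the border at position 1
second_bc = is_primary(borders, 4, 5)   # lies strictly inside the last phrase
```

Here the occurrence at positions 1..2 is primary, while the one at positions 4..5 is secondary.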

5 Long Primary Occurrences

For simplicity, we assume that the data structure given in Lemma 5 not only solves the weak prefix search problem, but also answers correctly when the query pattern does not prefix any of the indexed strings. Later in Section 5.3 we will see how to lift this assumption. The following data structure and search algorithm are a variation of the classical bidirectional search technique for finding primary occurrences [26].

5.1 Data Structure

For every phrase the strings are relevant substrings unless there is some longer relevant substring ending at position . If is a relevant substring then the string is the associated suffix. There are at most relevant substrings of and equally many associated suffixes. The primary index comprises the following:

  • A prefix search data structure  on the set of reversed relevant substrings.

  • A prefix search data structure  on the set of associated suffixes.

  • An orthogonal range reporting data structure on the grid. Consider a relevant substring . Let denote the rank of in the lexicographical order of the reversed relevant substrings, let denote the rank of its associated suffix in the lexicographical order of the associated suffixes. Then is a point in and along with it we store the pair , where is the position of the rightmost phrase border contained in .

Note that every point in is induced by some relevant substring and its associated suffix . If some prefix is a suffix of and the suffix is a prefix of then is an occurrence of and we can compute its exact location from and .

5.2 Searching

The data structure can be used to find the primary occurrences of a pattern of length when . Consider the prefix-suffix pairs for and the pair in case is not a multiple of . For each such pair, we do a prefix search for rev and in and , respectively. If either of these two searches report no matches, we move on to the next pair. Otherwise, let , be the ranges reported from the search in and respectively. Now we do a range reporting query on for the rectangle . For each point reported, let be the pair stored with the point. We report as the starting position of a primary occurrence of in .

Finally, in case is not a multiple of , we need to also check the pair . We search for rev in in and in . If the search for rev reports no match we stop. Otherwise, we do a range reporting query as before. For each point reported, let be the pair stored with the point. To check that the occurrence has not been reported before we do as follows. Let be the smallest positive integer such that . If we report as the starting position of a primary occurrence.

5.2.1 Correctness

We claim that the reported occurrences are exactly the primary occurrences of . We first prove that all primary occurrences are reported correctly. Let be a primary occurrence. As it is a primary occurrence, there must be some phrase such that . Let be the smallest positive integer such that . There are two cases: and . If then is a suffix of the relevant substring ending at . Such a relevant substring exists since . Thus its reverse prefixes a string in , while is a prefix of the associated suffix . Therefore, the respective ranks of and in and are plotted as a point in which stores the pair . We will find this point when considering the prefix-suffix pair , , and correctly report as the starting position of a primary occurrence. If then is a suffix of the relevant substring ending in . Such a relevant substring exists since . Thus its reverse prefixes a string in and trivially is a prefix of the associated suffix. It follows as before that the ranks are plotted as a point in storing the pair and that we find this point when considering the pair . When considering we report as the starting position of a primary occurrence if , and thus is correctly reported.

We now prove that all reported occurrences are in fact primary occurrences. Assume that we report for some and as the starting position of a primary occurrence in the first part of the procedure. Then there exist strings and in and respectively such that is suffixed by and is prefixed by . Therefore is the starting position of an occurrence of . The string is a relevant suffix and therefore there exists a border in the interval . Since the occurrence contains the border and it is therefore a primary occurrence. If we report for some as the starting position of a primary occurrence in the second part of the procedure, then is a prefix of a string in . It follows immediately that is the starting point of an occurrence. Since we have , and by the definition of relevant substring there is a border in the interval . Therefore the occurrence contains the border and is primary.

5.2.2 Complexity

We now consider the time complexity of the algorithm described. First we will argue that any primary occurrence is reported at most once and that the search finds at most two points in identifying it. Let be a primary occurrence reported when we considered the prefix-suffix pair , as in the proof of correctness. None of the pairs , where will identify this occurrence as . None of the pairs , where , will identify this occurrence. This is the case since , and from the definition of relevant substrings it follows that if is a phrase, is a relevant substring and , then . Thus there are no relevant substrings that end after and start before . Therefore, only one of the pairs for identifies the occurrence. If then we might also find the occurrence when considering the pair , but we do not report as .

After preprocessing in time, we can do the prefix searches in total time where is a positive integer by Lemma 5. Using the range reporting data structure by Chan et al. [8] each range reporting query takes time where and is the number of points reported. As each such point in one range reporting query corresponds to the identification of a unique primary occurrence of , which happens at most twice for every occurrence we charge to reporting the occurrences. The total time to find all primary occurrences is thus where is the number of primary and secondary occurrences of .

5.3 Prefix Search Verification

The prefix data structure from Lemma 5 gives no guarantees of correct answers when the query pattern does not prefix any of the indexed strings. If the prefix search gives false-positives, we may end up reporting occurrences of that are not actually there. We show how to solve this problem after introducing a series of tools that we will need.

5.3.1 Straight Line Programs

A straight line program (SLP) for a string is a context-free grammar generating the single string .

Lemma 6 (Rytter [38], Charikar et al. [9]).

Given an LZ77 parse of length producing a string of length we can construct an SLP for of size in time .

The construction from Rytter [38] produces a balanced grammar for every consecutive substring of length of after a preprocessing step transforms such that no compression element is longer than . These grammars are then connected to form a single balanced grammar of height which immediately yields extraction of any substring in time . We give a simple solution to reduce this to , that also supports computation of the fingerprint of a substring in time.

Lemma 7.

Given an LZ77 parse of length producing a string of length we can build a data structure that for any substring can extract in time and compute the fingerprint in time. The data structure uses space and construction time.

Proof.

Assume for simplicity that is a multiple of . We construct the SLP producing from . Along with every non-terminal of the SLP we store the size and fingerprint of its expansion. Let be consecutive length substrings of . We store the balanced grammar producing along with the fingerprint at index in a table .

Now we can extract in time and any substring in time . Also, we can compute the fingerprint in time. We can easily do a constant time mapping from a position in to the grammar in producing the substring covering that position and the corresponding position inside the substring. But then any fingerprint can be computed in time . Now consider a substring that starts in and ends in . We extract in time by extracting the appropriate suffix of , all of for and the appropriate prefix of . Each of the fingerprints stored by the data structure can be computed in time after preprocessing in time. Thus table is filled in time and by Lemma 6 the SLPs stored in use a total of space and construction time. ∎

5.3.2 Verification of Fingerprints

We need the following lemma for the verification.

Lemma 8 (Bille et al. [6]).

Given a string of length , we can find a fingerprinting function that is collision-free for all length substrings of where is a power of two in expected time.

5.3.3 Verification Technique

Our verification technique is identical to the one given by Gagie et al. [20] and involves a simple modification of the search for long primary occurrences. By using Lemma 7 instead of bookmarking [20] for extraction and fingerprinting and because we only need to verify strings, the verification procedure takes time and uses space.

Consider the string of length that we wish to index and let be the parse of . The verification data structure is given by Lemma 7. Consider the prefix search data structure as given in Section 5.1 and let be the fingerprinting function used by the prefix search; the case for is symmetric. We alter the search for primary occurrences so that it first does the prefix searches, then verifies the results and discards false positives before moving on to do the range reporting queries on the verified results. We also modify using Lemma 8 to be collision-free for all substrings of the indexed strings whose length is a power of two.

Let be all the suffixes of for which the prefix search found a locus candidate, let the candidates be and let be . Assume that , and let 2-suf and 2-pre denote the fingerprints using of the suffix and prefix respectively of length of some string . The verification progresses in iterations. Initially, let , and for each iteration do as follows:

  1. or : Discard and set and .

  2. and , let .

    1. and : set and .

    2. or : discard and set .

  3. : If all vertices have been discarded, report no matches. Otherwise, let be the last vertex considered, that was not discarded. Report all non-discarded vertices where is no longer than the longest common suffix of and as verified and discard the rest.

Consider the correctness and complexity of the algorithm. In case 1, clearly, does not match and thus must be a false-positive. Now observe that because is a suffix of , it is also a suffix of for any . Thus in case 2 (b), if does not match then must be a false-positive. In case 2 (a), both and may still be false-positives, yet by Lemma 8, is a suffix of because and . Finally, in case , is a true positive if and only if . But any other non-discarded vertex is also only a true positive if and share a length suffix because is a suffix of and is a suffix of .

The algorithm does iterations and fingerprints of substrings of can be computed in constant time after preprocessing. Every vertex represents one or more substrings of . If we store the starting index in of one of these substrings in when constructing we can compute the fingerprint of any substring by computing the fingerprint of where is the starting index of one of the substrings of that represents. By Lemma 7, the fingerprint computations take time and because the total time complexity of the algorithm is .

6 Short Primary Occurrences

We now describe a simple data structure that can find primary occurrences of in time using space whenever where is a positive integer.

Let be the LZ77 parse of the string of length . Let and define to be the union of the strings where for . There are at most such strings, each of length and they are all suffixes of the length substrings of starting positions before each border position. We store these substrings along with the compact trie over the strings in . The edge labels of are compactly represented by storing references into one of the substrings. Every leaf stores the starting position in of the string it represents and the position of the leftmost border it contains.

The combined size of and the substrings we store is , and we simply search for by navigating vertices using perfect hashing [18] and matching edge labels character by character. Now either , in which case there are no primary occurrences of in ; otherwise, for some vertex and thus every leaf in the subtree of represents a substring of that is prefixed by . Using the indices stored with the leaves, we can determine the starting position of each occurrence and whether it is primary or secondary. Because the strings in start at distinct positions in , we find each occurrence only once. Also, it is easy to see that we find all primary occurrences because of how the strings in are chosen. It follows that the time complexity is where is the number of primary and secondary occurrences.
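The search for short primary occurrences can be sketched as follows. This is a simplified illustration under stated assumptions, not the data structure analysed above: it uses an uncompacted trie with Python dictionaries in place of a compact trie with perfect hashing, it stores one string per starting position, and the names `build_short_index`, `short_primary_occurrences`, `borders` and `tau` are hypothetical. Positions are 0-based, `borders` is assumed to be the sorted list of last positions of the phrases, and the pattern is assumed to have length at most `tau`.

```python
import bisect


def new_node():
    return {"children": {}, "leaves": []}


def build_short_index(s, borders, tau):
    """Trie over the strings that start fewer than tau positions before a border."""
    root = new_node()
    for i in range(len(s)):
        k = bisect.bisect_left(borders, i)
        if k == len(borders):
            break                      # no border at or after position i
        b = borders[k]                 # leftmost border at or after i
        if b - i >= tau:
            continue                   # too far from the border for a short pattern
        node = root
        for ch in s[i:b + tau]:
            node = node["children"].setdefault(ch, new_node())
        node["leaves"].append((i, b))  # start position and leftmost border
    return root


def short_primary_occurrences(root, pattern):
    """Starting positions of occurrences of pattern that contain a border."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    m = len(pattern)
    result, stack = [], [node]
    while stack:
        cur = stack.pop()
        for i, b in cur["leaves"]:
            if b <= i + m - 1:         # the occurrence covers the border: primary
                result.append(i)
        stack.extend(cur["children"].values())
    return sorted(result)


# Example on "abracadabra" with assumed phrases a|b|r|ac|ad|abra.
index = build_short_index("abracadabra", [0, 1, 2, 4, 6, 10], tau=2)
```

Here an entry whose leftmost border lies beyond the end of the match corresponds to a secondary occurrence and is filtered out, mirroring the use of the indices stored with the leaves.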

7 The Secondary Index

Let be the LZ77 parse of length representing the string of length . We find the secondary occurrences by applying the most recent range reporting data structure by Chan et al. [8] to the technique described by Kärkkäinen and Ukkonen [26], which is inspired by the ideas of Farach and Thorup [13].

Let be the starting positions of the occurrences of in ordered increasingly. Assume that is a secondary occurrence such that . Then by definition, is a substring of the prefix of some phrase and there must be an occurrence of in the source of that phrase. More precisely, let be the source of the phrase ; then is an occurrence of for some . We say that , which may be primary or secondary, is the source occurrence of the secondary occurrence given the LZ77 parse of . Thus every secondary occurrence has a source occurrence. Note that it follows from the definition that no primary occurrence has a source occurrence.

We find the secondary occurrences as follows: Build a range reporting data structure on the grid and if is a phrase with source we plot a point and along with it we store the phrase start .

Now for each primary occurrence found by the primary index, we query for the rectangle . The points returned are exactly the occurrences having as source. For each point and phrase start reported, we report an occurrence and recurse on to find all the occurrences having as source.

Because no primary occurrence has a source, while every secondary occurrence has one, we find exactly the secondary occurrences.
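The recursion over source occurrences can be sketched as follows. As a simplifying assumption, the sketch replaces the range reporting data structure with a linear scan over all phrases, so it illustrates only the logic and not the stated query time. The function name `secondary_occurrences` and the phrase encoding `(phrase_start, source_start, copy_len)`, with 0-based positions and `copy_len` counting only the copied prefix of the phrase, are hypothetical.

```python
def secondary_occurrences(primary_starts, m, phrases):
    """Find all secondary occurrences of a length-m pattern.

    phrases: list of (phrase_start, source_start, copy_len), meaning that the
    copy_len characters starting at phrase_start were copied from source_start.
    """
    found = []
    stack = list(primary_starts)
    while stack:
        q = stack.pop()
        # Stand-in for the rectangle query: phrases whose source contains
        # the occurrence starting at q and ending at q + m - 1.
        for phrase_start, source_start, copy_len in phrases:
            if source_start <= q and q + m - 1 <= source_start + copy_len - 1:
                occ = phrase_start + (q - source_start)
                found.append(occ)
                stack.append(occ)      # recurse: occ may itself be a source
    return sorted(found)


# Example on "abracadabra" with assumed phrases a|b|r|ac|ad|abra, where
# "ac" and "ad" copy "a" from position 0 and "abra" copies "abr" from position 0.
phrases = [(3, 0, 1), (5, 0, 1), (7, 0, 3)]
```

For the pattern "a", whose primary occurrences in this example start at positions 0 and 10, calling `secondary_occurrences([0, 10], 1, phrases)` yields the remaining occurrences at positions 3, 5 and 7.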

The range reporting structure is built using Lemma 3 with and uses space . Exactly one range reporting query is performed for each primary and secondary occurrence, each taking where is the number of points reported. Each reported point identifies a secondary occurrence, so the total time is .

8 The Compressed Index

We obtain our final index by combining the primary index, the verification data structure and the secondary index. We use the transformed LZ77 parse generated by Lemma 6 when building our primary index. Hence no phrase is longer than , and so any primary occurrence of has a prefix where that is a suffix of some phrase. It follows that we need only consider the multiples for when searching for long primary occurrences. This yields the following complexities:

  • time and space for the index finding long primary occurrences where and are positive integers and .

  • time and space for the index finding short primary occurrences.

  • time and space for the verification data structure.

  • time and space for the secondary index.

If we fix at we have in which case we obtain the following trade-off simply by combining the above complexities.

Theorem 2.

Given a string of length from an alphabet of size compressed using LZ77 to a string of length we can build a compressed index supporting substring queries in