
Time-Space Trade-Offs for Lempel-Ziv Compressed Indexing1

Abstract

Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let $n$ and $z$ denote the size of the input string and the compressed LZ77 string, respectively. We obtain the following time-space trade-offs. Given a pattern string $P$ of length $m$, we can solve the problem in

  1. $O(m + occ \lg\lg n)$ time using $O(z \lg(n/z) \lg\lg z)$ space, or

  2. $O(m(1 + \frac{\lg^{\epsilon} z}{\lg(n/z)}) + occ(\lg\lg n + \lg^{\epsilon} z))$ time using $O(z \lg(n/z))$ space, for any constant $0 < \epsilon < 1$.

In particular, (i) improves the leading term in the query time of the previous best solution from $O(m \lg m)$ to $O(m)$ at the cost of increasing the space by a factor $\lg\lg z$. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1 + \frac{\lg^{\epsilon} z}{\lg(n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$ for constant $\delta > 0$, this becomes $O(m)$. Our index also supports extraction of any substring of length $\ell$ in $O(\ell + \lg(n/z))$ time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

1 Introduction

Given a string $S$, the compressed indexing problem is to preprocess $S$ into a compressed representation that supports fast substring queries, that is, given a pattern string $P$, report all occurrences of substrings in $S$ that match $P$. Here the compressed representation can be any compression scheme or measure ($k$th order entropy, smallest grammar, Lempel-Ziv, etc.). The goal is to use little space relative to the compressed size of $S$ while supporting fast queries. Compressed indexing is a key computational primitive for querying massive data sets, and the area has received significant attention over the last decades with numerous theoretical and practical solutions, see e.g. [26] and the surveys [35].

The Lempel-Ziv 1977 compression scheme (LZ77) [39] is a classic compression scheme based on replacing repetitions by references in a greedy left-to-right order. Numerous variants of LZ77 have been developed and several widely used implementations are available (such as gzip [1]). Recently, LZ77 has been shown to be particularly effective at handling highly-repetitive data sets [31] and LZ77 compression is always at least as powerful as any grammar representation [38].

In this paper, we consider compressed indexing based on LZ77 compression. Relatively few results are known for this version of the problem. Let $n$, $z$, and $m$ denote the size of the input string, the compressed LZ77 string, and the pattern string, respectively. Kärkkäinen and Ukkonen introduced the problem in 1996 [26] and gave an initial solution that required read-only access to the uncompressed text. Interestingly, this work is among the first results in compressed indexing [35]. More recently, Gagie et al. [20] revisited the problem and gave a solution using $O(z \lg(n/z))$ space and query time $O(m \lg m + occ \lg\lg n)$, where $occ$ is the number of occurrences of $P$ in $S$. Note that these bounds assume a constant sized alphabet.

1.1 Our Results

We show the following main result.

Compared to the previous bounds, Thm. ? obtains new interesting trade-offs. In particular, Thm. ? (i) improves the leading term in the query time of the previous best solution from $O(m \lg m)$ to $O(m)$ at the cost of increasing the space by only a factor $\lg\lg z$. Alternatively, Thm. ? (ii) matches the previous best space bound, but has a leading term in the query time of $O(m(1 + \frac{\lg^{\epsilon} z}{\lg(n/z)}))$. However, for any polynomial compression ratio, i.e., $z = O(n^{1-\delta})$ for constant $\delta > 0$, this becomes $O(m)$.

Gagie et al. [20] also showed how to extract an arbitrary substring of $S$ of length $\ell$ in time $O(\ell + \lg n)$. We show how to support the same extraction operation and slightly improve the time to $O(\ell + \lg(n/z))$.

Technically, our results are obtained by new variants and extensions of existing data structures in novel combinations. In particular, we consider a batched variant of the weak prefix search problem and give the first non-trivial solution to it. We also generalize the well-known bidirectional compact trie search technique [29] to reduce the number of queries at the cost of increasing space. Finally, we show how to combine this efficiently with range reporting and fast random access in a balanced grammar, leading to the result.

As mentioned, all of the above bounds hold for a constant sized alphabet. However, Thm. ? is an instance of a full time-space trade-off that also supports general alphabets. We discuss the details in Section 8.

2 Preliminaries

We assume a standard unit-cost RAM model with word size $\Theta(\lg n)$ and that the input is from an integer alphabet $\Sigma = \{1, \ldots, n^{O(1)}\}$. We measure space complexity in words unless otherwise specified.

A string $S$ of length $n$ is a sequence $S[1] \ldots S[n]$ of characters drawn from $\Sigma$. The string $S[i] \ldots S[j]$, denoted $S[i, j]$, is called a substring of $S$. $\varepsilon$ is the empty string and $S[i, i] = S[i]$, while $S[i, j] = \varepsilon$ when $i > j$. The substrings $S[1, i]$ and $S[i, n]$ are the prefixes and the suffixes of $S$, respectively. The reverse of the string $S$ is denoted $\operatorname{rev}(S)$. We use the results from Fredman et al. [18] when referring to perfect hashing, allowing us to build a dictionary on constant sized keys in expected linear time supporting constant time lookups.

2.1 Compact Tries

A trie for a set of strings $D$ is a rooted tree where the vertices correspond to the prefixes of the strings in $D$. $\operatorname{str}(v)$ denotes the prefix corresponding to the vertex $v$; $\operatorname{str}(v) = \varepsilon$ if $v$ is the root, while $u$ is the parent of $v$ if $\operatorname{str}(u)$ is equal to $\operatorname{str}(v)$ without the last character. This character is then the label of the edge from $u$ to $v$. The depth of a vertex $v$ is the number of edges on the path from $v$ to the root.

We assume each string in $D$ is terminated by a special character $\$ \notin \Sigma$ such that each string in $D$ corresponds to a leaf. The children of each vertex are sorted from left to right in increasing lexicographical order, and therefore the left to right order of the leaves corresponds to the lexicographical order of the strings in $D$. Let $\operatorname{rank}(s)$ denote the rank of the string $s \in D$ in this order.

A compact trie for $D$, denoted $T(D)$, is obtained from the trie by removing each vertex $v$ with exactly one child (excluding the root) and replacing the two edges incident to $v$ with a single edge from its parent to its child. This edge is then labeled with the concatenation of the edge labels it replaces; thus the edges of a compact trie may be labeled by strings. The skip interval of a vertex $v$ with parent $u$ is denoted $\operatorname{skip}(v) = (|\operatorname{str}(u)|, |\operatorname{str}(v)|]$, and $\operatorname{skip}(v) = [0, 0]$ if $v$ is the root. The locus of a string $p$ in $T(D)$, denoted $\operatorname{locus}(p)$, is the minimum depth vertex $v$ such that $p$ is a prefix of $\operatorname{str}(v)$. If there is no such vertex, then $\operatorname{locus}(p) = \bot$.

In order to reduce the space used by $T(D)$ we only store the first character of every edge, and in every vertex $v$ we store $|\operatorname{str}(v)|$ (this variation is also known as a PATRICIA tree [32]). We navigate $T(D)$ by storing a dictionary in every internal vertex mapping the first character of the label of an edge to the respective child. The size of $T(D)$ is $O(|D|)$.
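
To make the PATRICIA representation concrete, here is a minimal Python sketch of a compact trie vertex as described above; the field and helper names are our own illustrative assumptions, not notation from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    # PATRICIA-style compact trie vertex: we keep only |str(v)| and, per
    # outgoing edge, the first character of its label (the dict key).
    str_len: int                                   # |str(v)|
    children: dict = field(default_factory=dict)   # first char -> child Vertex

def child(v, c):
    # Navigate by the first character of an edge label; a hash table
    # lookup stands in for the perfect hashing used in the paper.
    return v.children.get(c)
```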

2.2 Karp-Rabin Fingerprints

A Karp-Rabin fingerprinting function [27] is a randomized hash function for strings. We use a variation of the original definition appearing in Porat and Porat [37]. The fingerprint $\phi(S)$ for a string $S$ of length $n$ is defined as:

$$\phi(S) = \sum_{i=1}^{n} S[i] \cdot r^{i} \bmod p$$

where $p$ is a prime and $r$ is a random integer in $\mathbb{Z}_p$ (the field of integers modulo $p$). Storing the values $|S|$, $r^{|S|} \bmod p$ and $r^{-|S|} \bmod p$ along with a fingerprint allows for efficient composition and subtraction of fingerprints:

$$\phi(S[i+1, j]) = r^{-i} \left( \phi(S[1, j]) - \phi(S[1, i]) \right) \bmod p$$

It follows that we can compute and store the fingerprints of each of the prefixes of a string $S$ of length $n$ in $O(n)$ time and space such that we can afterwards compute the fingerprint of any substring of $S$ in constant time. We say that the fingerprints of two strings $S$ and $S'$ collide when $S \neq S'$ and $\phi(S) = \phi(S')$. A fingerprinting function is collision-free for a set of strings if there are no fingerprint collisions between any of the strings.
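
As an illustration of the scheme just described, the following Python sketch stores prefix fingerprints so the fingerprint of any substring can be composed in constant time. The class name and the choice of the Mersenne prime $2^{61} - 1$ are assumptions made for the sketch.

```python
import random

class KarpRabin:
    # Prefix fingerprints phi(S[1,i]) = sum_{j<=i} S[j] * r^j mod p,
    # stored so any substring fingerprint is composable in O(1).
    def __init__(self, s, p=(1 << 61) - 1):        # p: a large prime
        self.p, self.r = p, random.randrange(1, p)
        n = len(s)
        self.prefix = [0] * (n + 1)                # prefix[i] = phi(s[:i])
        self.rpow = [1] * (n + 1)                  # rpow[i] = r^i mod p
        for i in range(1, n + 1):
            self.rpow[i] = self.rpow[i - 1] * self.r % p
            self.prefix[i] = (self.prefix[i - 1] + ord(s[i - 1]) * self.rpow[i]) % p

    def fingerprint(self, i, j):
        # Canonical fingerprint of s[i:j] (0-indexed, half-open): subtract
        # prefixes and divide by r^i so equal substrings hash equally.
        diff = (self.prefix[j] - self.prefix[i]) % self.p
        return diff * pow(self.rpow[i], -1, self.p) % self.p
```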

2.3 Range Reporting

Let $A$ be a set of points in a $d$-dimensional grid. The orthogonal range reporting problem in $d$ dimensions is to compactly represent $A$ while supporting range reporting queries, that is, given a rectangle $R$, report all points in the set $A \cap R$. We use the following results for 2-dimensional range reporting:

2.4 LZ77

The Ziv-Lempel algorithm from 1977 [39] provides a simple and natural way to compress strings.

The LZ77 parse of a string $S$ of length $n$ is a sequence $Z = f_1 \ldots f_z$ of $z$ substrings of $S$ called phrases such that $S = f_1 f_2 \cdots f_z$. $Z$ is constructed in a left to right pass of $S$: assume that we have found the sequence $f_1 \ldots f_{i-1}$ producing the string $S[1, k-1]$ and let $S[k, k+p-1]$ be the longest prefix of $S[k, n-1]$ that is also a substring of $S[1, k+p-2]$. Then $f_i = S[k, k+p]$. The occurrence of $S[k, k+p-1]$ in $S[1, k+p-2]$ is called the source of the phrase $f_i$. Thus a phrase is composed of the contents of its (possibly empty) source and a trailing character which we call the phrase border, and it is typically represented as a triple $(\text{start}, \text{len}, c)$ where start is the starting position of the source, len is the length of the source and $c$ is the border. For a phrase $f_i$ we denote the position of its border by $\operatorname{border}(f_i)$ and its source by $\operatorname{source}(f_i)$. For example, the string $S = \texttt{abcabcabc\$}$ of length 10 has the LZ77 parse $\texttt{a} \mid \texttt{b} \mid \texttt{c} \mid \texttt{abcabc\$}$ of length 4, which is represented as $(0, 0, \texttt{a})(0, 0, \texttt{b})(0, 0, \texttt{c})(1, 6, \texttt{\$})$.
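
The following naive Python sketch illustrates the greedy parsing rule on the example above; it runs in quadratic time and is only meant to make the definition concrete (practical constructions are far more efficient).

```python
def lz77_parse(s):
    # Greedy LZ77: each phrase is the longest prefix of the remaining text
    # occurring at an earlier position (sources may overlap the phrase),
    # plus one trailing border character. Triples use 1-indexed starts,
    # with (0, 0, c) for a phrase with an empty source.
    phrases, k, n = [], 0, len(s)
    while k < n:
        best_len, best_start = 0, 0
        for start in range(k):                       # candidate sources
            l = 0
            while k + l < n - 1 and s[start + l] == s[k + l]:
                l += 1                               # overlap allowed
            if l > best_len:
                best_len, best_start = l, start + 1  # 1-indexed start
        phrases.append((best_start, best_len, s[k + best_len]))
        k += best_len + 1
    return phrases

# Prints [(0,0,'a'), (0,0,'b'), (0,0,'c'), (1,6,'$')] for the example.
print(lz77_parse("abcabcabc$"))
```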

3 Prefix Search

The prefix search problem is to preprocess a set of strings $D$ such that later, we can find all the strings in the set that are prefixed by some query string. Belazzougui et al. [3] consider the weak prefix search problem, a relaxation of the prefix search problem where we are only required to output the ranks (in lexicographic order) of the strings that are prefixed by the query pattern, and we only require that there are no false negatives. Thus we may answer arbitrarily when no strings are prefixed by the query pattern.

The $O(m)$ term stems from preprocessing the query with an incremental hash function such that the hash of any substring can be obtained in constant time afterwards. Therefore we can do weak prefix search for several substrings of a pattern while paying the $O(m)$ preprocessing only once. We now describe a data structure that builds on the ideas from Lemma ? but obtains the following:

If we know $m$ when building our data structure, we can choose $x$ as a function of $m$ and obtain the corresponding query time with Lemma ?.

Before describing our data structure we need the following definition: the 2-fattest number in a nonempty interval of strictly positive integers is the number in the interval whose binary representation has the highest number of trailing zeroes. For example, the 2-fattest number in $[5, 12]$ is $8$.
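
A minimal sketch of computing 2-fattest numbers, here for half-open intervals $(a, b]$; the function name is our own.

```python
def fattest(a, b):
    # 2-fattest number in (a, b]: keep the common binary prefix of a and b
    # and clear everything from the highest differing bit down. Assumes
    # 0 <= a < b.
    k = (a ^ b).bit_length() - 1   # highest bit where a and b differ
    return (b >> k) << k           # b with its k lowest bits cleared

assert fattest(4, 12) == 8         # 8 = 0b1000 has the most trailing zeros
assert fattest(0, 7) == 4          # largest power of two <= 7
```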

3.1 Data Structure

Let $T$ be the compact trie representing the set of strings $D$ and let $x$ be a positive integer. Denote by $\operatorname{fat}(v)$ the 2-fattest number in the skip interval of a vertex $v$. The fat prefix of $v$ is the length $\operatorname{fat}(v)$ prefix of $\operatorname{str}(v)$. Denote by $F$ the set of fat prefixes induced by the vertices of $T$. The $x$-prefix of $v$ is the shortest prefix of $\operatorname{str}(v)$ whose length is a multiple of $x$ and is in the interval $\operatorname{skip}(v)$. If $v$'s skip interval does not span a multiple of $x$, then $v$ has no $x$-prefix. Let $X$ be the set of $x$-prefixes induced by the vertices of $T$. The data structure is the compact trie $T$ augmented with:

  • A fingerprinting function $\phi$.

  • A dictionary $H_F$ mapping the fingerprints of the strings in $F$ to their associated vertices.

  • A dictionary $H_X$ mapping the fingerprints of the strings in $X$ to their associated vertices.

  • For every vertex $v$ we store the rank in $D$ of the strings represented by the leftmost and rightmost leaves in the subtree of $v$, denoted $l(v)$ and $r(v)$ respectively.

The data structure is similar to the one by Belazzougui et al. [3] except for the dictionary $H_X$, which we use in the first step of our search.

Each vertex of $T$ contributes at most one string to each of $F$ and $X$, so there are $O(|D|)$ strings in each, and thus the total space of the data structure is $O(|D|)$.

Let $a$ be the start of the skip interval of some vertex $v$ and define the pseudo-fat numbers of $v$ to be the set of 2-fattest numbers in the intervals $(a, a + 2^j]$ for $j \geq 0$. We use Lemma ? to find a fingerprinting function $\phi$ that is collision-free for the strings in $F$, the strings in $X$ and all the length $\ell$ prefixes of the strings in $D$ where $\ell$ is a pseudo-fat number in the skip interval of some vertex.

Observe that the range of strings in $D$ that are prefixed by some pattern $p$ of length $m$ is exactly $[l(v), r(v)]$ where $v = \operatorname{locus}(p)$. Answering a weak prefix search query for $p$ comprises two independent steps. The first step is to find a vertex $v$ such that $\operatorname{str}(v)$ is a prefix of $p$ and $m - |\operatorname{str}(v)| < x$. We say that $v$ is in $x$-range of $p$. The next step is to apply a slightly modified version of the search technique from Belazzougui et al. [3] to find the exit vertex for $p$, that is, the deepest vertex $w$ such that $\operatorname{str}(w)$ is a prefix of $p$. Having found the exit vertex we can find the locus in constant time as it is either the exit vertex itself or one of its children.

3.2 Finding an x-range Vertex

We now describe how to find a vertex in $x$-range of $p$. If $m < x$ we simply report that the root of $T$ is in $x$-range of $p$. Otherwise, let $v$ be the root of $T$ and for $i = 1, \ldots, \lfloor m/x \rfloor$ we check if $\phi(p[1, ix])$ is in $H_X$, in which case we update $v$ to be the corresponding vertex. Finally, if $|\operatorname{str}(v)| \geq m$ we report that $v$ is $\operatorname{locus}(p)$, and otherwise we report that $v$ is in $x$-range of $p$. In the former case, we report $[l(v), r(v)]$ as the range of strings in $D$ prefixed by $p$. In the latter case we pass $v$ on to the next step of the algorithm.

We now show that the algorithm is correct when $p$ prefixes a string in $D$. It is easy to verify that the $x$-prefix of $v$ prefixes $p$ at all times during the execution of the algorithm. Assume that $|\operatorname{str}(v)| \geq m$ by the end of the algorithm. We will show that in that case $v = \operatorname{locus}(p)$, i.e., that $v$ is the highest vertex prefixed by $p$. Since $p$ prefixes a string in $D$, the $x$-prefix of $v$ prefixes $p$, and $|\operatorname{str}(v)| \geq m$, then $p$ prefixes $\operatorname{str}(v)$. Since the $x$-prefix of $v$ prefixes $p$, $p$ does not prefix the parent of $v$ and thus $v$ is the highest vertex prefixed by $p$.

Assume now that $|\operatorname{str}(v)| < m$. We will show that $v$ is in $x$-range of $p$. Since $p$ prefixes a string in $D$ and the $x$-prefix of $v$ prefixes $p$, then $\operatorname{str}(v)$ prefixes $p$. Let $p[1, ix]$ be the $x$-prefix of $v$. Since $v$ is returned, either $i = \lfloor m/x \rfloor$ or $\phi(p[1, jx]) \notin H_X$ for all $j > i$. If $i < \lfloor m/x \rfloor$ then $p[1, (i+1)x]$ is not an $x$-prefix of any vertex in $T$. Since $p$ prefixes a string in $D$ this implies that $(i+1)x$ is in the skip interval of $v$, i.e., $(i+1)x \leq |\operatorname{str}(v)|$. This means that $jx \in \operatorname{skip}(v)$ for all $i < j \leq \lfloor m/x \rfloor$. Therefore $\lfloor m/x \rfloor \cdot x \leq |\operatorname{str}(v)|$ and it follows that $m - |\operatorname{str}(v)| < x$. We already proved that $\operatorname{str}(v)$ prefixes $p$ and therefore $v$ is in $x$-range of $p$.

In case $p$ does not prefix any string in $D$ we either report that $v = \operatorname{locus}(p)$ even though it is not, or report that $v$ is in $x$-range of $p$ because $\phi(p[1, ix]) \in H_X$ even though $p[1, ix]$ is not a prefix of $\operatorname{str}(v)$, due to fingerprint collisions. This may lead to a false positive. However, false positives are allowed in the weak prefix search problem.

Given that we can compute the fingerprint of substrings of $p$ in constant time, the algorithm uses $O(m/x)$ time.
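
A sketch of this first search step in Python, assuming hypothetical helpers: T.root, a dictionary T.x_prefix_dict over the fingerprints of the strings in $X$, a per-vertex field str_len storing $|\operatorname{str}(v)|$, and fp(i) returning $\phi(p[1, i])$ in constant time.

```python
def find_x_range_vertex(T, fp, m, x):
    # First step: look up the fingerprints of the pattern prefixes of
    # lengths x, 2x, ..., floor(m/x)*x and keep the deepest hit.
    if m < x:
        return T.root, False            # the root is trivially in x-range
    v = T.root
    for i in range(x, m + 1, x):
        u = T.x_prefix_dict.get(fp(i))  # dictionary over the set X
        if u is not None:
            v = u
    if v.str_len >= m:
        return v, True                  # v is locus(p): report [l(v), r(v)]
    return v, False                     # v is in x-range of p
```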

3.3 From x-range to Exit Vertex

We now consider how to find the exit vertex of $p$, hereafter denoted $w$. The algorithm is similar to the one presented in Belazzougui et al. [3] except that we support starting the search not only from the root, but from any ancestor of $w$.

Let $v_0$ be any ancestor of $w$ from which we start the search, let $b$ be the smallest power of two greater than $m - |\operatorname{str}(v_0)|$ and let $a$ be the largest multiple of $b$ no greater than $|\operatorname{str}(v_0)|$. The search progresses by iteratively halving the search interval while using $H_F$ to maintain a candidate for the exit vertex and to decide in which of the two halves to continue the search.

Let $v$ be the candidate for the exit vertex and let $l$ and $r$ be the left and right boundaries of our search interval. Initially $v = v_0$, $l = a$ and $r = a + b$. When $l = r$, the search terminates and reports $v$. In each iteration, we consider the mid $q$ of the interval, the 2-fattest number in $(l, r]$, and update the interval to either $(l, q - 1]$ or $(q', r]$ with $q' \geq q$. There are three cases:

  1. $q$ is out of bounds:

    1. If $q \leq |\operatorname{str}(v)|$, set $l$ to $q$.

    2. If $q > m$, set $r$ to $q - 1$.

  2. $\phi(p[1, q])$ is in $H_F$; let $u$ be the corresponding vertex, i.e. $u = H_F(\phi(p[1, q]))$.

    1. If $|\operatorname{str}(u)| \leq m$, set $v$ to $u$ and $l$ to $|\operatorname{str}(u)|$.

    2. If $|\operatorname{str}(u)| > m$, report $u$ and terminate.

  3. $\phi(p[1, q])$ is not in $H_F$ and thus $p[1, q]$ is not in $F$; set $r$ to $q - 1$.

Observe that we are guaranteed that all fingerprint comparisons are collision-free in case $p$ prefixes a string in $D$. This is because the lengths of the prefixes whose fingerprints we consider are all either 2-fattest or pseudo-fat in the skip interval of $w$ or one of its ancestors, and we use a fingerprinting function that is collision-free for these strings.
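
A simplified Python sketch of this search, reusing fattest from above and the same hypothetical helpers (a dictionary fat_dict over $F$, the field str_len); it starts the interval directly at $(|\operatorname{str}(v)|, m]$ and therefore omits the out-of-bounds cases that the aligned interval requires.

```python
def find_exit_vertex(T, fp, m, v):
    # Binary search over prefix lengths with 2-fattest midpoints,
    # maintaining the invariant that str(v) is a prefix of the pattern.
    l, r = v.str_len, m                 # search interval (l, r]
    while l < r:
        q = fattest(l, r)               # 2-fattest number in (l, r]
        u = T.fat_dict.get(fp(q))       # vertex with fat prefix p[1..q]?
        if u is None:
            r = q - 1                   # mismatch: exit vertex is higher up
        elif u.str_len > m:
            return u                    # u is already the locus of p
        else:
            v, l = u, u.str_len         # str(u) prefixes p; descend
    return v                            # the exit vertex of p
```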

Correctness

We now show that the invariant $l \leq |\operatorname{str}(w)| \leq r$ is satisfied and that $\operatorname{str}(v)$ is a prefix of $p$ before and after each iteration. After $O(\lg b)$ iterations $l = r$, and thus $|\operatorname{str}(w)| = l = |\operatorname{str}(v)|$ and therefore $v = w$. Initially $v$ is an ancestor of $w$ and thus $\operatorname{str}(v)$ is a prefix of $p$, and $l \leq |\operatorname{str}(w)| \leq r$, so the invariant is true. Now assume that the invariant is true at the beginning of some iteration and consider the possible cases:

  1. $q$ is out of bounds:

    1. If $q \leq |\operatorname{str}(v)|$ then, because $|\operatorname{str}(v)| \leq |\operatorname{str}(w)|$, setting $l$ to $q$ preserves the invariant.

    2. If $q > m$ then, since $|\operatorname{str}(w)| \leq m < q$, setting $r$ to $q - 1$ preserves the invariant.

  2. $\phi(p[1, q])$ is in $H_F$; let $u = H_F(\phi(p[1, q]))$.

    1. If $|\operatorname{str}(u)| \leq m$ then $\operatorname{str}(u)$ is a prefix of $p$ and thus $l \leq |\operatorname{str}(u)| \leq |\operatorname{str}(w)|$, so setting $v$ to $u$ and $l$ to $|\operatorname{str}(u)|$ preserves the invariant.

    2. If $|\operatorname{str}(u)| > m$ yet $p[1, q]$ prefixes $\operatorname{str}(u)$, then $u$ is the locus of $p$.

  3. $\phi(p[1, q])$ is not in $H_F$, and thus $p[1, q]$ is not in $F$. As we are not in any of the out of bounds cases we have $|\operatorname{str}(v)| < q \leq m$. Thus, either $|\operatorname{str}(w)| < q$ and setting $r$ to $q - 1$ preserves the invariant. Otherwise $|\operatorname{str}(w)| \geq q$ and thus $q$ must be in the skip interval of some vertex $u$ on the path from $v$ to $w$, excluding $v$. But $\operatorname{skip}(u)$ is entirely included in $(l, r]$ and because $q$ is 2-fattest in $(l, r]$ it is also 2-fattest in $\operatorname{skip}(u)$ (footnote 2). It follows that $p[1, q]$ is the fat prefix of $u$ and thus in $F$, which contradicts $\phi(p[1, q]) \notin H_F$, and thus the invariant is preserved.

Thus if $p$ prefixes a string in $D$ we find either the exit vertex $w$ or the locus of $p$. In the former case the locus of $p$ is the child of $w$ identified by the character $p[|\operatorname{str}(w)| + 1]$. Having found the locus $v$ we report $[l(v), r(v)]$ as the range of strings in $D$ prefixed by $p$. In case $p$ does not prefix any strings in $D$, the fact that the fingerprint of a prefix of $p$ matches the fingerprint of some fat prefix in $F$ does not guarantee equality of the strings. There are two possible consequences of this. Either the search successfully finds what it believes to be the locus of $p$ even though $p$ prefixes no string in $D$, in which case we report a false positive. Otherwise, there is no child identified by the character $p[|\operatorname{str}(w)| + 1]$, in which case we can correctly report that no strings in $D$ are prefixed by $p$, a true negative. Recall that false positives are allowed as we are considering the weak prefix search problem.

Complexity

The size of the interval is halved in each iteration, thus we do at most $O(\lg(m - |\operatorname{str}(v_0)|))$ iterations, where $v_0$ is the vertex from which we start the search. If we use the technique from the previous section to find a starting vertex in $x$-range of $p$, we do $O(\lg x)$ iterations. Each iteration takes constant time. Note that if $p$ does not prefix a string in $D$ we may have fingerprint collisions and we may be given a starting vertex $v_0$ such that $\operatorname{str}(v_0)$ does not prefix $p$. This can lead to a false positive, but we still have $m - |\operatorname{str}(v_0)| < x$ and therefore the time complexity remains $O(\lg x)$.

3.4 Multiple Substrings

In order to answer weak prefix search queries for substrings of a pattern $P$ of length $m$, we first preprocess $P$ in $O(m)$ time such that we can compute the fingerprint of any substring of $P$ in constant time using Lemma ?. We can then answer a weak prefix search query for any substring of $P$ in $O(\lg x)$ time, plus the cost of finding an $x$-range vertex, using the techniques described in the previous sections. The total time for a batch of queries is therefore $O(m)$ plus these per-query costs.

4 Distinguishing Occurrences

The following sections describe our compressed index, consisting of three independent data structures: one that finds long primary occurrences, one that finds short primary occurrences and one that finds secondary occurrences.

Let $Z$ be the LZ77 parse of length $z$ representing the string $S$ of length $n$. If $f_i = S[a, b]$ is a phrase of $Z$ then any substring of $S[a, b - 1]$ is a secondary substring of $S$. These are the substrings of $S$ that do not contain any phrase borders. On the other hand, a substring $S[c, d]$ is a primary substring of $S$ when there is some phrase $f_i$ where $c \leq \operatorname{border}(f_i) \leq d$; these are the substrings that contain one or more phrase borders. Any substring of $S$ is either primary or secondary. A primary substring that matches a query pattern $P$ is a primary occurrence of $P$, while a secondary substring that matches $P$ is a secondary occurrence [26].

5 Long Primary Occurrences

For simplicity, we assume that the data structure given in Lemma ? not only solves the weak prefix search problem, but also answers correctly when the query pattern does not prefix any of the indexed strings. Later, in Section 5.3, we will see how to lift this assumption. The following data structure and search algorithm is a variation of the classical bidirectional search technique for finding primary occurrences [26].

5.1 Data Structure

For every phrase $f_i$ the strings ending at positions $\operatorname{border}(f_i), \operatorname{border}(f_i) + 1, \ldots, \operatorname{border}(f_i) + x - 1$ and starting at position $\operatorname{border}(f_{i-1}) + 1$ are relevant substrings, unless there is some longer relevant substring ending at the same position. If $r$ is a relevant substring ending at position $e$ then the string $S[e + 1, n]$ is the associated suffix. There are at most $zx$ relevant substrings of $S$ and equally many associated suffixes. The primary index is comprised of the following:

  • A prefix search data structure $R$ on the set of reversed relevant substrings.

  • A prefix search data structure $Q$ on the set of associated suffixes.

  • An orthogonal range reporting data structure $G$ on the $[1, zx] \times [1, zx]$ grid. Consider a relevant substring $r$ ending at position $e$. Let $h$ denote the rank of $\operatorname{rev}(r)$ in the lexicographical order of the reversed relevant substrings, and let $k$ denote the rank of its associated suffix in the lexicographical order of the associated suffixes. Then $(h, k)$ is a point in $G$ and along with it we store the pair $(b, e)$, where $b$ is the position of the rightmost phrase border contained in $r$.

Note that every point in $G$ is induced by some relevant substring $r$ and its associated suffix $s$. If some prefix of $P$ is a suffix of $r$ and the remaining suffix of $P$ is a prefix of $s$, then $P$ occurs in $S$ and we can compute its exact location from $e$ and the length of the prefix.

5.2 Searching

The data structure can be used to find the primary occurrences of a pattern $P$ of length $m$ when $m > x$. Consider the prefix-suffix pairs $(P[1, ix], P[ix + 1, m])$ for $i = 1, \ldots, \lfloor m/x \rfloor$ and the pair $(P, \varepsilon)$ in case $m$ is not a multiple of $x$. For each such pair, we do a prefix search for $\operatorname{rev}(P[1, ix])$ in $R$ and for $P[ix + 1, m]$ in $Q$, respectively. If either of these two searches reports no matches, we move on to the next pair. Otherwise, let $[a_1, b_1]$ and $[a_2, b_2]$ be the ranges reported from the search in $R$ and $Q$ respectively. Now we do a range reporting query on $G$ for the rectangle $[a_1, b_1] \times [a_2, b_2]$. For each point reported, let $(b, e)$ be the pair stored with the point. We report $e - ix + 1$ as the starting position of a primary occurrence of $P$ in $S$.

Finally, in case $m$ is not a multiple of $x$, we need to also check the pair $(P, \varepsilon)$. We search for $\operatorname{rev}(P)$ in $R$ and for $\varepsilon$, which trivially matches everything, in $Q$. If the search for $\operatorname{rev}(P)$ reports no match we stop. Otherwise, we do a range reporting query as before. For each point reported, let $(b, e)$ be the pair stored with the point and let $s = e - m + 1$. To check that the occurrence has not been reported before we do as follows. Let $j$ be the smallest positive integer such that $s + jx - 1 \geq b$. If $j > \lfloor m/x \rfloor$ we report $s$ as the starting position of a primary occurrence.
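
A Python sketch of the search loop under assumed helper names; it covers only the splits at multiples of $x$ and omits the final $(P, \varepsilon)$ pair and the verification of Section 5.3.

```python
def long_primary_occurrences(P, x, R, Q, G):
    # R, Q: weak prefix search over reversed relevant substrings and
    # associated suffixes; G: range reporting, each point storing (b, e)
    # with e the end position in S of the relevant substring.
    starts = []
    for i in range(x, len(P), x):              # splits at multiples of x
        r1 = R.weak_prefix_search(P[:i][::-1]) # rank range or None
        r2 = Q.weak_prefix_search(P[i:])
        if r1 is None or r2 is None:
            continue                           # no match for this split
        for (b, e) in G.report(r1, r2):
            starts.append(e - i + 1)           # occurrence start in S
    return starts
```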

Correctness

We claim that the reported occurrences are exactly the primary occurrences of $P$. We first prove that all primary occurrences are reported correctly. Let $S[s, s + m - 1]$ be a primary occurrence. As it is a primary occurrence, there must be some phrase border inside it; let $b$ be the leftmost phrase border with $s \leq b \leq s + m - 1$. Let $j$ be the smallest positive integer such that $s + jx - 1 \geq b$. There are two cases: $j \leq \lfloor m/x \rfloor$ and $j > \lfloor m/x \rfloor$. If $j \leq \lfloor m/x \rfloor$ then $P[1, jx]$ is a suffix of the relevant substring ending at $s + jx - 1$. Such a relevant substring exists since $b \leq s + jx - 1 \leq b + x - 1$, and it extends back to before $s$ because the occurrence contains no phrase border before $b$. Thus its reverse prefixes a string indexed by $R$, while $P[jx + 1, m]$ is a prefix of the associated suffix. Therefore, the respective ranks in $R$ and $Q$ are plotted as a point in $G$ which stores the pair $(b', e)$ with $e = s + jx - 1$. We will find this point when considering the prefix-suffix pair $(P[1, jx], P[jx + 1, m])$, and correctly report $e - jx + 1 = s$ as the starting position of a primary occurrence. If $j > \lfloor m/x \rfloor$ then $P$ is a suffix of the relevant substring ending at $s + m - 1$. Such a relevant substring exists since $b \leq s + m - 1 \leq b + x - 1$. Thus its reverse prefixes a string indexed by $R$ and trivially $\varepsilon$ is a prefix of the associated suffix. It follows as before that the ranks are plotted as a point in $G$ storing the pair $(b', e)$ with $e = s + m - 1$ and that we find this point when considering the pair $(P, \varepsilon)$. When considering $(P, \varepsilon)$ we report $s$ as the starting position of a primary occurrence exactly when $j > \lfloor m/x \rfloor$, and thus $S[s, s + m - 1]$ is correctly reported.

We now prove that all reported occurrences are in fact primary occurrences. Assume that we report $e - ix + 1$ for some $i$ and $e$ as the starting position of a primary occurrence in the first part of the procedure. Then there exist strings $r$ and $t$ indexed by $R$ and $Q$ respectively such that $r$ is suffixed by $P[1, ix]$ and $t$ is prefixed by $P[ix + 1, m]$. Therefore $e - ix + 1$ is the starting position of an occurrence of $P$. The string $r$ is a relevant substring and therefore there exists a border in the interval $[e - x + 1, e]$. Since the occurrence spans $[e - ix + 1, e - ix + m] \supseteq [e - x + 1, e]$, the occurrence contains the border and it is therefore a primary occurrence. If we report $s = e - m + 1$ for some $e$ as the starting position of a primary occurrence in the second part of the procedure, then $\operatorname{rev}(P)$ is a prefix of a string indexed by $R$. It follows immediately that $s$ is the starting point of an occurrence. Since the relevant substring ends at $e = s + m - 1$, by the definition of relevant substrings there is a border in the interval $[e - x + 1, e]$. Therefore the occurrence contains the border and is primary.

Complexity

We now consider the time complexity of the algorithm described. First we will argue that any primary occurrence is reported at most once and that the search finds at most two points in $G$ identifying it. Let $S[s, s + m - 1]$ be a primary occurrence reported when we considered the prefix-suffix pair $(P[1, jx], P[jx + 1, m])$, with $b$ and $j$ as in the proof of correctness. None of the pairs $(P[1, ix], P[ix + 1, m])$ where $i < j$ will identify this occurrence, as there is no phrase border in $[s + (i - 1)x, s + ix - 1]$ and hence no relevant substring ending at position $s + ix - 1$. None of the pairs where $i > j$ will identify this occurrence either. This is the case since $s + ix - 1 > b + x - 1$, and from the definition of relevant substrings it follows that a relevant substring starts right after the phrase border preceding the borders it contains. Thus there are no relevant substrings that end after $b + x - 1$ and start before $b$. Therefore, only one of the pairs $(P[1, ix], P[ix + 1, m])$ for $i = 1, \ldots, \lfloor m/x \rfloor$ identifies the occurrence. If $m$ is not a multiple of $x$ then we might also find the occurrence when considering the pair $(P, \varepsilon)$, but we do not report it since $j \leq \lfloor m/x \rfloor$.

After preprocessing $P$ in $O(m)$ time, we can do the prefix searches in total time $O(m + (m/x) \lg x)$ where $x$ is a positive integer, by Lemma ?. Using the range reporting data structure by Chan et al. [8], each range reporting query takes $O(\lg\lg n + k \lg^{\epsilon} z)$ time, where $0 < \epsilon < 1$ is a constant and $k$ is the number of points reported. As each such point in one range reporting query corresponds to the identification of a unique primary occurrence of $P$, which happens at most twice for every occurrence, we charge this to reporting the occurrences. The total time to find all primary occurrences is thus $O(m + (m/x)(\lg x + \lg\lg n) + occ \lg^{\epsilon} z)$ where $occ$ is the number of primary and secondary occurrences of $P$.

5.3 Prefix Search Verification

The prefix search data structure from Lemma ? gives no guarantees of correct answers when the query pattern does not prefix any of the indexed strings. If the prefix search gives false positives, we may end up reporting occurrences of $P$ that are not actually there. We show how to solve this problem after introducing a series of tools that we will need.

Straight Line Programs

A straight line program (SLP) for a string $S$ is a context-free grammar generating the single string $S$.

The construction from Rytter [38] produces a balanced grammar for every consecutive substring of length $n/z$ of $S$ after a preprocessing step transforms the parse such that no compression element is longer than $n/z$. These grammars are then connected to form a single balanced grammar of height $O(\lg n)$, which immediately yields extraction of any substring of length $\ell$ in time $O(\ell + \lg n)$. We give a simple solution to reduce this to $O(\ell + \lg(n/z))$ that also supports computation of the fingerprint of a substring in $O(\lg(n/z))$ time.

Assume for simplicity that $n$ is a multiple of $z$. We construct the SLP producing $S$ from Lemma ?. Along with every non-terminal of the SLP we store the size and fingerprint of its expansion. Let $S_1, \ldots, S_z$ be the consecutive length $n/z$ substrings of $S$, i.e., $S = S_1 S_2 \cdots S_z$. We store the balanced grammar producing $S_i$ along with the fingerprint $\phi(S[1, i \cdot n/z])$ at index $i$ in a table $B$.

Now we can extract $S_i$ in time $O(n/z)$ and any substring of $S_i$ of length $\ell$ in time $O(\ell + \lg(n/z))$. Also, we can compute the fingerprint of any prefix of $S$ in $O(\lg(n/z))$ time. We can easily do a constant time mapping from a position in $S$ to the grammar in $B$ producing the substring covering that position and the corresponding position inside the substring. But then any fingerprint $\phi(S[i, j])$ can be computed in time $O(\lg(n/z))$. Now consider a substring $S[i, j]$ that starts in $S_a$ and ends in $S_b$. We extract $S[i, j]$ in time $O(j - i + \lg(n/z))$ by extracting the appropriate suffix of $S_a$, all of $S_c$ for $a < c < b$ and the appropriate prefix of $S_b$. Each of the fingerprints stored by the data structure can be computed in constant time after preprocessing $S$ in $O(n)$ time. Thus table $B$ is filled in $O(n)$ time, and by Lemma ? the SLPs stored in $B$ use a total of $O(z \lg(n/z))$ space and $O(n)$ construction time.
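
A Python sketch of the blocked extraction, assuming each table entry B[k] exposes an expand(a, b) operation on the $k$-th length-$\tau$ block (with $\tau = n/z$); the names are illustrative.

```python
def extract(B, tau, i, j):
    # Extract S[i..j] (0-indexed, inclusive) by stitching a suffix of the
    # first block, whole middle blocks, and a prefix of the last block.
    first, last = i // tau, j // tau
    if first == last:
        return B[first].expand(i % tau, j % tau)
    parts = [B[first].expand(i % tau, tau - 1)]
    parts += [B[k].expand(0, tau - 1) for k in range(first + 1, last)]
    parts.append(B[last].expand(0, j % tau))
    return "".join(parts)
```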

Verification of Fingerprints

We need the following lemma for the verification.

Verification Technique

Our verification technique is identical to the one given by Gagie et al. [20] and involves a simple modification of the search for long primary occurrences. By using Lemma ? instead of bookmarking [20] for extraction and fingerprinting, and because we only need to verify $O(m/x)$ strings, the verification procedure takes $O(m + (m/x) \lg(n/z))$ time and uses $O(z \lg(n/z))$ space.

Consider the string $S$ of length $n$ that we wish to index and let $Z$ be the parse of $S$. The verification data structure is given by Lemma ?. Consider the prefix search data structure $R$ as given in Section 5.1 and let $\phi$ be the fingerprinting function used by the prefix search; the case for $Q$ is symmetric. We alter the search for primary occurrences such that it first does the prefix searches, then verifies the results and discards false positives before moving on to do the range reporting queries on the verified results. We also modify $\phi$ using Lemma ? to be collision-free for all substrings of the indexed strings whose length is a power of two.

Let $s_1, \ldots, s_q$ be all the query strings for which the prefix search found a locus candidate, let the candidates be $v_1, \ldots, v_q$ and let $u_i$ be $\operatorname{str}(v_i)$. Assume that $|s_1| > |s_2| > \ldots > |s_q|$, and let $\operatorname{2-suf}(s)$ and $\operatorname{2-pre}(s)$ denote the fingerprints using $\phi$ of the suffix and prefix, respectively, of length $2^{\lfloor \lg |s| \rfloor}$ of some string $s$. The verification progresses in iterations over the candidates in order of increasing length. Initially, let $i = q$, and for each iteration do as follows:

  1. $\operatorname{2-pre}(s_i) \neq \operatorname{2-pre}(u_i)$ or $\operatorname{2-suf}(s_i) \neq \operatorname{2-suf}(u_i[1, |s_i|])$: Discard $v_i$ and set $i$ to $i - 1$.

  2. $\operatorname{2-pre}(s_i) = \operatorname{2-pre}(u_i)$ and $\operatorname{2-suf}(s_i) = \operatorname{2-suf}(u_i[1, |s_i|])$; let $v_j$ be the last vertex considered that was not discarded (if there is none, set $j$ to $i$ and $i$ to $i - 1$).

    1. $\operatorname{2-suf}(u_j[1, |s_j|])$ equals the fingerprint of the length $2^{\lfloor \lg |s_j| \rfloor}$ suffix of $u_i[1, |s_i|]$ and $\operatorname{2-pre}(u_j[1, |s_j|])$ equals the fingerprint of the corresponding substring of $u_i[1, |s_i|]$: set $j$ to $i$ and $i$ to $i - 1$.

    2. Otherwise: discard $v_i$ and set $i$ to $i - 1$.

  3. $i = 0$: If all vertices have been discarded, report no matches. Otherwise, let $v_j$ be the last vertex considered that was not discarded. Report all non-discarded vertices $v_{j'}$ where $|s_{j'}|$ is no longer than the longest common suffix of $u_j[1, |s_j|]$ (extracted using Lemma ?) and $s_j$ as verified and discard the rest.

Consider the correctness and complexity of the algorithm. In case 1, clearly, $u_i[1, |s_i|]$ does not match $s_i$ and thus $v_i$ must be a false positive. Now observe that because $s_j$ is a suffix of $s_i$ for $j > i$, it is also a suffix of $u_i[1, |s_i|]$ whenever $v_i$ is a true positive. Thus in case 2 (b), if the fingerprints do not match then $v_i$ must be a false positive. In case 2 (a), both $v_i$ and $v_j$ may still be false positives, yet by Lemma ?, $u_j[1, |s_j|]$ is a suffix of $u_i[1, |s_i|]$ because the power-of-two length fingerprints match and $\phi$ is collision-free for these substrings of the indexed strings. Finally, in case 3, $v_j$ is a true positive if and only if $s_j = u_j[1, |s_j|]$. But any other non-discarded vertex $v_{j'}$ is also a true positive if and only if $u_{j'}[1, |s_{j'}|]$ and $s_j$ share a length $|s_{j'}|$ suffix, because $s_{j'}$ is a suffix of $s_j$ and $u_{j'}[1, |s_{j'}|]$ is a suffix of $u_j[1, |s_j|]$.

The algorithm does $O(m/x)$ iterations, and fingerprints of substrings of $P$ can be computed in constant time after the $O(m)$ preprocessing. Every vertex $v$ represents one or more substrings of $S$. If we store the starting index in $S$ of one of these substrings in $v$ when constructing the tries, we can compute the fingerprint of any substring of $\operatorname{str}(v)$ by computing the fingerprint of the corresponding substring of $S$, using the stored starting index. By Lemma ?, these fingerprint computations take $O(\lg(n/z))$ time each, and because there are $O(m/x)$ of them the total time complexity of the algorithm is $O(m + (m/x) \lg(n/z))$.

6 Short Primary Occurrences

We now describe a simple data structure that can find the primary occurrences of $P$ in $O(m + occ)$ time using $O(zx)$ space whenever $m \leq x$, where $x$ is a positive integer.

Let $Z$ be the LZ77 parse of the string $S$ of length $n$. Let $b_1, \ldots, b_z$ be the phrase border positions and define $C$ to be the union of the strings $S[b_i - j, \min(b_i + x - 1, n)]$ where $0 \leq j < x$, for $i = 1, \ldots, z$. There are at most $zx$ such strings, each of length at most $2x - 1$, and they are all suffixes of the length $2x - 1$ substrings of $S$ starting $x - 1$ positions before each border position. We store these substrings along with the compact trie $T(C)$ over the strings in $C$. The edge labels of $T(C)$ are compactly represented by storing references into one of the substrings. Every leaf stores the starting position in $S$ of the string it represents and the position of the leftmost border it contains.

The combined size of $T(C)$ and the substrings we store is $O(zx)$, and we simply search for $P$ by navigating vertices using perfect hashing [18] and matching edge labels character by character. Now either $\operatorname{locus}(P) = \bot$, in which case there are no primary occurrences of $P$ in $S$; otherwise $\operatorname{locus}(P) = v$ for some vertex $v$, and thus every leaf in the subtree of $v$ represents a substring of $S$ that is prefixed by $P$. By using the indices stored with the leaves, we can determine the starting position of each occurrence and whether it is primary or secondary. Because each of the strings in $C$ starts at a different position in $S$, we will only find an occurrence once. Also, it is easy to see that we will find all primary occurrences because of how the strings in $C$ are chosen. It follows that the time complexity is $O(m + occ)$ where $occ$ is the number of primary and secondary occurrences.
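
A Python sketch of the search in the short-pattern trie, with hypothetical helpers children, edge_label and leaves_below.

```python
def short_primary_occurrences(C, P):
    # Match P in the compact trie C vertex by vertex: a dictionary lookup
    # on the first character, then the edge label character by character.
    v, depth = C.root, 0
    while depth < len(P):
        u = v.children.get(P[depth])
        if u is None:
            return []                    # locus(P) does not exist
        label = C.edge_label(v, u)       # a reference into a stored substring
        t = min(len(label), len(P) - depth)
        if P[depth:depth + t] != label[:t]:
            return []
        depth, v = depth + t, u
    # every leaf below locus(P) = v is an occurrence of P
    return [(leaf.start, leaf.border) for leaf in C.leaves_below(v)]
```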

7 The Secondary Index

Let $Z$ be the LZ77 parse of length $z$ representing the string $S$ of length $n$. We find the secondary occurrences by applying the most recent range reporting data structure by Chan et al. [8] to the technique described by Kärkkäinen and Ukkonen [26], which is inspired by the ideas of Farach and Thorup [13].

Let $o_1, o_2, \ldots$ be the starting positions of the occurrences of $P$ in $S$ ordered increasingly. Assume that $S[o_i, o_i + m - 1]$ is a secondary occurrence contained in a phrase $f = S[a, b]$. Then by definition, $S[o_i, o_i + m - 1]$ is a substring of the prefix $S[a, b - 1]$ of the phrase, and there must be an occurrence of $P$ in the source of that phrase. More precisely, let $S[a', b']$ be the source of the phrase; then $S[a' + o_i - a, a' + o_i - a + m - 1]$ is an occurrence of $P$, i.e., $o_{i'} = a' + o_i - a$ for some $i' < i$. We say that the occurrence at $o_{i'}$, which may be primary or secondary, is the source occurrence of the secondary occurrence at $o_i$ given the LZ77 parse of $S$. Thus every secondary occurrence has a source occurrence. Note that it follows from the definition that no primary occurrence has a source occurrence.

We find the secondary occurrences as follows: build a range reporting data structure $E$ on the $[1, n] \times [1, n]$ grid, and if $f$ is a phrase with source $S[a', b']$ we plot the point $(a', b')$ and along with it we store the phrase start.

Now for each primary occurrence $S[s, s + m - 1]$ found by the primary index, we query for the rectangle $[1, s] \times [s + m - 1, n]$. The points returned are exactly the sources that fully contain the occurrence. For each point $(a', b')$ with associated phrase start $t$ reported, we report an occurrence at position $t + (s - a')$ and recurse on it to find all the occurrences having it as source.

Because no primary occurrence has a source occurrence, while all secondary occurrences have one, we will find exactly the secondary occurrences.
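
A Python sketch of the recursive reporting, assuming a helper E.report_containing(s, e) that returns the stored sources $[a, b]$ with $a \leq s$ and $b \geq e$ together with their phrase starts.

```python
def report_secondary(s, m, E, report):
    # Each stored source [a, b] with a <= s and s + m - 1 <= b copies the
    # occurrence S[s, s+m-1] into its phrase, which starts at position t.
    for (a, b, t) in E.report_containing(s, s + m - 1):
        copy = t + (s - a)                    # start of the copied occurrence
        report(copy)
        report_secondary(copy, m, E, report)  # and the copies of the copy
```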

The range reporting structure is built using Lemma ? and uses $O(z \lg\lg z)$ space. Exactly one range reporting query is done for each primary and secondary occurrence, each taking $O(\lg\lg n + k)$ time where $k$ is the number of points reported. Each reported point identifies a secondary occurrence, so the total time is $O(occ \lg\lg n)$.

8 The Compressed Index

We obtain our final index by combining the primary index, the verification data structure and the secondary index. We use the transformed LZ77 parse generated by Lemma ? when building our primary index. Therefore no phrase will be longer than $n/z$, and therefore any primary occurrence of $P$ will have a prefix $P[1, i]$ with $i \leq n/z$ that is a suffix of some phrase. It then follows that we need only consider the multiples $ix$ for $i = 1, \ldots, \lceil (n/z)/x \rceil$ when searching for long primary occurrences. This yields the following complexities:

  • $O(m + ((n/z)/x)(\lg x + \lg\lg n) + occ \lg^{\epsilon} z)$ time and $O(zx)$ space for the index finding long primary occurrences, where $x$ is a positive integer and $0 < \epsilon < 1$ is a constant.

  • $O(m + occ)$ time and $O(zx)$ space for the index finding short primary occurrences.

  • $O(m + (m/x) \lg(n/z))$ time and $O(z \lg(n/z))$ space for the verification data structure.

  • $O(occ \lg\lg n)$ time and $O(z \lg\lg z)$ space for the secondary index.

If we fix $x$ at $\Theta(\lg(n/z))$ we have $O(zx) = O(z \lg(n/z))$, in which case we obtain the following trade-off simply by combining the above complexities.

We note that none of our data structures assume a constant sized alphabet, and thus Thm. ? holds for any alphabet size.

8.1 Trade-offs

Thm. ? gives rise to a series of interesting time-space trade-offs.

Each of the corollaries below is obtained from Thm. ? by instantiating the parameter $x$ (and, where relevant, the constant $\epsilon$ and the choice of range reporting structure) with different values.

The leading term in the time complexity of Cor. ? is $O(m)$ whenever $z = O(n/\lg n)$, i.e., for all strings that are compressible by at least a logarithmic fraction. For constant sized alphabets we have $z = O(n/\lg n)$ for all strings [36], and thus Thm. ? (i) follows immediately. Cor. ? matches the previous best space bound but obtains a leading term of $O(m)$ for any polynomial compression rate. Thm. ? (ii) is a weaker version of this because it assumes a constant sized alphabet, and it therefore follows immediately. Cor. ? matches the space and the time for reporting occurrences of the previous best bounds by Gagie et al. [20], but with a leading term of $O(m)$ compared to a leading term of $O(m \lg m)$. The remaining corollaries show how to guarantee the fast query times with leading term $O(m)$ without the assumptions on the compression ratio that the others require, but at the cost of increased space.

8.2 Preprocessing

We now consider the preprocessing time of the data structure. Let $Z$ be the LZ77 parse of the string $S$ of length $n$ and let $R$ and $Q$ be the compact tries used in the index for long primary occurrences. The compact trie $R$ indexes substrings of $S$ with overall length $O(nx)$. Thus we can construct the trie in $O(nx)$ time by sorting the strings and successively inserting them in their sorted order [2]. The compact trie $Q$ indexes suffixes of $S$ and can be built in $O(n)$ time using $O(n)$ space [12]. The index for short primary occurrences is a generalized suffix tree over strings of length $O(x)$ with total length $O(zx)$ and is therefore also built in $O(zx)$ time. The dictionaries used by the prefix search data structures and for trie navigation contain $O(zx)$ keys and are built in expected linear time using perfect hashing [18]. The range reporting data structures used by the primary and secondary index are built in expected linear time in the number of points using Lemma ?.

Building the SLP for our verification data structure takes $O(n)$ time using Lemma ?, and finding an appropriate fingerprinting function takes expected $O(n)$ time using Lemma ?. The prefix search data structures $R$ and $Q$ also require that $\phi$ is collision-free for the $x$-prefixes, fat prefixes and the prefixes with pseudo-fat lengths. There are at most $O(zx \lg n)$ such prefixes [3]. If we compute these fingerprints incrementally while doing a traversal of the tries, we expect all the fingerprints to be unique. We simply check this by sorting the fingerprints in linear time and checking for duplicates by doing a linear scan. If we choose a prime $p \geq n^4$ with Lemma ? then the probability of a collision between any two strings is at most $n/p \leq 1/n^3$, and by a union bound over the $O(n^2)$ possible collisions the probability that $\phi$ is collision-free is at least $1 - 1/n$. Thus the expected time to find our required fingerprinting function is $O(n)$.

All in all, the preprocessing time for our combined index is therefore expected $O(nx)$.

Footnotes

  1. A preliminary version of this paper appeared in the Proceedings of the 28th Annual symposium on Combinatorial Pattern Matching, 2017
  2. If $q$ is the 2-fattest number in $(l, r]$ and $(c, d] \subseteq (l, r]$ contains $q$, then $q$ is also the 2-fattest number in $(c, d]$.

References

  1. gzip.
    www.gzip.org.

  2. A new efficient radix sort.
    Arne Andersson and Stefan Nilsson. In Proc. 35th FOCS, 1994.
  3. Fast prefix search in little space, with applications.
    Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. In Proc. 18th ESA, pages 427–438, 2010.
  4. Composite repetition-aware data structures.
    Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. In Proc. 26th CPM, pages 26–39, 2015.
  5. Queries on lz-bounded encodings.
    Djamal Belazzougui, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, 2015 Data Compression Conference, DCC 2015, Snowbird, UT, USA, April 7-9, 2015, pages 83–92. IEEE, 2015.
  6. Time-space trade-offs for longest common extensions.
    Philip Bille, Inge Li Gørtz, Benjamin Sach, and Hjalte Wedel Vildhøj. In Proc. 23rd CPM, 2012.
  7. Real-time streaming string-matching.
    Dany Breslauer and Zvi Galil. ACM Trans. Algorithms, 10(4):22:1–22:12, 2014.
  8. Orthogonal range searching on the ram, revisited.
    Timothy M. Chan, Kasper Green Larsen, and Mihai Patrascu. In Proc. 27th SOCG, pages 1–10, 2011.
  9. The smallest grammar problem.
    Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. IEEE Trans. Information Theory, 51(7):2554–2576, 2005.
  10. Universal indexes for highly repetitive document collections.
    Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, and Gonzalo Navarro. Inf. Syst., 61:1–23, 2016.
  11. Improved grammar-based compressed indexes.
    Francisco Claude and Gonzalo Navarro. In Proc. 19th SPIRE, pages 180–192, 2012.
  12. Optimal suffix tree construction with large alphabets.
    M. Farach. In Proc. 38th FOCS, pages 137–143, 1997.
  13. String matching in lempel-ziv compressed strings.
    Martin Farach and Mikkel Thorup. Algorithmica, 20(4):388–404, 1998.
  14. Opportunistic data structures with applications.
    P. Ferragina and G. Manzini. In Proc. 41st FOCS, pages 390–398, 2000.
  15. An experimental study of an opportunistic index.
    Paolo Ferragina and Giovanni Manzini. In Proc. 12th SODA, pages 269–278, 2001.
  16. Indexing compressed text.
    Paolo Ferragina and Giovanni Manzini. J. ACM, 52(4):552–581, 2005.
  17. Compressed representations of sequences and full-text indexes.
    Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. ACM Trans. Algorithms, 3(2), 2007.
  18. Storing a sparse table with O(1) worst case access time.
    Michael L. Fredman, János Komlós, and Endre Szemerédi. J. ACM, 31(3):538–544, 1984.
  19. A faster grammar-based self-index.
    Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. In Proc. 6th LATA, pages 240–251, 2012.
  20. LZ77-based self-indexing with faster pattern matching.
    Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J Puglisi. In Proc. 11th LATIN, pages 731–742, 2014.
  21. Searching and indexing genomic databases via kernelization.
    Travis Gagie and Simon J. Puglisi. Frontiers in Bioengineering and Biotechnology, 3:12, 2015.
  22. High-order entropy-compressed text indexes.
    Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. In Proc. 14th SODA, pages 841–850, 2003.
  23. When indexing equals compression: Experiments with compressing suffix arrays and applications.
    Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. In Proc. 15th SODA, pages 636–645, 2004.
  24. Compressed suffix arrays and suffix trees with applications to text indexing and string matching.
    Roberto Grossi and Jeffrey Scott Vitter. In Proc. 32nd STOC, pages 397–406, 2000.
  25. Lempel-Ziv index for q-grams.
    Juha Kärkkäinen and Erkki Sutinen. Algorithmica, 21(1):137–154, 1998.
  26. Lempel-Ziv parsing and sublinear-size index structures for string matching.
    Juha Kärkkäinen and Esko Ukkonen. In Proc. 3rd WSP, pages 141–155, 1996.
  27. Efficient randomized pattern-matching algorithms.
    Richard M. Karp and Michael O. Rabin. IBM J. Res. Dev., 31(2):249–260, 1987.
  28. On compressing and indexing repetitive sequences.
    Sebastian Kreft and Gonzalo Navarro. Theoret. Comp. Sci., 483:115 – 133, 2013.
  29. Orthogonal range searching for text indexing.
    Moshe Lewenstein. In Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pages 267–302, 2013.
  30. Compact suffix array.
    Veli Mäkinen. In Proc. 11th CPM, pages 305–319, 2000.
  31. Storage and retrieval of highly repetitive sequence collections.
    Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. J. Comput. Bio., 17(3):281–308, 2010.
  32. Patricia—practical algorithm to retrieve information coded in alphanumeric.
    Donald R. Morrison. J. ACM, 15(4):514–534, October 1968.
  33. Indexing highly repetitive collections.
    Gonzalo Navarro. In Proc. 23rd IWOCA, pages 274–279, 2012.
  34. Compact Data Structures - A Practical Approach.
    Gonzalo Navarro. Cambridge University Press, 2016.
  35. Compressed full-text indexes.
    Gonzalo Navarro and Veli Mäkinen. ACM Comput. Surv., 39(1), 2007.
  36. Compressed full-text indexes.
    Gonzalo Navarro and Veli Mäkinen. ACM Comput. Surv., 39(1), April 2007.
  37. Exact and approximate pattern matching in the streaming model.
    Benny Porat and Ely Porat. In Proc. 50th FOCS, pages 315–323, 2009.
  38. Application of Lempel–Ziv factorization to the approximation of grammar-based compression.
    Wojciech Rytter. Theoret. Comp. Sci., 302(1–3):211 – 222, 2003.
  39. A universal algorithm for sequential data compression.
    Jacob Ziv and Abraham Lempel. IEEE Trans. Information Theory, 23(3):337–343, 1977.