# String Attractors: Verification and Optimization

Dominik Kempa University of Helsinki, Finland
dkempa@cs.helsinki.fi
Alberto Policriti University of Udine, Udine, Italy
alberto.policriti@uniud.it
Nicola Prezza Technical University of Denmark, Kgs. Lyngby, Denmark
npre@dtu.dk
Eva Rotenberg Technical University of Denmark, Kgs. Lyngby, Denmark
erot@dtu.dk
###### Abstract

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $\Gamma \subseteq [1..n]$ is a $k$-attractor for a string $S \in [1..\sigma]^n$ if and only if every distinct substring of $S$ of length at most $k$ has an occurrence straddling at least one of the positions in $\Gamma$. Finding the smallest $k$-attractor is NP-hard for $k \geq 3$, but polylogarithmic approximations can be found using reductions from dictionary compressors. It is easy to reduce the $k$-attractor problem to a set-cover instance where the string's positions are interpreted as sets of substrings. The main result of this paper is a much more powerful reduction based on the truncated suffix tree. Our new characterization of the problem leads to more efficient algorithms for string attractors: we show how to check the validity and minimality of a $k$-attractor in near-optimal time and how to quickly compute exact and approximate solutions. For example, we prove that a minimum $3$-attractor can be found in optimal $\mathcal O(n)$ time when $\sigma \in \mathcal O(\sqrt[3+\epsilon]{\log n})$ for any constant $\epsilon > 0$, and that a $2.45$-approximation can be computed in $\mathcal O(n)$ time on general alphabets. To conclude, we introduce and study the complexity of the closely-related sharp-$k$-attractor problem: to find the smallest set of positions capturing all distinct substrings of length exactly $k$. We show that the problem is in P for $k = 1, 2$ and is NP-complete for constant $k \geq 3$.

Venue: 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), July 9–13, 2018, Prague, Czech Republic. Editors: Ioannis Chatzigiannakis, Christos Kaklamanis, Daniel Marx, and Don Sannella.

## 1 Introduction

The goal of dictionary compression is to reduce the size of an input string by exploiting its repetitiveness. In the last decades, several dictionary compression techniques (some more powerful than others) were developed to achieve this goal: Straight-Line programs [17] (context-free grammars generating the string), Macro schemes [23] (a set of substring equations having the string as unique solution), the run-length Burrows-Wheeler transform [4] (a string permutation whose number of equal-letter runs decreases as the string’s repetitiveness increases), and the compact directed acyclic word graph [3, 6] (the minimization of the suffix tree). Each scheme from this family comes with its own set of algorithms and data structures to perform compressed-computation operations (e.g. random access) on the compressed representation. Despite being apparently unrelated, in [16] all these compression schemes were proven to fall under a common general scheme: they all induce a set $\Gamma \subseteq [1..n]$ of string positions whose cardinality is bounded by the compressed representation’s size and with the property that each distinct substring has an occurrence crossing at least one position in $\Gamma$. A set with this property is called a string attractor. Intuitively, positions in a string attractor capture “interesting” regions of the string; a string of low complexity (that is, more compressible) will generate a smaller attractor. Surprisingly, given such a set one can build a data structure of size $\mathcal O(|\Gamma|\,\mathrm{polylog}\, n)$ supporting random access queries in optimal time [16]: string attractors therefore provide a universal framework for performing compressed computation on top of any dictionary compressor (and even optimally for particular queries such as random access).

These premises suggest that an algorithm computing a smallest string attractor for a given string would be an invaluable tool for designing better compressed data structures. Unfortunately, computing a minimum string attractor is NP-hard. The problem remains NP-hard even under the restriction that only substrings of length at most $k$ are captured by the attractor, for any $k \geq 3$ and on large alphabets. In this case, we refer to the problem as $k$-attractor. Not all hope is lost, however: as shown in [16], dictionary compressors are actually heuristics for computing a small $n$-attractor (with polylogarithmic approximation rate w.r.t. the minimum), and, more generally, $k$-attractor admits an $\mathcal O(\log k)$-approximation based on a reduction to set cover. It is actually easy to find such a reduction: choose as universe the set of distinct substrings and as set collection the string’s positions (i.e. set $s_i$ contains all substrings straddling position $i$). The main limitation of this approach is that the set of distinct substrings could be quadratic in size; this makes the strategy of little use in cases where the goal is to design usable (i.e. as close as possible to linear-time) algorithms on string attractors.
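The naive reduction described above can be sketched in a few lines. This is only an illustration under our own conventions: the function name `naive_set_cover_instance` is hypothetical, positions are 0-based (unlike the rest of the paper), and substrings are materialized explicitly, which is exactly the source of the size blow-up discussed in the text.

```python
def naive_set_cover_instance(s, k):
    """Return (universe, sets): the naive set-cover instance for k-attractor.

    Positions are 0-based here; sets[j] holds every distinct substring of
    length at most k that has an occurrence straddling position j.
    """
    n = len(s)
    # Universe: all distinct substrings of length at most k.
    universe = {s[i:i + l] for l in range(1, k + 1) for i in range(n - l + 1)}
    sets = []
    for j in range(n):
        straddling = set()
        for l in range(1, k + 1):
            # The occurrence s[i:i+l] straddles j iff i <= j <= i + l - 1.
            for i in range(max(0, j - l + 1), min(j, n - l) + 1):
                straddling.add(s[i:i + l])
        sets.append(straddling)
    return universe, sets
```

Any sub-collection of `sets` whose union equals `universe` corresponds to a $k$-attractor made of the chosen positions.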

The core result of this paper is a much more powerful reduction from -attractor to set-cover: the universe of our instance is equal to the set of edges of the -truncated suffix tree, while the size of the set collection is bounded by the size of the -truncated suffix tree. First of all, we obtain a universe that is always at least times smaller than the naive approach. Moreover, the size of our set-cover instance does not depend on the string’s length , unless and do. This allows us to show that -attractor is actually solvable in polynomial time for small values of and , and leads us to efficient algorithms for a wide range of different problems on string attractors.

The paper is organized as follows. In Section 1.1 we describe the notation used throughout the paper and we report the main notions related to $k$-attractors. In Section 1.2 we give the main theorem stating our reduction to set-cover (Theorem 1.2) and briefly discuss the results that we obtain in the rest of the paper by applying it. Theorem 1.2 itself is proven in Section 2, together with other useful lemmas that will be used to further improve the running times of the algorithms described in Section 3. Finally, in Section 4 we introduce and study the complexity of the closely-related sharp-$k$-attractor problem: to capture all distinct substrings of length exactly $k$.

All proofs omitted for space reasons can be found in the Appendix.

### 1.1 Notation and definitions

$[i..j]$ indicates the set $\{i, i+1, \dots, j\}$. The notation $S[1..n]$ indicates a string of length $n$ with indices starting from $1$. When $A$ is an array (e.g. a string or an integer array), $A[i..j]$ indicates the sub-array $A[i], A[i+1], \dots, A[j]$.

We assume the reader to be familiar with the notions of suffix tree [24], suffix array [18], and wavelet tree [12, 21]. $\mathcal T^k(S)$ denotes the $k$-truncated suffix tree of $S$, i.e. the compact trie containing all substrings of $S$ of length at most $k$. $\mathcal E(\mathcal T)$ denotes the set of edges of the compact trie $\mathcal T$. $\mathcal L(\mathcal T)$ denotes the set of leaves at maximum string depth of the compact trie $\mathcal T$ (i.e. leaves whose string depth is equal to the maximum string depth among all leaves). Let $e$ be an edge in the (truncated) suffix tree of $S$. With $s(e)$ we denote the string read from the suffix tree root to the first character in the label of $e$. $\lambda(e) = |s(e)|$ is the length of this string. We will also refer to $\lambda(e)$ as the string depth of $e$. Note that edges of the $k$-truncated suffix tree have precisely the same labels as the suffix tree edges at string depth $\lambda(e) \leq k$. It follows that we can use these two edge sets interchangeably when we are only interested in their labels (this will be the case in our results). Let $SA[1..n]$ denote the suffix array of $S$. $\langle l_e, r_e \rangle$, with $e$ being an edge in the $k$-truncated suffix tree, will denote the suffix array range corresponding to the string $s(e)$, i.e. $SA[l_e..r_e]$ contains all suffixes prefixed by $s(e)$.

Unless otherwise specified, we give the space of our data structures in words of bits each.

With the following definition we recall the notion of -attractor of a string [16].

{definition}

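A $k$-attractor of a string $S \in \Sigma^n$ is a set of positions $\Gamma \subseteq [1..n]$ such that every substring $S[i..j]$ with $i \leq j < i + k$ has at least one occurrence $S[i'..j'] = S[i..j]$ with $j'' \in [i'..j']$ for some $j'' \in \Gamma$.

We call an $n$-attractor for $S$ simply an attractor.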
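As a sanity check of the definition, the following brute-force test can be used on small inputs. It is a direct transcription of the definition rather than an efficient algorithm; the function name `is_k_attractor` is ours, and positions are 1-based as in the paper.

```python
def is_k_attractor(s, gamma, k):
    """Brute-force test of the k-attractor definition (positions are 1-based)."""
    n = len(s)
    gamma = set(gamma)
    required = set()   # all distinct substrings of length <= k
    captured = set()   # those with an occurrence straddling a position of gamma
    for length in range(1, k + 1):
        for i in range(1, n - length + 2):      # occurrence S[i..i+length-1]
            w = s[i - 1:i - 1 + length]
            required.add(w)
            if any(i <= j < i + length for j in gamma):
                captured.add(w)
    return captured == required
```

For instance, on `"abab"` the set `{2, 3}` is a 2-attractor, while `{2}` is not, because no occurrence of `"a"` contains position 2.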

{definition}

A minimal $k$-attractor of a string $S \in \Sigma^n$ is a $k$-attractor $\Gamma$ such that $\Gamma \setminus \{j\}$ is not a $k$-attractor of $S$ for any $j \in \Gamma$.

{definition}

A minimum $k$-attractor of a string $S \in \Sigma^n$ is a $k$-attractor $\Gamma^*$ such that, for any $k$-attractor $\Gamma$ of $S$, $|\Gamma^*| \leq |\Gamma|$.
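The difference between minimal (no element can be dropped) and minimum (smallest possible cardinality) can be explored on tiny strings with the exhaustive sketch below. All function names are ours, positions are 1-based, and the search is exponential in the string length, so this is meant purely as an illustration of the two definitions.

```python
from itertools import combinations

def is_k_attractor(s, gamma, k):
    """Brute-force check: every substring of length <= k is straddled by gamma."""
    n = len(s)
    captured = {}
    for length in range(1, k + 1):
        for i in range(1, n - length + 2):
            w = s[i - 1:i - 1 + length]
            hit = any(i <= j < i + length for j in gamma)
            captured[w] = captured.get(w, False) or hit
    return all(captured.values())

def is_minimal(s, gamma, k):
    """Minimal: a k-attractor from which no single position can be removed."""
    gamma = set(gamma)
    return is_k_attractor(s, gamma, k) and not any(
        is_k_attractor(s, gamma - {j}, k) for j in gamma)

def minimum_k_attractor(s, k):
    """Minimum: try all position sets by increasing size (exponential time)."""
    positions = range(1, len(s) + 1)
    for size in range(len(s) + 1):
        for candidate in combinations(positions, size):
            if is_k_attractor(s, candidate, k):
                return set(candidate)
```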

{theorem}

[16, Thm. 8] The problem of deciding whether a string $S$ admits a $k$-attractor of size at most $t$ is NP-complete for $k \geq 3$.

Note that we can pre-process the input string $S$ so that its characters are mapped into the range $[1..n]$. This transformation can be computed in linear time and space. It is easy to see that a set is a $k$-attractor for $S$ if and only if it is a $k$-attractor for the transformed string. It follows that we do not need to put any restriction on the alphabet of the input string, and in the rest of the paper we can assume that the alphabet is $\Sigma = [1..\sigma]$, with $\sigma \leq n$.
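A minimal sketch of this alphabet remapping follows. The function name `remap_alphabet` is hypothetical, and for simplicity the sketch sorts the distinct symbols, so it does not achieve the linear time mentioned above; it only illustrates that the transformed string preserves equality between characters, which is all that the attractor definition depends on.

```python
def remap_alphabet(s):
    """Map the characters of s to integers 1..sigma, preserving their order.

    Returns the remapped string (as a list of ints) and sigma.
    Sorting the distinct symbols is a simplification of this sketch.
    """
    ranks = {c: r for r, c in enumerate(sorted(set(s)), start=1)}
    return [ranks[c] for c in s], len(ranks)
```

For example, `remap_alphabet("banana")` maps `a`, `b`, `n` to `1`, `2`, `3`, respectively.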

### 1.2 Overview of the contributions

Our main theorem is a reduction to set-cover based on the notion of truncated suffix tree:

{theorem}

Let , with . -attractor can be reduced to a set-cover instance with universe and set collection of size .

Figure 1 depicts the main technique (Lemma 2) standing at the core of our reduction: a set is a valid attractor if and only if it marks (or colours, in the picture), all suffix tree edges.

Using the reduction of Theorem 1.2, we obtain the following results. First, we present efficient algorithms to check the validity and minimality of a -attractor. Note that it is trivial to perform these checks in cubic time (or quadratic with little more effort). In Theorem 3.1 we show that we can check whether a set is a valid -attractor for in time and words of space. Using recent advances in compact data structures, we show how to further reduce this working space to bits without affecting query times when , or with a small time penalty in the general case. In particular, when is constant we can always check the correctness of a -attractor in time and bits of space. With similar techniques, in Theorem 3.2 we show how to verify that is a minimal -attractor for in near-optimal time.

In Theorem 3.1 we show that the structure used in Theorems 3.1 and 3.2 can be augmented (using a recent result on weighted ancestors on suffix trees [11]) to support reporting all occurrences of a substring straddling a position in in optimal time.

We then focus on optimization problems. In Theorem 3.3 we show that a minimum -attractor can be found in time. Similarly, in Theorem 3.4 we show that a minimal -attractor can be found in expected time. With Theorem 3.4 we show that minimal -attractors are within a factor of from the optimum, therefore proving that Theorem 3.4 actually yields an approximation algorithm. In Theorem 3.5 we show that within the same time of Theorem 3.4 we can compute an exponentially-better -approximation.

Theorems 3.3, 3.4, and 3.5 yield the following corollaries:

{corollary}

-attractor is in P when .

{corollary}

For constant , a minimum -attractor can be found in optimal time when , for any constant .

###### Proof.

Pick any constant . Then, , where is a constant. On the other hand, for any constant we have that . It follows that , i.e. by Theorem 3.3 we can find a minimum -attractor in linear time. ∎

With our new results we can, for example (keep in mind that -attractor is NP-complete for general alphabets):

• Find a minimum -attractor in time when , for any .

• Find a minimal -attractor in expected time. This, by Theorem 3.4, is a -approximation to the minimum.

• Find a -approximation of the minimum -attractor in expected time.

To conclude, in Section 4 we study the sharp-$k$-attractor problem: to find a smallest set of positions covering all substrings of length exactly $k$. In Theorem 4 we show that the problem is NP-complete for constant $k \geq 3$, while in Theorem 4 we give a polynomial-time algorithm for the case $k = 2$.

## 2 A better reduction to set-cover

In this section we give our main result: a smaller reduction from $k$-attractor to set-cover. We start with an alternative characterization of $k$-attractors based on the $k$-truncated suffix tree.

{definition}

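[Marker] A position $j \in [1..n]$ is a marker for a suffix tree edge $e$ if and only if

$$\exists\, i \in SA[l_e..r_e] \;:\; i \leq j < i + \lambda(e),$$

where $SA[l_e..r_e]$ is the suffix array range of the string $s(e)$ and $\lambda(e) = |s(e)|$ is the string depth of $e$. Equivalently, we say that $j$ marks $e$.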

{definition}

[Edge marking] A set $\Gamma \subseteq [1..n]$ marks a suffix tree edge $e$ if and only if there exists a $j \in \Gamma$ that marks $e$.

{definition}

[Suffix tree $k$-marking] A set $\Gamma \subseteq [1..n]$ is a suffix tree $k$-marking if and only if it marks every edge $e$ such that $\lambda(e) \leq k$ (equivalently, every edge of the $k$-truncated suffix tree).

When $k = n$ we simply say suffix tree marking (since all edges satisfy $\lambda(e) \leq n$). We now show that the notions of $k$-attractor and suffix tree $k$-marking are equivalent.

{lemma}

A set $\Gamma \subseteq [1..n]$ is a $k$-attractor if and only if it is a suffix tree $k$-marking.

###### Proof.

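First, let $\Gamma$ be a $k$-attractor. Pick any suffix tree edge $e$ such that $\lambda(e) \leq k$. Then, $|s(e)| \leq k$ and, by definition of $k$-attractor, there exist a $j \in \Gamma$ and an $i$ such that $S[i..i+\lambda(e)-1] = s(e)$ and $i \leq j < i + \lambda(e)$. We also have that $i \in SA[l_e..r_e]$ ($SA[l_e..r_e]$ being precisely the suffix array range of suffixes prefixed by $s(e)$). Putting these results together, we found an $i \in SA[l_e..r_e]$ such that $i \leq j < i + \lambda(e)$ for some $j \in \Gamma$, which by Definition 2 means that $\Gamma$ marks $e$. Since the argument works for any edge at string depth at most $k$, we obtain that $\Gamma$ is a suffix tree $k$-marking.

Conversely, let $\Gamma$ be a suffix tree $k$-marking. Let moreover $s$ be a substring of $S$ of length at most $k$. Consider the lowest suffix tree edge $e$ (i.e. with maximum $\lambda(e)$) such that $s(e)$ prefixes $s$. In particular, $\lambda(e) \leq |s| \leq k$. Note that, by definition of suffix tree, every occurrence $S[i..i+\lambda(e)-1]$ of $s(e)$ in $S$ prefixes an occurrence of $s$: $S[i..i+|s|-1] = s$. By definition of $k$-marking, there exists a $j \in \Gamma$ such that $j$ is a marker for $e$, which means (by Definition 2) that there is an $i \in SA[l_e..r_e]$ with $i \leq j < i + \lambda(e)$. Since $i \in SA[l_e..r_e]$, $i$ is an occurrence of $s(e)$, and therefore of $s$. But then, we have that $i \leq j < i + \lambda(e) \leq i + |s|$, i.e. $S[i..i+|s|-1]$ is an occurrence of $s$ straddling $j$. Since the argument works for every substring $s$ of $S$ of length at most $k$, we obtain that $\Gamma$ is a $k$-attractor. ∎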

An equivalent formulation of Lemma 2 is that $\Gamma$ is a $k$-attractor if and only if it marks all edges of the $k$-truncated suffix tree. In particular (case $k = n$), $\Gamma$ is an attractor if and only if it is a suffix tree marking.
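The equivalence between attractors and markings can be checked experimentally on small strings. The sketch below rests on assumptions of ours: the input string ends with a unique sentinel character (here `$`), as is customary for suffix trees; the edge strings are enumerated naively from the set of substrings instead of building a truncated suffix tree; and all function names are hypothetical. Positions are 1-based.

```python
def edge_strings(s, k):
    """Strings u+c read from the root to the first character of each edge of
    the k-truncated suffix tree (s is assumed to end with a unique sentinel)."""
    n = len(s)
    subs = {s[i:i + l] for l in range(1, k + 1) for i in range(n - l + 1)}
    children = {}                       # string u -> set of characters following u
    for w in subs:
        children.setdefault(w[:-1], set()).add(w[-1])
    # An edge starts below the root or below a branching node.
    return {w for w in subs if len(w) == 1 or len(children[w[:-1]]) >= 2}

def occurrences(s, w):
    """1-based starting positions of w in s."""
    return [i for i in range(1, len(s) - len(w) + 2) if s.startswith(w, i - 1)]

def is_k_marking(s, gamma, k):
    """gamma marks every edge: some occurrence i of the edge string has i <= j < i + len."""
    return all(any(i <= j < i + len(w) for i in occurrences(s, w) for j in gamma)
               for w in edge_strings(s, k))

def is_k_attractor(s, gamma, k):
    """Reference check over all distinct substrings of length <= k."""
    subs = {s[i:i + l] for l in range(1, k + 1) for i in range(len(s) - l + 1)}
    return all(any(i <= j < i + len(w) for i in occurrences(s, w) for j in gamma)
               for w in subs)
```

On `"abcba$"` with $k = 2$, only 8 edge strings need to be checked instead of 9 distinct substrings, and the two predicates agree on every subset of positions.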

Lemma 2 will be used to obtain a smaller universe in our set-cover reduction. With the following lemmas we show that the size of the set collection can also be considerably reduced when $\sigma$ and $k$ are small.

{definition}

[$k$-equivalence] Two positions $i, j \in [1..n]$ are $k$-equivalent, indicated as $i \equiv_k j$, if and only if

$$S'[i-k+1..i+k-1] = S'[j-k+1..j+k-1]$$

where $S'[i] = \#$ if $i < 1$ or $i > n$ (note that we allow negative positions) and $S'[i] = S[i]$ otherwise, and $\#$ is a new character.
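A small sketch of how the equivalence classes can be computed by grouping positions on their padded context of length $2k-1$. The function names and the default padding character `#` are our own choices, and the sketch assumes that the padding character does not occur in the input string. Positions are 1-based.

```python
def k_equivalence_classes(s, k, pad="#"):
    """Group positions 1..n by their context S'[i-k+1..i+k-1] (pad must not occur in s)."""
    n = len(s)
    padded = pad * (k - 1) + s + pad * (k - 1)
    classes = {}
    for i in range(1, n + 1):
        # S'[i-k+1] sits at 0-based index i-1 of the padded string.
        context = padded[i - 1:i - 1 + 2 * k - 1]
        classes.setdefault(context, []).append(i)
    return classes

def candidate_positions(s, k, pad="#"):
    """One representative (the minimum) per equivalence class."""
    return sorted(min(c) for c in k_equivalence_classes(s, k, pad).values())
```

For `"aaaa"` and $k = 2$, positions 2 and 3 share the context `aaa`, so only three candidate positions remain.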

It is easy to see that -equivalence is an equivalence relation. First, we bound the size of the distinct equivalence classes of (i.e. the size of the quotient set ).

{lemma}

We now show that any minimal -attractor can have at most one element from each equivalence class of .

{lemma}

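If $\Gamma$ is a minimal $k$-attractor, then for any $1 \leq i \leq n$ it holds $|\Gamma \cap [i]_{\equiv_k}| \leq 1$, where $[i]_{\equiv_k}$ is the equivalence class of $i$.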

Moreover, if we swap any element of a -attractor with an equivalent element then the resulting set is still a -attractor:

{lemma}

Let $\Gamma$ be a $k$-attractor. Then, $(\Gamma \setminus \{j\}) \cup \{j'\}$ is a $k$-attractor for any $j \in \Gamma$ and any $j' \equiv_k j$.

###### Proof.

Pick any occurrence of a substring $s$, $|s| \leq k$, straddling position $j$. By definition of $\equiv_k$, since $j \equiv_k j'$ there is also an occurrence of $s$ straddling $j'$. This implies that $(\Gamma \setminus \{j\}) \cup \{j'\}$ is a $k$-attractor. ∎

Lemmas 2 and 2 imply that we can reduce the set of candidate positions from to (that is, an arbitrary representative—in this case, the minimum—from any class of ), and still be able to find a minimal/minimum -attractor. Note that, by Lemma 2, .

We can now prove our main theorem.

###### Proof of Theorem 1.2.

We build our set-cover instance as follows. We choose , i.e. the set of edges of the -truncated suffix tree. The set collection is defined as follows. Let and . Then, we choose

$$\mathcal S = \{ s_i \;:\; i \in \mathcal C \}$$

By the way we defined , each is unambiguously identified by a substring of length of the string . We therefore obtain . We now prove correctness and completeness of the reduction.

Correctness By the definition of our reduction, a solution to yields a set of positions marking every edge in . Then, Lemma 2 implies that is a -attractor.

Completeness Let be a minimal -attractor. Then, Lemmas 2 and 2 imply that the following set is also a minimal -attractor of the same size: . Note that . By Lemma 2, marks every edge in . Then, by definition of our reduction the collection covers .

In the rest of the paper, we use the notation and to denote the universe to be covered (edges of the -truncated suffix tree) and the candidate attractor positions, respectively. Recall moreover that and .

### 2.1 Marker graph

In this section we introduce a graph that will play a crucial role in our approximation algorithms of Sections 3.4 and 3.5: our algorithms will take a time linear in the graph’s size to compute a solution. Intuitively, this graph represents the inclusion relations of the set-cover instance of Theorem 1.2.

{definition}

[Marker graph]

Given a positive integer , the marker graph of string is a bipartite undirected graph , where the set of edges is defined as

$$E = \{ \langle j, e \rangle \;:\; j \text{ marks } e \}$$
{lemma}

{lemma}

can be computed in expected time.

Putting our bounds together, we obtain:

{corollary}

Let . Then, takes space and can be built in expected time.

###### Proof.

From Lemma 2.1, . By Theorem 1.2, this space is . Since and , this space simplifies to .

Finally, note that , so the running time of Lemma 2.1 is . ∎

## 3 Faster algorithms

In this section we use properties of our reduction to provide faster algorithms for a range of problems: (i) checking that a given set is a $k$-attractor, (ii) checking that a given set is a minimal $k$-attractor, (iii) finding a minimum $k$-attractor, (iv) finding a minimal $k$-attractor, and (v) approximating a minimum $k$-attractor. We note that problems (i)-(ii) admit naive cubic solutions, while problem (iii) is NP-hard for $k \geq 3$ [16].

### 3.1 Checking the attractor property

Given a string $S$, a set $\Gamma \subseteq [1..n]$, and an integer $k \geq 1$, is $\Gamma$ a $k$-attractor for $S$? We show that this question can be answered in $\mathcal O(n)$ time.

The main idea is to use Lemma 2 and check, for every suffix tree edge $e$ at string depth at most $k$, if $\Gamma$ marks $e$. Consider the suffix array $SA[1..n]$ of $S$ and the array $A[1..n]$ defined as follows: $A[i] = \mathrm{succ}(\Gamma, SA[i]) - SA[i]$, where $\mathrm{succ}(X, x)$ returns the smallest element larger than or equal to $x$ in the set $X$ (i.e. $A[i]$ is the distance between $SA[i]$ and the element of $\Gamma$ following, and possibly including, $SA[i]$). $A$ can be built in linear time and space by creating a bit-vector $B[1..n]$ such that $B[i] = 1$ iff $i \in \Gamma$ and pre-processing $B$ for constant-time successor queries [15, 5]. We build a range-minimum data structure (RMQ) on $A$ ($2n + o(n)$ bits of space, constant query time [10]). Then, for every suffix tree edge $e$ such that $\lambda(e) \leq k$, we check (in constant time) that $A[\mathrm{RMQ}(l_e, r_e)] < \lambda(e)$. The following lemma ensures that this is equivalent to checking whether $\Gamma$ marks $e$.

{lemma}

$A[\mathrm{RMQ}(l_e, r_e)] < \lambda(e)$ if and only if $\Gamma$ marks $e$.

###### Proof.

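Assume that $A[\mathrm{RMQ}(l_e, r_e)] < \lambda(e)$. By definition of $A$, this means that there exist an index $i \in [l_e..r_e]$ and a $j \in \Gamma$, with $j \geq SA[i]$, such that $j - SA[i] < \lambda(e)$. Equivalently, $SA[i] \leq j < SA[i] + \lambda(e)$, i.e. $\Gamma$ marks $e$.

Assume that $\Gamma$ marks $e$. Then, by definition there exist an index $i \in [l_e..r_e]$ and a $j \in \Gamma$ such that $SA[i] \leq j < SA[i] + \lambda(e)$. Then, $j - SA[i] < \lambda(e)$. Since $A[i]$ is computed taking the $j' \in \Gamma$, $j' \geq SA[i]$, minimizing $j' - SA[i]$, it must be the case that $A[i] \leq j - SA[i] < \lambda(e)$. Since $A[\mathrm{RMQ}(l_e, r_e)] \leq A[i]$, this implies that $A[\mathrm{RMQ}(l_e, r_e)] < \lambda(e)$. ∎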

Together, Lemmas 2 and 3.1 imply that, if $A[\mathrm{RMQ}(l_e, r_e)] < \lambda(e)$ for every edge $e$ at string depth at most $k$, then $\Gamma$ is a $k$-attractor for $S$. Since the suffix tree, as well as the other structures used by our algorithm, can be built in linear time and space on alphabet $[1..n]$ [8] and checking each edge takes constant time, we obtain that the problem of checking whether a set $\Gamma \subseteq [1..n]$ is a valid $k$-attractor can be solved in optimal $\mathcal O(n)$ time and $\mathcal O(n)$ words of space. We now show how to improve upon this working space by using recent results in the field of compact data structures. In the following result, we assume that the input string is packed in $n \log \sigma$ bits (that is, $\mathcal O(n / \log_\sigma n)$ words).
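The verification strategy based on the successor array and range minima can be prototyped as follows. This is a simplified sketch and not the linear-time algorithm: the suffix array is built by naive sorting, the range minimum is computed by scanning the range instead of using an RMQ structure, and suffix array ranges are enumerated for all distinct substrings of length at most $k$ rather than only for suffix tree edges (a superset that leaves the answer unchanged). The function name is ours; positions are 1-based.

```python
from bisect import bisect_left
from itertools import groupby

def is_k_attractor_via_successor_array(s, gamma, k):
    n = len(s)
    # Naive suffix array: 1-based starting positions in lexicographic order.
    sa = sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    g = sorted(gamma)

    def succ(x):
        """Smallest element of gamma that is >= x, or infinity if none exists."""
        p = bisect_left(g, x)
        return g[p] if p < len(g) else float("inf")

    # a[r] = distance from SA[r] to the next attractor position (possibly 0).
    a = [succ(i) - i for i in sa]

    for length in range(1, k + 1):
        start = 0
        # Suffixes sharing the same prefix of this length are contiguous in sa.
        for prefix, group in groupby(sa, key=lambda i: s[i - 1:i - 1 + length]):
            size = len(list(group))
            # Suffixes shorter than `length` do not define a substring: skip them.
            if len(prefix) == length and min(a[start:start + size]) >= length:
                return False          # no occurrence of `prefix` straddles gamma
            start += size
    return True
```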

We first need the following Lemma from [2]:

{lemma}

[2, Thm. 3] In time and bits of space we can enumerate the following information for each suffix tree edge :

• The suffix array range of the string , and

• the length of .

We can now prove our theorem. Note that the input set can be encoded in bits, so also the input fits in bits.

{theorem}

Given a string , a set , and an integer , we can check whether is a -attractor for in:

• Optimal bits of space and time, for any constant , or

• bits of space and time.

###### Proof.

To achieve the first trade-off we will replace the array (occupying bits) with a smaller data structure supporting random access to . We start by replacing the standard suffix array with a compressed suffix array (CSA) [9, 13]. Given a text stored in bits, the CSA can be built in deterministic time and optimal bits of space [20], and supports access queries to the suffix array in time [13], for any constant chosen at construction time. Given that and we can compute the successor function in constant time using a -bit data structure (array ), can be computed in time. Using access to , the RMQ data structure (occupying bits) can be built in time and bits of space [10, Thm. 5.8]. At this point, observe that the order in which we visit suffix tree edges does not affect the correctness of our algorithm. By using Lemma 3.1 we can enumerate and for every suffix tree edge in linear time and compact space, and check whenever (Lemma 3.1).

To achieve the second trade-off we observe that in our algorithm we only explore the suffix tree up to depth (i.e. we only perform the check of Lemma 3.1 when ), hence any can be replaced with without affecting the correctness of the verification procedure. In this way, array can be stored in just bits. To compute the array in time and compact space we observe that it suffices to access all pairs in any order (not necessarily ). From [2, Thm. 10], in time and bits of space we can build a compressed suffix array supporting constant-time LF function computation. By repeatedly applying LF from the first suffix array position, we enumerate entries of the inverse suffix array in right-to-left order in time [2, Lem. 1]. This yields the sequence of pairs , for and , which can be used to compute in linear time and compact space. As in the first trade-off, we use Lemma 3.1 to enumerate and for every suffix tree edge , and check whenever (Lemma 3.1). ∎

Note that with the second trade-off of Theorem 3.1 we achieve time and optimal -bits of space when (in particular, this is always the case when is constant). Note also that, since we now assume that the input string is packed in words, the running time is not optimal (being a lower-bound in this model).

As a by-product of Theorem 3.1, we obtain a data structure able to report, given a substring , all occurrences of straddling a position in .

{theorem}

Let be a string and be an attractor for . In space we can build a structure of size words supporting the following query: given a range , report all (or at most any number of) positions such that and for some . Every such is reported in constant time.

### 3.2 Checking minimality

Given a string , a set , and an integer , is a minimal -attractor for ? The main result of this section is that this question can be answered in almost-optimal time.

We first show that minimal -attractors admit a convenient characterization based on the concept of suffix tree -marking.

{definition}

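[$k$-necessary position] A position $j \in \Gamma$ is $k$-necessary with respect to a set $\Gamma'$, with $\Gamma \subseteq \Gamma' \subseteq [1..n]$, if and only if there is at least one suffix tree edge $e$ such that:

1. $\lambda(e) \leq k$,

2. $j$ marks $e$, and

3. if $j' \in \Gamma'$ marks $e$, then $j' = j$.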

{definition}

[$k$-necessary set] A set $\Gamma \subseteq [1..n]$ is $k$-necessary if and only if all its elements are $k$-necessary with respect to $\Gamma$.

A remark: we give Definition 3.2 with respect to a general superset $\Gamma'$ of $\Gamma$, but for now (Definition 3.2) we limit our attention to the case $\Gamma' = \Gamma$. Later (Theorem 3.4) we will need the more general definition. For simplicity, in the proofs of the following two theorems we just say $k$-necessary (referring to some $j \in \Gamma$) instead of $k$-necessary with respect to $\Gamma$.

{lemma}

A set $\Gamma$ is a minimal $k$-attractor if and only if:

1. it is a $k$-attractor, and

2. it is $k$-necessary.

A naive solution for the minimality-checking problem is to test the $k$-attractor property on $\Gamma \setminus \{j\}$ for every $j \in \Gamma$ using Theorem 3.1. This solution, however, runs in quadratic time. Our efficient strategy consists in checking, for every suffix tree edge $e$, if there is only one $j \in \Gamma$ marking it. In this case, we flag $j$ as necessary. If, in the end, all attractor positions are flagged as necessary, then the attractor is minimal by Lemma 3.2.
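The flagging idea can be illustrated with the following sketch, which is not the efficient algorithm of this section: for simplicity it collects markers for every distinct substring of length at most $k$ instead of every suffix tree edge, and it does so by brute force. A position is flagged as necessary when it is the only attractor position straddled by the occurrences of some substring. The function name is ours; positions are 1-based.

```python
def is_minimal_k_attractor(s, gamma, k):
    """Check minimality by flagging positions that are the sole marker of a substring."""
    n = len(s)
    gamma = set(gamma)
    markers = {}   # substring -> attractor positions straddled by its occurrences
    for length in range(1, k + 1):
        for i in range(1, n - length + 2):
            w = s[i - 1:i - 1 + length]
            markers.setdefault(w, set()).update(
                j for j in gamma if i <= j < i + length)
    if any(not m for m in markers.values()):
        return False                      # some substring is not captured at all
    # Positions that are the only marker of at least one substring.
    necessary = {next(iter(m)) for m in markers.values() if len(m) == 1}
    return necessary == gamma
```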

Unfortunately, the following simple strategy does not work: for every suffix tree edge $e$, report, with the structure of Theorem 3.1, two distinct occurrences of the string $s(e)$ that straddle an attractor position. The reason why this does not work is that, even if we find two such occurrences, the attractor position that they straddle could be the same (this could happen e.g. if $s(e)$ is periodic). Our solution is, therefore, a bit more involved.

{theorem}

Given a string , a set , and an integer , we can check whether is a minimal -attractor for in time and space.

### 3.3 Computing a minimum k-attractor

Computing a minimum $k$-attractor is NP-hard for $k \geq 3$ and general $\sigma$. In this section we show that the problem is actually polynomial-time-solvable for small $k$ and $\sigma$. Our algorithm takes advantage of both our reduction to set-cover and the optimal verification algorithm of Theorem 3.1.

First, we give an upper-bound to the cardinality of the set of all minimal -attractors. This will speed up our procedure for finding a minimum -attractor (which must be, in particular, minimal). By Lemma 2, there are no more than -attractors for . With the following lemma, we give a better upper-bound to the number of minimal -attractors.

{lemma}

There cannot be more than minimal -attractors.

Using the above lemma, we now provide a strategy to find a minimum -attractor.

{theorem}

A minimum -attractor can be found in time.

### 3.4 Computing a minimal k-attractor

In this section we provide an algorithm to find a minimal -attractor, and then show that such a solution is a -approximation to the optimum.

{theorem}

A minimal -attractor can be found in expected time.

We now give the approximation ratio of minimal attractors, therefore showing that the strategy of Theorem 3.4 yields an approximation algorithm.

{theorem}

Any minimal -attractor is a -approximation to the minimum -attractor.

### 3.5 Better approximations to the minimum k-attractor

From [16], we can compute poly-logarithmic approximations to the smallest attractor in linear time using reductions from dictionary compression techniques. This strategy, however, works only for -attractors.

In [16, Thm. 10], the authors show that a simple reduction to -set cover allows one to compute in polynomial time a -approximation to the minimum -attractor, where is the -th harmonic number. This approximation ratio is at most for (and case is trivial to solve optimally in linear time). The key observation of [16, Thm. 10] is that we can view each text position as a set containing all distinct -mers, with , overlapping the position. Then, solving -attractor is equivalent to covering the universe set of all distinct substrings of length at most using the smallest possible number of sets . This is, precisely, a -set cover instance (since for all ), which can be approximated within a factor of using the greedy algorithm that at each step chooses the that covers the largest number of uncovered universe elements. A naive implementation of this procedure, however, runs in cubic time. We now show how to efficiently implement this greedy algorithm over the reduction of Theorem 1.2.
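The greedy rule described above can be prototyped directly on the naive instance, where each position is the set of distinct substrings of length at most $k$ that straddle it. This sketch is the straightforward (slow) implementation, not the efficient one developed in this section; the function name is ours, positions are 1-based, and ties are broken in favour of the smallest position.

```python
def greedy_k_attractor(s, k):
    """Greedy set cover: repeatedly pick the position covering most uncovered substrings."""
    n = len(s)
    sets = {
        j: {s[i - 1:i - 1 + l]
            for l in range(1, k + 1)
            # occurrences S[i..i+l-1] that contain position j
            for i in range(max(1, j - l + 1), min(j, n - l + 1) + 1)}
        for j in range(1, n + 1)
    }
    uncovered = set().union(*sets.values())
    gamma = []
    while uncovered:
        # max() returns the first maximum, i.e. the smallest such position.
        best = max(sets, key=lambda j: len(sets[j] & uncovered))
        gamma.append(best)
        uncovered -= sets[best]
    return sorted(gamma)
```

On `"abab"` with $k = 2$ the procedure first picks position 2 (covering `b`, `ab`, `ba`) and then position 1 (covering `a`).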

We first give a lemma needed to achieve our result.

{lemma}

Let be some universe of elements and be a function assigning a priority to each element of . In time we can build a priority queue initialized with elements of such that, later, we can perform any sequence of of the following operations in expected time:

• : return the with largest and remove from .

• : update , for a .

At any point in time, the size of is of words.
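One simple way to obtain a queue of this kind is bucketing by priority. The sketch below is only an illustration under assumptions of ours, not the structure used in the proof of the lemma: priorities are non-negative integers, updates can only lower a priority, and Python's hash-based `dict` and `set` stand in for the hashing that yields expected running times. The class and method names are hypothetical.

```python
class BucketQueue:
    """Max-priority queue for integer priorities that can only decrease."""

    def __init__(self, priorities):
        self.prio = dict(priorities)               # element -> current priority
        self.top = max(self.prio.values(), default=0)
        self.buckets = [set() for _ in range(self.top + 1)]
        for x, p in self.prio.items():
            self.buckets[p].add(x)

    def pop_max(self):
        """Remove and return an element with the largest priority."""
        # The pointer only moves downwards because priorities never increase.
        while self.top >= 0 and not self.buckets[self.top]:
            self.top -= 1
        if self.top < 0:
            raise KeyError("pop from an empty queue")
        x = self.buckets[self.top].pop()
        del self.prio[x]
        return x

    def decrease(self, x, new_priority):
        """Lower the priority of x to new_priority."""
        old = self.prio[x]
        if not 0 <= new_priority <= old:
            raise ValueError("priorities can only decrease")
        self.buckets[old].discard(x)
        self.buckets[new_priority].add(x)
        self.prio[x] = new_priority
```

In a greedy set-cover loop, the priority of a set would be the number of universe elements it still covers, which can only decrease as elements get covered.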

{theorem}

A -approximation of a minimum -attractor can be computed in expected time, where is the -th harmonic number.

Note that the approximation ratio of Theorem 3.5 is .

## 4 k-sharp attractors

In this section we consider a natural variant of string attractors we call -sharp attractors, and we prove some results concerning their computational complexity.

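Formally, we define a $k$-sharp attractor of a string $S \in \Sigma^n$ to be a set of positions $\Gamma \subseteq [1..n]$ such that every substring $S[i..j]$ with $j - i + 1 = k$ has an occurrence $S[i'..j']$ with $j'' \in [i'..j']$ for some $j'' \in \Gamma$. In other words, a $k$-sharp-attractor is a subset of positions that covers all substrings of length exactly $k$.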
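The definition can be tested by brute force on small inputs with the sketch below (the function name is ours; positions are 1-based). It differs from a $k$-attractor test only in that substrings of length exactly $k$ are considered.

```python
def is_k_sharp_attractor(s, gamma, k):
    """Brute-force test: every substring of length exactly k is straddled by gamma."""
    n = len(s)
    captured = {}
    for i in range(1, n - k + 2):               # occurrence S[i..i+k-1]
        w = s[i - 1:i - 1 + k]
        hit = any(i <= j < i + k for j in gamma)
        captured[w] = captured.get(w, False) or hit
    return all(captured.values())
```

For example, on `"abab"` the set `{2}` is a 2-sharp attractor (both `ab` and `ba` have an occurrence containing position 2) but not a 1-sharp attractor, since no occurrence of `a` contains position 2.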

By Minimum--Sharp-Attractor we denote the optimization problem of finding the smallest -sharp attractor of a given input string. By we denote the corresponding decision problem. The NP-completeness of -Sharp-Attractor for constant is obtained by a reduction from -SetCover problem that is NP-complete [7] for any constant : given integer and a collection of subsets of the universe set such that , and for any , , return YES iff there exists a subcollection such that and .

{theorem}

For any constant , -Sharp-Attractor is NP-complete.

###### Proof idea.

We obtain our reduction as follows (see the Appendix for full details). For any constant , given an instance of -SetCover we build a string of length , where denotes the number of sets in the input collection, with the following property: has a cover of size if and only if has a -sharp-attractor of size , where . The proof follows [16, Thm. 8], but the main gadgets are slightly different.

The main idea is to let each element correspond to a substring, which is repeated once for every set containing . We construct a substring for each set such that can always be covered by attractor positions, corresponding to choosing the set, but can be covered by attractor positions when all elements of the set already belong to chosen sets. And, indeed requires attractor positions. Finally, the substrings are padded and concatenated to form one long string . ∎

Denote the size of the minimum -sharp-attractor of string by . Observe that the above theorem is proved for constant values of . This is because, unlike for general -attractors, is not monotone with respect to . This becomes apparent when we observe that for all , hence e.g. for we have . Note also that our reduction requires a large alphabet.

Interestingly, however, for $k = 2$ the $k$-sharp-attractor problem admits a polynomial-time algorithm. Note that such a result is not known for $k$-attractors (the case $k = 2$ being the only one still open).

{theorem}

Minimum $2$-sharp-attractor is in P.

###### Proof.

It is easy to show that -sharp-attractor is in by a reduction to edge cover. Given a string , let be the set of strings of length that occur at least once in . For every substring of length of the form , add the edge to the edge-set , and add self-loops for the first and last pair.

A position thus corresponds to an edge, , and it is easy to see that is a -sharp-attractor if and only if is an edge cover.

The number of vertices and edges in this graph are both , so a minimum edge cover can be found in time [19]. ∎

## References

• [1] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Symposium on Theory of Computing, STOC 2014, New York, NY, USA, May 31 - June 03, 2014, pages 148–193. ACM, 2014.
• [2] Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. arXiv preprint arXiv:1609.06378, 2016.
• [3] Anselm Blumer, Janet Blumer, David Haussler, Ross McConnell, and Andrzej Ehrenfeucht. Complete inverted files for efficient text retrieval and analysis. Journal of the ACM (JACM), 34(3):578–595, 1987.
• [4] Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm, 1994.
• [5] David Clark. Compact Pat trees. PhD thesis, University of Waterloo, 1998.
• [6] Maxime Crochemore and Renaud Vérin. Direct construction of compact directed acyclic word graphs. In Combinatorial Pattern Matching, pages 116–129. Springer, 1997.
• [7] Rong-chii Duh and Martin Fürer. Approximation of k-set cover by semi-local optimization. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing, El Paso, Texas, USA, May 4-6, 1997, pages 256–264, 1997.
• [8] Martin Farach. Optimal suffix tree construction with large alphabets. In Foundations of Computer Science, 1997. Proceedings., 38th Annual Symposium on, pages 137–143. IEEE, 1997.
• [9] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, 2005.
• [10] J. Fischer and V. Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM Journal on Computing, 40(2):465–492, 2011.
• [11] Paweł Gawrychowski, Moshe Lewenstein, and Patrick K Nicholson. Weighted ancestors in suffix trees. In European Symposium on Algorithms, pages 455–466. Springer, 2014.
• [12] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 841–850. Society for Industrial and Applied Mathematics, 2003.
• [13] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput., 35(2):378–407, 2005.
• [14] Prosenjit Gupta, Ravi Janardan, and Michiel Smid. Further results on generalized intersection searching problems: counting, reporting, and dynamization. Journal of Algorithms, 19(2):282–317, 1995.
• [15] Guy Joseph Jacobson. Succinct static data structures. PhD thesis, Carnegie Mellon University, 1988.
• [16] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In Proceedings of the 50th annual ACM symposium on Theory of computing (to appear). ACM, 2018.
• [17] J. C. Kieffer and E.-H. Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000.
• [18] Udi Manber and Gene Myers. Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993.
• [19] Silvio Micali and Vijay V. Vazirani. An O(sqrt(|V|)|E|) algorithm for finding maximum matching in general graphs. In Proceedings of the 21st Annual Symposium on Foundations of Computer Science, SFCS ’80, pages 17–27, Washington, DC, USA, 1980. IEEE Computer Society.
• [20] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 408–424. SIAM, 2017.
• [21] Gonzalo Navarro. Wavelet trees for all. Journal of Discrete Algorithms, 25:2–20, 2014.
• [22] Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016.
• [23] James A Storer and Thomas G Szymanski. Data compression via textual substitution. Journal of the ACM (JACM), 29(4):928–951, 1982.
• [24] Peter Weiner. Linear pattern matching algorithms. In Switching and Automata Theory, 1973. SWAT’08. IEEE Conference Record of 14th Annual Symposium on, pages 1–11. IEEE, 1973.

## Appendix A Appendix

###### Proof of Lemma 2.

By the way we defined , the set has one element per distinct substring of length in , that is, per distinct path from the suffix tree root to each of the nodes in . Clearly, . On the other hand, there are at most distinct substrings of length on . There are other additional substrings to consider on the borders of (to include the runs of symbol ). It follows that the cardinality of is upper-bounded also by . ∎

###### Proof of Lemma 2.

Suppose, by contradiction, that for some . Then, let , with . By definition of , . This means that if a substring of of length at most has an occurrence straddling position in then it has also one occurrence straddling position . On the other hand, any other substring occurrence straddling any position is also captured by since belongs to this set. This implies that is a -attractor, which contradicts the minimality of . ∎

###### Proof of Lemma 2.1.

Clearly, . Note that each position crosses at most distinct substrings of length at most in , therefore this is also an upper-bound to the number of suffix tree edges it can mark. It follows that there are at most edges in sharing a fixed , which implies . ∎
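
The counting step can be illustrated directly on substrings rather than on suffix tree edges. The following sketch is a simplification under stated assumptions: it enumerates, for a position j, the distinct substrings of length at most k with an occurrence containing j, and compares their number with the elementary bound k(k+1)/2 (at most l starting points for each length l). That explicit bound, the function names, and the example strings are assumptions of this sketch and are not taken from the lemma; positions are 0-based.

```python
def straddling_substrings(text, j, k):
    """Distinct substrings of length <= k having an occurrence
    text[start:start+length] that contains position j."""
    n = len(text)
    found = set()
    for length in range(1, k + 1):
        first = max(0, j - length + 1)      # occurrence must reach j
        last = min(j, n - length)           # and must fit inside text
        for start in range(first, last + 1):
            found.add(text[start:start + length])
    return found

def is_k_attractor(text, positions, k):
    """True if every distinct substring of length <= k has an
    occurrence containing at least one of the given positions."""
    n = len(text)
    wanted = {text[i:i + length]
              for length in range(1, k + 1)
              for i in range(n - length + 1)}
    covered = set()
    for j in positions:
        covered |= straddling_substrings(text, j, k)
    return covered == wanted

text, k = "abracadabra", 3
bound = k * (k + 1) // 2
print(all(len(straddling_substrings(text, j, k)) <= bound
          for j in range(len(text))))
```

The same per-position sets give the simple set-cover view mentioned in the abstract: a set of positions is an attractor exactly when the union of its sets contains every distinct substring of length at most k.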

###### Proof of Lemma 2.1.

Clearly, can be computed in linear time using the suffix tree of , as this set contains suffix tree edges at string depth at most .

The set can be computed considering all suffix tree nodes (explicit or implicit) at string depth , extracting the leftmost occurrence in their induced sub-tree (i.e. an occurrence of the string of length read from the root to the node), and adding . This task takes linear time once the suffix tree of is built.

Let return the suffix tree edge reached following the path labeled from the suffix tree root. This edge can be found in constant time using the optimal structure for weighted ancestors queries on the suffix tree described in [11] (which can be built beforehand in time and space). Let moreover be the parent edge of , i.e. and for some suffix tree nodes . We implement using hashing, so insert and membership operations take expected constant time.

To build we proceed as follows. For every , and for :

1. Find edge .

2. If and , then insert in . Otherwise, go back to step 1 with the next value of .

3. . Repeat from step 2.

Correctness. It is easy to see that we only insert in edges such that marks (because we check that ), so the algorithm is correct.

Completeness. We now show that if marks , then we insert in . Assume that marks . Consider the (unique) occurrence overlapping in its leftmost position (i.e. and is minimized; note that there could be multiple occurrences of overlapping ). Consider the moment when we compute at step 1. We now show that, for each edge on the path from to , it must hold . This will prove the property, since then we insert all these edges (including ) in . Assume, by contradiction, that for some on the path from to . This means that, at an earlier execution of step 1, we have already considered some , with , prefixed by (note that it must be the case that since we consider values of in decreasing order). But then, since prefixes , it also prefixes , i.e. and . This contradicts the way we defined (i.e. is minimized).

Complexity. Overall, in step 1 we call times function (constant time per call). Then, in steps 2 and 3 we only spend (constant) time whenever we insert a new edge in (since we check before inserting). Overall, our algorithm runs in time. ∎

###### Proof of Lemma 3.1.

In [2, Thm. 3] (see also [1]) the authors show how to enumerate the following information for each right-maximal substring of in time and bits of space: and the suffix array range of the string , for all such that is a substring of . Since is right-maximal, those are equal to our strings (for every edge ). It follows that our problem is solved by outputting all and returned by the algorithm in [2, Thm. 3]. ∎

###### Proof of Theorem 3.1.

The idea is to use the variant of Theorem 3.1 based on the (uncompressed) suffix tree together with the optimal structure for weighted ancestors queries on the suffix tree described in [11]. Weighted ancestors on the suffix tree can be used to find the suffix tree edge where string ends (starting from the root) in constant time given as input the range  [11]. Reporting an arbitrary number of occurrences of straddling a position in corresponds to solving a three-sided orthogonal range reporting query: we find the minimum in and check if . If the answer is yes, we output