Minimal Suffix and Rotation of a Substring in Optimal Time

This work is supported by Polish budget funds for science in 2013-2017 as a research project under the ‘Diamond Grant’ program.


Abstract

For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by product of these time complexities. Next, we extend our queries to support concatenations of substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing.

Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings.

1 Introduction

Lyndon words, as well as the inherently linked concepts of the lexicographically minimal suffix and the lexicographically minimal rotation of a string, are among the most successful concepts in combinatorics on words. Introduced by Lyndon [28] in the context of Lie algebras, they are widely used in algebra and combinatorics. They also have surprising algorithmic applications, including ones related to constant-space pattern matching [14], maximal repetitions [7], and the shortest common superstring problem [30].

The central combinatorial property of Lyndon words, proved by Chen et al. [9], states that every string can be uniquely decomposed into a non-increasing sequence of Lyndon words. Duval [16] devised a simple algorithm computing the Lyndon factorization in linear time and constant space. He also observed that the same algorithm can be used to determine the lexicographically minimal and maximal suffix, as well as the lexicographically minimal and maximal rotation of a given string.

The first two algorithms are actually on-line procedures: in linear time they allow computing the minimal and maximal suffix of every prefix of a given string. For rotations such a procedure was later introduced by Apostolico and Crochemore [3]. Both these solutions lead to optimal quadratic-time algorithms computing the minimal and maximal suffixes and rotations for all substrings of a given string. Our main results are the data-structure versions of these problems: we preprocess a given text to answer the following queries:

For both problems we obtain optimal solutions with linear construction time and constant query time. For Minimal Suffix Queries this improves upon the results of Babenko et al. [5], who developed a trade-off solution, which for a text of length has product of preprocessing and query time. We are not aware of any results for Minimal Rotation Queries except for a data structure only testing cyclic equivalence of two subwords [26]. It allows constant-time queries after randomized preprocessing running in expected linear time.

An optimal solution for the Maximal Suffix Queries was already obtained in [5], while the Maximal Rotation Queries are equivalent to Minimal Rotation Queries subject to alphabet reversal. Hence, we do not focus on the maximization variants of our problems.

Using an auxiliary result devised to handle Minimal Rotation Queries, we also develop a data structure answering in time the following generalized queries:

All our algorithms are deterministic procedures for the standard word RAM model with machine words of size  [19]. The alphabet is assumed to be where , so that all letters of the input text can be sorted in linear time.

Applications. The last factor of the Lyndon factorization of a string is its minimal suffix. As noted in [5], this can be used to reduce computing the factorization of a substring to Minimal Suffix Queries in . Hence, our data structure determines the factorization in the optimal time. If is a concatenation of substrings, this increases to time (which we did not attempt to optimize in this paper).
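The reduction can be illustrated with a brute-force sketch: since the last Lyndon factor equals the minimal suffix, stripping minimal suffixes from the right recovers the whole factorization. The helper names `min_suffix` and `factorize_by_min_suffix` are hypothetical, and the quadratic-time `min_suffix` merely stands in for a constant-time Minimal Suffix Query; this naive version issues one call per factor.

```python
def min_suffix(s):
    """Lexicographically smallest non-empty suffix of s (brute force)."""
    return min(s[i:] for i in range(len(s)))


def factorize_by_min_suffix(s):
    """Lyndon factorization obtained by repeatedly stripping the minimal suffix."""
    factors = []
    while s:
        suf = min_suffix(s)              # last factor of the remaining prefix
        factors.append(suf)
        s = s[:len(s) - len(suf)]        # continue with what is left
    factors.reverse()                    # factors were collected right to left
    return factors


print(factorize_by_min_suffix("banana"))
```

For the string banana the factors come out as b, an, an, a, a non-increasing sequence whose last element is the minimal suffix.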

The primary use of Minimal Rotation Queries is canonization of substrings, i.e., classifying them according to cyclic equivalence (conjugacy); see [3]. As a proof-of-concept application of this natural tool, we propose counting distinct substrings with a given exponent.

Related work. Our work falls into the class of substring queries: data structure problems that solve basic stringology problems for substrings of a preprocessed text. This line of research, implicitly initiated by substring equality and longest common prefix queries (using suffix trees and suffix arrays; see [11]), now includes several problems related to compression [10, 24, 26, 6], pattern matching [26], and the range longest common prefix problem [1, 31, 2]. Closest to ours is a result by Babenko et al. [6], which after -expected-time preprocessing allows determining the -th smallest suffix of a given substring, as well as finding the lexicographic rank of one substring among suffixes of another substring, both in logarithmic time.

Outline of the paper. In Section 2 we recall standard definitions and two well-known data structures. Next, in Section 3, we study combinatorics of minimal suffixes, using in particular a notion of significant suffixes, introduced by I et al. [21, 22] to compute Lyndon factorizations of grammar-compressed strings. Section 4 is devoted to answering Minimal Suffix Queries. We use fusion trees by Pătraşcu and Thorup [32] to improve the query time from logarithmic to , and then, by preprocessing short strings, we achieve constant query time. That final step uses a notion of order-isomorphism [27, 25] to reduce the number of precomputed values. In Section 5 we repeat the same steps for Generalized Minimal Suffix Queries. We conclude with Section 6, where we briefly discuss the applications.

2 Preliminaries

We consider strings over an alphabet with the natural order . The empty string is denoted as . By () we denote the set of all (resp. non-empty) finite strings over . We also define as the set of infinite strings over . We extend the order on in the standard way to the lexicographic order on .

Let be a string in . We call the length of and denote it by . For , a string is called a substring of . By we denote the occurrence of at position , called a fragment of . A fragment of other than the whole is called a proper fragment of . A fragment starting at position is called a prefix of and a fragment ending at position is called a suffix of . We use abbreviated notation and for a prefix and a suffix of , respectively. A border of is a substring of which occurs both as a prefix and as a suffix of . An integer , , is a period of if for . If has period , we also say that has exponent . Note that is a period of if and only if has a border of length .
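The correspondence between periods and borders can be checked with a small brute-force sketch. The function names are hypothetical, indexing is 0-based here, and the sketch assumes that the empty string and the whole string count as (trivial) borders, which the definitions above may or may not intend.

```python
def borders(w):
    """Lengths b such that the prefix and the suffix of w of length b coincide."""
    n = len(w)
    return [b for b in range(n + 1) if w[:b] == w[n - b:]]


def periods(w):
    """All p in 1..len(w) with w[i] == w[i + p] whenever both positions exist."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


w = "abaab"
# Every border shorter than w corresponds to a period, and vice versa.
assert sorted(len(w) - b for b in borders(w) if b < len(w)) == periods(w)
print(borders(w), periods(w))
```

For abaab the only non-trivial border is ab, which corresponds to the period 3.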

We say that a string is a rotation (cyclic shift, conjugate) of a string if there exists a decomposition such that . Here, is the left rotation of by characters and the right rotation of by characters.
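As a minimal illustration of these notions, the following brute-force sketch enumerates rotations and uses the lexicographically minimal one as a canonical representative of a conjugacy class. The function names are hypothetical, and the quadratic-time enumeration is only a stand-in for the efficient algorithms discussed in this paper.

```python
def rotations(w):
    """All left rotations of w: the prefix of length i is moved to the end."""
    return [w[i:] + w[:i] for i in range(len(w))] if w else [""]


def min_rotation(w):
    """Lexicographically minimal rotation of w (brute force)."""
    return min(rotations(w))


def cyclically_equivalent(u, v):
    """Two strings are conjugate iff their minimal rotations coincide."""
    return len(u) == len(v) and min_rotation(u) == min_rotation(v)


print(min_rotation("baca"))
print(cyclically_equivalent("abac", "caba"))
```

Comparing canonical representatives in this way is the kind of canonization that minimal rotations are typically used for.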

Enhanced suffix array. The suffix array [29] of a text of length is a permutation of defining the lexicographic order on suffixes : if and only if . For a string , both and its inverse permutation take space and can be computed in time; see e.g. [11]. Typically, one also builds the table and extends it with a data structure for range minimum queries [20, 8], so that the longest common prefix of any two suffixes of can be determined efficiently.
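The components just described can be sketched as follows. This is a naive illustration rather than the linear-time construction referenced above: the suffix array is built by direct sorting, the LCP table by the well-known Kasai-style scan, and a plain minimum over a slice replaces the range minimum query structure. All names are hypothetical and positions are 0-based.

```python
def suffix_array(t):
    """Start positions of the suffixes of t in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp_array(t, sa):
    """lcp[r] = length of the common prefix of the suffixes at sa[r-1] and sa[r]."""
    n = len(t)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]              # predecessor of suffix i in sorted order
        while i + h < n and j + h < n and t[i + h] == t[j + h]:
            h += 1
        lcp[rank[i]] = h
        h = max(h - 1, 0)                # the next suffix shares at least h - 1
    return lcp


def lcp_of_suffixes(t, sa, lcp, i, j):
    """Longest common prefix of t[i:] and t[j:] via a minimum over the LCP table."""
    if i == j:
        return len(t) - i
    rank = {pos: r for r, pos in enumerate(sa)}
    lo, hi = sorted((rank[i], rank[j]))
    return min(lcp[lo + 1:hi + 1])       # an RMQ structure would answer this in O(1)


t = "banana"
sa = suffix_array(t)
lcp = lcp_array(t, sa)
print(sa, lcp, lcp_of_suffixes(t, sa, lcp, 1, 3))
```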

Similarly to [5], we also construct these components for the reversed text . Additionally, we preprocess the table to answer range minimum and maximum queries. The resulting data structure, which we call the enhanced suffix array of , supports the queries summarized below.

Theorem 2.1 (Enhanced suffix array; see Fact 3 and Lemma 4 in [5]).

The enhanced suffix array of a text of length  takes space, can be constructed in time, and allows answering the following queries in time given fragments , of :

  1. determine if , , or ,

  2. compute the longest common prefix and the longest common suffix ,

  3. compute and determine if , , or .

Moreover, given indices , it can compute in time the minimal and the maximal suffix among .
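The last query amounts to finding the positions of the minimum and the maximum of the inverse suffix array within a range. A minimal sketch, with hypothetical names, 0-based positions, and a linear scan in place of the constant-time range minimum and maximum structures:

```python
def inverse_suffix_array(t):
    """isa[k] = lexicographic rank of the suffix t[k:] among all suffixes of t."""
    sa = sorted(range(len(t)), key=lambda k: t[k:])
    isa = [0] * len(t)
    for r, k in enumerate(sa):
        isa[k] = r
    return isa


def extreme_suffixes(isa, i, j):
    """Start positions in [i, j] of the minimal and the maximal suffix of the text."""
    ks = range(i, j + 1)
    return min(ks, key=isa.__getitem__), max(ks, key=isa.__getitem__)


isa = inverse_suffix_array("banana")
print(extreme_suffixes(isa, 1, 4))
```

Note that this compares suffixes of the whole text, not suffixes of a fragment. For the text aab and the range of positions 0 to 1, the minimal text suffix starts at position 0, whereas the minimal non-empty suffix of the fragment aa is its last letter.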

Fusion trees. Consider a set of -bit integers (recall that is the machine word size). Rank queries given a -bit integer return defined as . Similarly, select queries given an integer , , return , the -th smallest element in , i.e., such that . These queries can be used to determine the predecessor and the successor of a -bit integer , i.e., and . We answer these queries with dynamic fusion trees by Pătraşcu and Thorup [32]. We only use these trees in a static setting, but the original static fusion trees by Fredman and Willard [17] do not have an efficient construction procedure.

Theorem 2.2 (Fusion trees [32, 17]).

There exists a data structure of size which answers , , , and queries in time. Moreover, it can be constructed in time.
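The interface of these queries can be mimicked with a sorted list and binary search. This sketch only illustrates the semantics, not a fusion tree: queries take logarithmic time, and the conventions used (rank counts strictly smaller elements, select is 0-based, the predecessor is strict and the successor is non-strict) are assumptions chosen so that rank(select(r)) = r. The class name `SortedSet` is hypothetical.

```python
from bisect import bisect_left


class SortedSet:
    """Static integer set supporting rank, select, pred and succ (binary search)."""

    def __init__(self, items):
        self.a = sorted(set(items))

    def rank(self, x):
        """Number of elements strictly smaller than x."""
        return bisect_left(self.a, x)

    def select(self, r):
        """The r-th smallest element, for 0 <= r < len(self.a)."""
        return self.a[r]

    def pred(self, x):
        """Largest element smaller than x, or None if there is none."""
        r = self.rank(x)
        return self.a[r - 1] if r > 0 else None

    def succ(self, x):
        """Smallest element greater than or equal to x, or None if there is none."""
        r = self.rank(x)
        return self.a[r] if r < len(self.a) else None


A = SortedSet([5, 1, 9])
print(A.rank(5), A.select(1), A.pred(5), A.succ(6))
```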

3 Combinatorics of minimal suffixes and Lyndon words

For a non-empty string the minimal suffix is the lexicographically smallest non-empty suffix of . Similarly, for an arbitrary string the maximal suffix is the lexicographically largest suffix of . We extend these notions as follows: for a pair of strings we define and as the lexicographically smallest (resp. largest) string such that is a (possibly empty) suffix of .
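These definitions translate directly into brute-force code. In the sketch below the function names are hypothetical; the one-argument variants follow the first two sentences (only non-empty suffixes for the minimum), and the two-argument variants range over all suffixes of the first string, including the empty one.

```python
def min_suf(v):
    """Lexicographically smallest non-empty suffix of a non-empty string v."""
    return min(v[i:] for i in range(len(v)))


def max_suf(v):
    """Lexicographically largest suffix of v."""
    return max(v[i:] for i in range(len(v) + 1))


def min_suf_pair(v, w):
    """Smallest string u + w, where u is a (possibly empty) suffix of v."""
    return min(v[i:] + w for i in range(len(v) + 1))


def max_suf_pair(v, w):
    """Largest string u + w, where u is a (possibly empty) suffix of v."""
    return max(v[i:] + w for i in range(len(v) + 1))


print(min_suf("banana"), max_suf("banana"))
print(min_suf_pair("ba", "a"), min_suf("ba") + "a")
```

The last line shows that the generalized notion is not simply the minimal suffix followed by the second string: for the pair (ba, a) the empty suffix wins, giving a rather than aa.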

In order to relate minimal and maximal suffixes, we introduce the reverse order on and extend it to the reverse lexicographic order, and an auxiliary symbol . We extend the order on so that (and thus ) for every . We define , but unless otherwise stated, we still assume that the strings considered belong to .

Observation 3.1.

If , then if and only if .

We use and to denote the minimal (resp. maximal) suffix with respect to . The following observation relates the notions we introduced:

Observation 3.2.
  (a) for every ,

  (b) for every and ,

  (c) for every and ,

  (d) for every ,

  (e) for every .

A property seemingly similar to (e) is false for every : .

A notion deeply related to minimal and maximal suffixes is that of a Lyndon word [28, 9]. A string is called a Lyndon word if . Note that such does not have proper borders, since a border would be a non-empty suffix smaller than . A Lyndon factorization of a string is a representation , where are Lyndon words such that . Every non-empty word has a unique Lyndon factorization [9], which can be computed in linear time and constant space [16]. The following result provides a characterization of the Lyndon factorization of a concatenation of two strings:
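For concreteness, here is a sketch of the standard linear-time, constant-extra-space factorization algorithm commonly attributed to Duval, together with a brute-force check of the Lyndon property. The function names are hypothetical; `is_lyndon` tests that a non-empty word is strictly smaller than each of its proper non-empty suffixes, which is the usual equivalent of being one's own minimal suffix.

```python
def lyndon_factorization(s):
    """Factors of the Lyndon factorization of s, from left to right."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it stays a power of a Lyndon word
        # followed by a proper prefix of that word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Output the complete copies of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def is_lyndon(w):
    """True iff w is non-empty and smaller than all its proper non-empty suffixes."""
    return bool(w) and all(w < w[i:] for i in range(1, len(w)))


f = lyndon_factorization("banana")
assert all(is_lyndon(x) for x in f)
assert all(a >= b for a, b in zip(f, f[1:]))   # non-increasing sequence
print(f)
```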

Lemma 3.3 ([4, 15]).

Let and be Lyndon factorizations. Then the Lyndon factorization of is for integers and a Lyndon word such that , , and .

Next, we prove another simple yet useful property of Lyndon words:

Fact 3.4.

Let be strings such that is a Lyndon word. If , then .

Proof.

For a proof by contradiction suppose that . Let , where is not a prefix of . Note that as must be a prefix of . Because , we have . On the other hand, is a Lyndon word, so . Consequently, . Since , must be a prefix of , which contradicts the definition of . ∎

3.1 Significant suffixes

Below we recall a notion of significant suffixes, introduced by I et al. [21, 22] in order to compute Lyndon factorizations of grammar-compressed strings. Then, we state combinatorial properties of significant suffixes; some of them are novel and some were proved in [22].

Definition 3.5 (see [21, 22]).

A suffix of a string is a significant suffix of if for some .

Let be the Lyndon factorization of a string . For we denote ; moreover, we assume . Let be the smallest index such that is a prefix of for . Observe that , since is a prefix of . We define so that , and we set . Note that . We also denote , , and . The observation below lists several immediate properties of the introduced strings:

Observation 3.6.

For each , : (a) , (b) is a suffix of of length , and (c) . In particular, .

The following lemma shows that is equal to the set of significant suffixes of . (Significant suffixes are actually defined in [22] as and only later proved to satisfy our Definition 3.5.) In fact, the lemma is much deeper; in particular, the formula for is one of the key ingredients of our efficient algorithms answering Minimal Suffix Queries.

Lemma 3.7 (I et al. [22], Lemmas 12–14).

For a string let , , , and , be defined as above. Then Moreover, for every string we have

In other words, where .

We apply Lemma 3.7 to deduce several properties of the set of significant suffixes.

Corollary 3.8.

For every string :

  (a) the largest suffix in is and ,

  (b) if is a suffix of such that , then .

Proof.

To prove (a), observe that , so . Consequently, Lemma 3.7 states that . However, we have by Observation 3.2(e), and thus . Uniqueness of the Lyndon factorization implies that is the Lyndon factorization of , and hence by definition of we have .

For a proof of (b), we shall show that for the string is a significant suffix of . Note that, by Observation 3.6, is a suffix of , since . The suffix is clearly a significant suffix of , so we assume . By Lemma 3.7, one can choose (setting ) so that . However, this also implies because all suffixes of are suffixes of . Consequently, is a significant suffix of , as claimed. ∎

Below we provide a precise characterization of for in terms of and . This is another key ingredient of our data structure, in particular letting us efficiently compute significant suffixes of a given fragment of .

Lemma 3.9.

Let be strings such that . Also, let , , and let be the longest suffix in which is a prefix of . Then

Consequently, for every , we have .

Proof.

Observation 3.2 yields . By Corollary 3.8(a) this is equivalent to . Consequently, if , then and Corollary 3.8(a) implies , as claimed.

Thus, we may assume that , and in particular that . Let be the longest suffix in (). By Corollary 3.8(b), . Lemma 3.3 and the definition of the set in terms of the Lyndon factorization yield that the inclusion above is actually an equality. Moreover, the definition also implies that is a prefix of , and thus . If , this already proves our statement, so in the remainder of the proof we assume .

First, let us suppose that . We shall prove that and is a period of . Let be a string such that . Note that is a border of  (as is a border of ), so is also a border of (because is a prefix of , which is a prefix of ). Moreover, by definition of the set, must be a power of a Lyndon word. Lyndon words do not have proper borders, so any border of must be a power of the same Lyndon word. Thus, is a power of . As is a Lyndon word and a prefix of , this means that . Consequently, since . What is more, as is a prefix of , we conclude that is a period of . Therefore, is also a period of .

It remains to prove that implies that is not a period of . For a proof by contradiction suppose that both and is a period of . Let us define so that . As is a period of and contained in , we conclude that is a substring of , and consequently is also a period of and hence a period of as well. However, by definition of the set, is a power of a Lyndon word whose length exceeds and thus also . This Lyndon word cannot have a proper border, and such a border is induced by period , a contradiction.

Finally, observe that the second claim easily follows from . ∎

We conclude with two combinatorial lemmas, useful in determining for . The first of them is also applied later in Section 5.

Lemma 3.10.

Let and be strings such that and the longest common prefix of and is not a proper substring of . Also, let . If , then .

Proof.

Due to the characterization in Lemma 3.7, we may equivalently prove that is or . Clearly, , so it suffices to show that . This is clear if , so we assume .

This assumption in particular yields that consists of proper substrings of , and thus by the condition on the longest common prefix of and . However, the inequality in Lemma 3.7 implies

This concludes the proof. ∎

Lemma 3.11.

Let , be the Lyndon factorization of , and let . If for some and we have , then for every non-empty suffix of satisfying .

Proof.

Let be a string such that . First, suppose that . In this case is a proper suffix of a Lyndon word , and thus and, moreover, . Thus, we may assume that .

Let and let be a string such that . Observe that it suffices to prove that , which implies that for . If there is nothing to prove, so we shall assume . Note that we have the Lyndon factorization with or . By Lemma 3.7, implies and is equivalent to (if ) or (if ). We have

as claimed. If , this already concludes the proof, and thus we may assume that . By definition of the Lyndon factorization we have , and by Fact 3.4 this implies . Hence, , which concludes the proof. ∎

4 Answering Minimal Suffix Queries

In this section we present our data structure for Minimal Suffix Queries. We proceed in three steps improving the query time from via to . The first solution is an immediate application of Observation 3.2(c) and the notion of significant suffixes. Efficient computation of these suffixes, also used in the construction of further versions of our data structure, is based on Lemma 3.9, which yields a recursive procedure. The only “new” suffix needed at each step is determined using the following result, which can be seen as a cleaner formulation of Lemma 14 in [5].

Lemma 4.1.

Let and be fragments of such that . Using the enhanced suffix array of we can compute in time.

Proof.

Let and note that, by Observation 3.2(d), . Let us focus on determining the latter value. The enhanced suffix array lets us compute an index , , which minimizes . Equivalently, we have . Consequently, for some . Since , is not a proper substring of , and by Lemma 3.10, we have (if , then ).

Thus, we shall generate a suffix of equal to if , and return the better of the two candidates for . If , we must have and there is nothing to do. Hence, let us assume . By Lemma 3.11, if we compute an index , , which minimizes , we shall have provided that . Now, can be generated as the largest integer such that is a suffix of , and we have , which lets us determine . ∎

Lemma 4.2.

Given a fragment of , we can compute in time using the enhanced suffix array of .

Proof.

If , we return . Otherwise, we decompose so that . We recursively generate and use Lemma 4.1 to compute . Then, we apply the characterization of Lemma 3.9 to determine , using the enhanced suffix array (Theorem 2.1) to lexicographically compare fragments of .

We store the lengths of the significant suffixes in an ordered list. This way we can implement a single phase (excluding the recursive calls) in time proportional to plus the number of suffixes removed from to obtain . Since this is amortized constant time, the total running time becomes as announced. ∎

Corollary 4.3.

Minimal Suffix Queries can be answered in time using the enhanced suffix array of .

Proof.

Recall that Observation 3.2(c) yields where . Consequently, for some . We apply Lemma 4.2 to compute and determine the answer among candidates using lexicographic comparison of fragments, provided by the enhanced suffix array (Theorem 2.1). ∎

4.1 -time Minimal Suffix Queries

An alternative -time algorithm could be developed based just on the second part of Lemma 3.9: decompose so that and return . Correctness follows from Lemma 3.9 and Observation 3.2(c). Here, the first candidate is determined via Lemma 4.1, while the second one is obtained by a recursive call. A way to improve query time to at the price of -time preprocessing is to precompute the answers for basic fragments, i.e., fragments whose length is a power of two. Then, in order to determine , we perform just a single step of the aforementioned procedure, making sure that is a basic fragment. Both these ideas are actually present in [5], along with a smooth trade-off between their preprocessing and query times.

Our -time query algorithm combines recursion with preprocessing for certain distinguished fragments. More precisely, we say that is distinguished if both and for some positive integer , where . Note that the number of distinguished fragments of length is at most .

The query algorithm is based on the following decomposition ( for ):

Fact 4.4.

Given a fragment such that , we can in constant time decompose such that , is distinguished, and .

Proof.

Let and . We determine as the largest integer strictly smaller than divisible by . By the assumption that , we conclude that . We define and partition so that is the largest possible power of two. This guarantees . Moreover, assures that , so , and therefore is indeed distinguished. ∎

Observation 3.2(b) implies that . Lemma 3.9 further yields . In other words, it leaves us with three candidates for . Our query algorithm obtains using Lemma 4.1, computes recursively, and determines through the characterization of Lemma 3.7. The latter step is performed using the following component based on a fusion tree, which we build for all distinguished fragments.

Lemma 4.5.

Let be a fragment of . There exists a data structure of size which answers the following queries in time: given a position compute . Moreover, this data structure can be constructed in time using the enhanced suffix array of .

Proof.

By Lemma 3.7, we have , so in order to determine , it suffices to store and efficiently compute given . We shall reduce these queries to queries in an integer set .

Claim.

Denote and let

For every index , , we have

Proof.

We shall prove that for each ,