Minimal Suffix and Rotation of a Substring in Optimal Time

This work is supported by Polish budget funds for science in 2013-2017 as a research project under the Diamond Grant program.

Abstract

For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by product of these time complexities. Next, we extend our queries to support concatenations of substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing.

Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings.

1 Introduction

Lyndon words, as well as the inherently linked concepts of the lexicographically minimal suffix and the lexicographically minimal rotation of a string, are among the most successful concepts of combinatorics on words. Introduced by Lyndon [28] in the context of Lie algebras, they are widely used in algebra and combinatorics. They also have surprising algorithmic applications, including ones related to constant-space pattern matching [14], maximal repetitions [7], and the shortest common superstring problem [30].

The central combinatorial property of Lyndon words, proved by Chen et al. [9], states that every string can be uniquely decomposed into a non-increasing sequence of Lyndon words. Duval [16] devised a simple algorithm computing the Lyndon factorization in linear time and constant space. He also observed that the same algorithm can be used to determine the lexicographically minimal and maximal suffix, as well as the lexicographically minimal and maximal rotation of a given string.
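Duval's algorithm can be sketched as follows. This is a minimal illustration of the standard textbook formulation, not code taken from [16]; the function name `lyndon_factorization` is chosen here for illustration.

```python
def lyndon_factorization(s):
    """Return the Lyndon factorization of s as a list of factors.

    Standard formulation of Duval's algorithm: linear time, and constant
    extra space apart from the output list.
    """
    factors = []
    n = len(s)
    i = 0
    while i < n:
        # s[i:j] is a prefix of a power of a Lyndon word of length j - k.
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i          # the whole s[i:j+1] is a Lyndon word again
            else:
                k += 1         # the current period continues
            j += 1
        # Emit all complete copies of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

For example, `lyndon_factorization("banana")` yields the factors `b`, `an`, `an`, `a`, a non-increasing sequence of Lyndon words whose last element is the minimal suffix of the input.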

The first two algorithms are actually on-line procedures: in linear time they allow computing the minimal and maximal suffix of every prefix of a given string. For rotations such a procedure was later introduced by Apostolico and Crochemore [3]. Both solutions lead to optimal quadratic-time algorithms computing the minimal and maximal suffixes and rotations of all substrings of a given string. Our main results are the data-structure versions of these problems: we preprocess a given text to answer the following queries:

For both problems we obtain optimal solutions with linear construction time and constant query time. For Minimal Suffix Queries this improves upon the results of Babenko et al. [5], who developed a trade-off solution, which for a text of length has product of preprocessing and query time. We are not aware of any results for Minimal Rotation Queries except for a data structure only testing cyclic equivalence of two subwords [26]. It allows constant-time queries after randomized preprocessing running in expected linear time.

An optimal solution for the Maximal Suffix Queries was already obtained in [5], while the Maximal Rotation Queries are equivalent to Minimal Rotation Queries subject to alphabet reversal. Hence, we do not focus on the maximization variants of our problems.

Using an auxiliary result devised to handle Minimal Rotation Queries, we also develop a data structure answering in time the following generalized queries:

All our algorithms are deterministic procedures for the standard word RAM model with machine words of size [19]. The alphabet is assumed to be where , so that all letters of the input text can be sorted in linear time.

Applications. The last factor of the Lyndon factorization of a string is its minimal suffix. As noted in [5], this can be used to reduce computing the factorization of a substring to Minimal Suffix Queries in . Hence, our data structure determines the factorization in the optimal time. If is a concatenation of substrings, this increases to time (which we did not attempt to optimize in this paper).
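The reduction from factorization to minimal suffix queries can be illustrated by repeatedly stripping the minimal suffix. The sketch below is an assumption-laden illustration: `minimal_suffix_start` is a naive stand-in for the data structure (it compares suffixes directly), fragments are written as half-open ranges `text[i:j]`, and one query is issued per factor, without any of the optimizations of [5].

```python
def minimal_suffix_start(text, i, j):
    """Naive stand-in for a Minimal Suffix Query on the fragment text[i:j].

    Returns the starting position of the minimal non-empty suffix.
    """
    return min(range(i, j), key=lambda k: text[k:j])


def factorize_fragment(text, i, j):
    """Lyndon factorization of text[i:j] as a list of half-open ranges."""
    factors = []
    while j > i:
        k = minimal_suffix_start(text, i, j)  # last factor is the minimal suffix
        factors.append((k, j))
        j = k                                 # continue with the remaining prefix
    factors.reverse()
    return factors
```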

The primary use of Minimal Rotation Queries is canonization of substrings, i.e., classifying them according to cyclic equivalence (conjugacy); see [3]. As a proof-of-concept application of this natural tool, we propose counting distinct substrings with a given exponent.

Related work. Our work falls in a class of substring queries: data structure problems solving basic stringology problems for substrings of a preprocessed text. This line of research, implicitly initiated by substring equality and longest common prefix queries (using suffix trees and suffix arrays; see [11]), now includes several problems related to compression [10], pattern matching [26], and the range longest common prefix problem [1]. Closest to ours is a result by Babenko et al. [6], which after -expected-time preprocessing allows determining the -th smallest suffix of a given substring, as well as finding the lexicographic rank of one substring among suffixes of another substring, both in logarithmic time.

Outline of the paper. In Section 2 we recall standard definitions and two well-known data structures. Next, in Section 3, we study combinatorics of minimal suffixes, using in particular a notion of significant suffixes, introduced by I et al. [21] to compute Lyndon factorizations of grammar-compressed strings. Section 4 is devoted to answering Minimal Suffix Queries. We use fusion trees by Pătraşcu and Thorup [32] to improve the query time from logarithmic to , and then, by preprocessing short strings, we achieve constant query time. That final step uses a notion of order-isomorphism [27] to reduce the number of precomputed values. In Section 5 we repeat the same steps for Generalized Minimal Suffix Queries. We conclude with , where we briefly discuss the applications.

2 Preliminaries

We consider strings over an alphabet with the natural order . The empty string is denoted as . By () we denote the set of all (resp. non-empty) finite strings over . We also define as the set of infinite strings over . We extend the order on in the standard way to the lexicographic order on .

Let be a string in . We call the length of and denote it by . For , a string is called a substring of . By we denote the occurrence of at position , called a fragment of . A fragment of other than the whole is called a proper fragment of . A fragment starting at position is called a prefix of and a fragment ending at position is called a suffix of . We use abbreviated notation and for a prefix and a suffix of , respectively. A border of is a substring of which occurs both as a prefix and as a suffix of . An integer , , is a period of if for . If has period , we also say that has exponent . Note that is a period of if and only if has a border of length .
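The correspondence between periods and borders can be checked directly on small strings. The sketch below assumes the common conventions that periods range over 1, ..., |w| and that the empty string counts as a border; the function names are illustrative.

```python
def periods(w):
    """All p in 1..len(w) such that w[i] == w[i + p] wherever both are defined."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def border_lengths(w):
    """Lengths b < len(w) such that the prefix and suffix of length b coincide."""
    n = len(w)
    return [b for b in range(n) if w[:b] == w[n - b:]]


# p is a period of w exactly when w has a border of length len(w) - p.
w = "abaab"
assert sorted(len(w) - p for p in periods(w)) == border_lengths(w)
```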

We say that a string is a rotation (cyclic shift, conjugate) of a string if there exists a decomposition such that . Here, is the left rotation of by characters and the right rotation of by characters.
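As a reference point for rotations, the following naive sketch enumerates all rotations of a string, picks the lexicographically minimal one in quadratic time, and tests cyclic equivalence by searching in the doubled string. It only illustrates the definitions; the function names are chosen here and none of this reflects the data structures developed in the paper.

```python
def rotations(w):
    """All left rotations of w (the empty string has itself as its only rotation)."""
    return [w[k:] + w[:k] for k in range(len(w))] or [w]


def minimal_rotation(w):
    """Lexicographically minimal rotation, computed naively."""
    return min(rotations(w))


def cyclically_equivalent(u, v):
    """u and v are conjugate iff they have equal length and v occurs in uu."""
    return len(u) == len(v) and v in u + u
```

Two strings are cyclically equivalent exactly when their minimal rotations coincide, which is why the minimal rotation serves as a canonical representative.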

Enhanced suffix array. The suffix array [29] of a text of length is a permutation of defining the lexicographic order on suffixes : if and only if . For a string , both and its inverse permutation take space and can be computed in time; see e.g. [11]. Typically, one also builds the table and extends it with a data structure for range minimum queries [20], so that the longest common prefix of any two suffixes of can be determined efficiently.

Similarly to [5], we also construct these components for the reversed text . Additionally, we preprocess the table to answer range minimum and maximum queries. The resulting data structure, which we call the enhanced suffix array of , lets us perform many queries.
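The core components of the enhanced suffix array can be illustrated with a deliberately naive sketch. Assumptions: positions are 0-based, the suffix array is built by direct sorting (not in linear time), and the range minimum is computed by scanning rather than by a dedicated structure; all function names are illustrative.

```python
def suffix_array(t):
    """Starting positions of suffixes of t in lexicographic order (naive sort)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def inverse(sa):
    """isa[i] is the lexicographic rank of the suffix starting at position i."""
    isa = [0] * len(sa)
    for r, i in enumerate(sa):
        isa[i] = r
    return isa


def lcp_table(t, sa):
    """lcp[r] = longest common prefix of suffixes sa[r-1] and sa[r]; lcp[0] = 0."""
    def lcp(i, j):
        k = 0
        while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
            k += 1
        return k
    return [0] + [lcp(sa[r - 1], sa[r]) for r in range(1, len(sa))]


def lcp_query(t, isa, lcp, i, j):
    """Longest common prefix of suffixes i and j as a range minimum over lcp."""
    if i == j:
        return len(t) - i
    lo, hi = sorted((isa[i], isa[j]))
    return min(lcp[lo + 1:hi + 1])  # a real structure answers this in O(1)
```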

Fusion trees. Consider a set of -bit integers (recall that is the machine word size). Rank queries given a -bit integer return defined as . Similarly, select queries given an integer , , return , the -th smallest element in , i.e., such that . These queries can be used to determine the predecessor and the successor of a -bit integer , i.e., and . We answer these queries with dynamic fusion trees by Pătraşcu and Thorup [32]. We only use these trees in a static setting, but the original static fusion trees by Fredman and Willard [17] do not have an efficient construction procedure.
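The query interface can be illustrated with a sorted list and binary search standing in for the fusion tree. This substitute does not achieve the fusion-tree bounds; it only fixes the semantics. The conventions used (rank counts strictly smaller elements, select is 0-based, the predecessor is strict, the successor is non-strict) are assumptions of this sketch.

```python
from bisect import bisect_left


def rank(A, x):
    """Number of elements of the sorted list A that are strictly smaller than x."""
    return bisect_left(A, x)


def select(A, r):
    """The element of A with exactly r smaller elements (0-based, assumed)."""
    return A[r]


def pred(A, x):
    """Largest element of A strictly smaller than x, or None."""
    r = rank(A, x)
    return A[r - 1] if r > 0 else None


def succ(A, x):
    """Smallest element of A greater than or equal to x, or None."""
    r = rank(A, x)
    return A[r] if r < len(A) else None
```

Both `pred` and `succ` follow from a single rank computation combined with a select, which mirrors how the text derives them from rank and select queries.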

3 Combinatorics of minimal suffixes and Lyndon words

For a non-empty string the minimal suffix is the lexicographically smallest non-empty suffix of . Similarly, for an arbitrary string the maximal suffix is the lexicographically largest suffix of . We extend these notions as follows: for a pair of strings we define and as the lexicographically smallest (resp. largest) string such that is a (possibly empty) suffix of .
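These definitions can be made concrete with a brute-force sketch. It assumes the reading that the two-argument variants range over all strings of the form uw, where u is a possibly empty suffix of the first argument v; the function names are illustrative.

```python
def min_suf(v):
    """Lexicographically smallest non-empty suffix of a non-empty string v."""
    return min(v[k:] for k in range(len(v)))


def max_suf(v):
    """Lexicographically largest suffix of v (empty only when v is empty)."""
    return max(v[k:] for k in range(len(v) + 1))


def min_suf2(v, w):
    """Smallest string u + w over all (possibly empty) suffixes u of v."""
    return min(v[k:] + w for k in range(len(v) + 1))


def max_suf2(v, w):
    """Largest string u + w over all (possibly empty) suffixes u of v."""
    return max(v[k:] + w for k in range(len(v) + 1))
```

For instance, the candidates for `min_suf2("ab", "a")` are `aba`, `ba`, and `a`, so the empty suffix of the first argument wins.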

In order to relate minimal and maximal suffixes, we introduce the reverse order on and extend it to the reverse lexicographic order, and an auxiliary symbol . We extend the order on so that (and thus ) for every . We define , but unless otherwise stated, we still assume that the strings considered belong to .

We use and to denote the minimal (resp. maximal) suffix with respect to . The following observation relates the notions we introduced:

A property seemingly similar to ( ?) is false for every : .

A notion deeply related to minimal and maximal suffixes is that of a Lyndon word [28]. A string is called a Lyndon word if . Note that such does not have proper borders, since a border would be a non-empty suffix smaller than . A Lyndon factorization of a string is a representation , where are Lyndon words such that . Every non-empty word has a unique Lyndon factorization [9], which can be computed in linear time and constant space [16]. The following result provides a characterization of the Lyndon factorization of a concatenation of two strings:

Next, we prove another simple yet useful property of Lyndon words:

For a proof by contradiction suppose that . Let , where is not a prefix of . Note that as must be a prefix of . Because , we have . On the other hand, is a Lyndon word, so . Consequently, . Since , must be a prefix of , which contradicts the definition of .

3.1 Significant suffixes

Below we recall a notion of significant suffixes, introduced by I et al. [21] in order to compute Lyndon factorizations of grammar-compressed strings. Then, we state combinatorial properties of significant suffixes; some of them are novel and some were proved in [22].

Let be the Lyndon factorization of a string . For we denote ; moreover, we assume . Let be the smallest index such that is a prefix of for . Observe that , since is a prefix of . We define so that , and we set . Note that . We also denote , , and . The observation below lists several immediate properties of the introduced strings:

The following lemma shows that is equal to the set of significant suffixes of . (Significant suffixes are actually defined in [22] as and only later proved to satisfy our .) In fact, the lemma is much deeper; in particular, the formula for is one of the key ingredients of our efficient algorithms answering Minimal Suffix Queries.

We apply to deduce several properties of the set of significant suffixes.

To prove ( ?), observe that , so . Consequently, states that . However, we have by ( ?), and thus . Uniqueness of the Lyndon factorization implies that is the Lyndon factorization of , and hence by definition of we have .

For a proof of ( ?), we shall show that for the string is a significant suffix of . Note that, by , is a suffix of , since . The suffix is clearly a significant suffix of , so we assume . By , one can choose (setting ) so that . However, this also implies because all suffixes of are suffixes of . Consequently, is a significant suffix of , as claimed.

Below we provide a precise characterization of for in terms of and . This is another key ingredient of our data structure, in particular letting us efficiently compute significant suffixes of a given fragment of .

yields . By ( ?) this is equivalent to . Consequently, if , then and ( ?) implies , as claimed.

Thus, we may assume that , and in particular that . Let be the longest suffix in (). By ( ?), . and the definition of the set in terms of the Lyndon factorization yield that the inclusion above is actually an equality. Moreover, the definition also implies that is a prefix of , and thus . If , this already proves our statement, so in the remainder of the proof we assume .

First, let us suppose that . We shall prove that and is a period of . Let be a string such that . Note that is a border of (as is a border of ), so is also a border of (because is a prefix of , which is a prefix of ). Moreover, by definition of the set, must be a power of a Lyndon word. Lyndon words do not have proper borders, so any border of must be a power of the same Lyndon word. Thus, is a power of . As is a Lyndon word and a prefix of , this means that . Consequently, since . What is more, as is a prefix of , we conclude that is a period of . Therefore, is also a period of .

It remains to prove that implies that is not a period of . For a proof by contradiction suppose that both and is a period of . Let us define so that . As is a period of and contained in , we conclude that is a substring of , and consequently is also a period of and hence a period of as well. However, by definition of the set, is a power of a Lyndon word whose length exceeds and thus also . This Lyndon word cannot have a proper border, and such a border is induced by period , a contradiction.

Finally, observe that the second claim easily follows from .

We conclude with two combinatorial lemmas, useful in determining for . The first of them is also applied later in .

Due to the characterization in , we may equivalently prove that is or . Clearly, , so it suffices to show that . This is clear if , so we assume .

This assumption in particular yields that consists of proper substrings of , and thus by the condition on the longest common prefix of and . However, the inequality in implies

This concludes the proof.

Let be a string such that . First, suppose that . In this case is a proper suffix of a Lyndon word , and thus and, moreover, . Thus, we may assume that .

Let and let be a string such that . Observe that it suffices to prove that , which implies that for . If there is nothing to prove, so we shall assume . Note that we have the Lyndon factorization with or . By , implies and is equivalent to (if ) or (if ). We have

as claimed. If , this already concludes the proof, and thus we may assume that . By definition of the Lyndon factorization we have , and by this implies . Hence, , which concludes the proof.

4 Answering Minimal Suffix Queries

In this section we present our data structure for Minimal Suffix Queries. We proceed in three steps improving the query time from via to . The first solution is an immediate application of ( ?) and the notion of significant suffixes. Efficient computation of these suffixes, also used in the construction of further versions of our data structure, is based on , which yields a recursive procedure. The only “new” suffix needed at each step is determined using the following result, which can be seen as a cleaner formulation of Lemma 14 in [5].

Let and note that, by ( ?), . Let us focus on determining the latter value. The enhanced suffix array lets us compute an index , , which minimizes . Equivalently, we have . Consequently, for some . Since , is not a proper substring of , and by , we have (if , then ).

Thus, we shall generate a suffix of equal to if , and return the better of the two candidates for . If , we must have and there is nothing to do. Hence, let us assume . By , if we compute an index , , which minimizes , we shall have provided that . Now, can be generated as the largest integer such that is a suffix of , and we have , which lets us determine .

If , we return . Otherwise, we decompose so that . We recursively generate and use to compute . Then, we apply the characterization of to determine , using the enhanced suffix array () to lexicographically compare fragments of .

We store the lengths of the significant suffixes in an ordered list. This way we can implement a single phase (excluding the recursive calls) in time proportional to plus the number of suffixes removed from to obtain . Since this is amortized constant time, the total running time becomes as announced.

Recall that ( ?) yields where . Consequently, for some . We apply to compute and determine the answer among candidates using lexicographic comparison of fragments, provided by the enhanced suffix array ().

4.1 -time Minimal Suffix Queries

An alternative -time algorithm could be developed based just on the second part of : decompose so that and return . The result is due to and ( ?). Here, the first candidate is determined via , while the second one using a recursive call. A way to improve query time to at the price of -time preprocessing is to precompute the answers for basic fragments, i.e., fragments whose length is a power of two. Then, in order to determine , we perform just a single step of the aforementioned procedure, making sure that is a basic fragment. Both these ideas are actually present in [5], along with a smooth trade-off between their preprocessing and query times.

Our -time query algorithm combines recursion with preprocessing for certain distinguished fragments. More precisely, we say that is distinguished if both and for some positive integer , where . Note that the number of distinguished fragments of length is at most .

The query algorithm is based on the following decomposition ( for ):

Let and . We determine as the largest integer strictly smaller than divisible by . By the assumption that , we conclude that . We define and partition so that is the largest possible power of two. This guarantees . Moreover, assures that , so , and therefore is indeed distinguished.

( ?) implies that . further yields . In other words, it leaves us with three candidates for . Our query algorithm obtains using , computes recursively, and determines through the characterization of . The latter step is performed using the following component based on a fusion tree, which we build for all distinguished fragments.

By , we have , so in order to determine , it suffices to store and efficiently compute given . We shall reduce these queries to queries in an integer set .

We shall prove that for each , , we have

First, if , then clearly and both sides of the equivalence are false. Therefore, we may assume . Observe that in this case is strictly less than , and . Hence, if and only if , as claimed.

We apply to build a fusion tree for , so that the ranks can be obtained in time, which is by .

The construction algorithm uses to compute . Next, for each , , we need to determine . This is the same as and, by , can be retrieved as the suffix of of length . Hence, the enhanced suffix array can be used to compute these longest common prefixes and therefore to construct in time.

With this central component we are ready to give a full description of our data structure.

Our data structure consists of the enhanced suffix array () and the components of for all distinguished fragments of . Each such fragment of length contributes to the space consumption and to the construction time, which in total over all lengths sums up to .

Let us proceed to the query algorithm. Assume we are to compute the minimal suffix of a fragment . If (i.e., if ), we use the logarithmic-time query algorithm given in . If , we apply to determine a decomposition , which gives us three candidates for . As already described, is computed recursively, using , and using . The latter two both support constant-time queries, so the overall time complexity is proportional to the depth of the recursion. We have , so it terminates. Moreover,

Thus, unless . Consequently, unless , when the algorithm clearly needs constant time, the length of the queried fragment is in two steps reduced from to at most . This concludes the proof that the query time is .

4.2 -time Minimal Suffix Queries

The time complexity of the query algorithm of is only due to the recursion, which in a single step reduces the length of the queried fragment from to where . Since , after just two steps the fragment length does not exceed . In this section we show that the minimal suffixes of such short fragments can be precomputed in a certain sense, and thus after reaching we do not need to perform further recursive calls.

For constant alphabets, we could actually store all the answers for all strings of length up to . Nevertheless, in general all letters of , and consequently all fragments of , could even be distinct. However, the answers to Minimal Suffix Queries actually depend only on the relative order between letters, which is captured by order-isomorphism.

Two strings and are called order-isomorphic [27], denoted as , if and for every two positions () we have Note that the equivalence extends to arbitrary corresponding fragments of and , i.e., . Consequently, order-isomorphic strings cannot be distinguished using Minimal Suffix Queries or Generalized Minimal Suffix Queries.

Moreover, note that every string of length is order-isomorphic to a string over an alphabet . Consequently, order-isomorphism partitions strings of length up to into equivalence classes. The following fact lets us compute canonical representations of strings whose length is bounded by .

To compute , we first build a fusion tree storing all (distinct) letters which occur in . Next, we replace each character of with its rank among these letters. We allocate bits per character and prepend such a representation with bits encoding . This way is a sequence of bits. Using to build the fusion tree, we obtain an -time evaluation algorithm.
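The rank-replacement idea can be sketched without bit packing. In the illustration below a sorted list of the distinct letters replaces the fusion tree and a Python tuple replaces the packed bit sequence, so the running time differs from the one stated above; `order_isomorphic` is the naive quadratic check of the definition, and both names are illustrative.

```python
def order_isomorphic(x, y):
    """Naive check: equal lengths and identical outcomes of all comparisons."""
    n = len(x)
    return len(y) == n and all(
        (x[i] <= x[j]) == (y[i] <= y[j]) for i in range(n) for j in range(n)
    )


def order_id(x):
    """Canonical identifier: the length plus each letter's rank among the
    distinct letters of x (a tuple stands in for the packed bit encoding)."""
    letters = sorted(set(x))
    rank = {c: r for r, c in enumerate(letters)}
    return (len(x), tuple(rank[c] for c in x))
```

Two strings receive the same identifier exactly when they are order-isomorphic; for example, `banana` and `mcxcxc` both map to the rank sequence 1, 0, 2, 0, 2, 0.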

To answer queries for short fragments of , we define overlapping blocks of length : for we create a block . For each block we apply to compute the identifier . The total length of the blocks is bounded , so this takes time. The identifiers use bits of space.

Moreover, for each distinct identifier , we store the answers to all the Minimal Suffix Queries in . This takes bits per answer, and in total. Since , this is . The preprocessing time is also .
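The block scheme can be sketched end to end under explicit assumptions. Here `m` is a hypothetical length parameter, blocks of length `2 * m` start at every multiple of `m` (so any fragment of length at most `m` fits inside the block in which it starts), identifiers are tuples rather than packed integers, and the per-identifier tables are filled by brute force. None of these choices is claimed to match the exact parameters of the data structure.

```python
def order_id(x):
    """Length plus the rank of each letter among the distinct letters of x."""
    rank = {c: r for r, c in enumerate(sorted(set(x)))}
    return (len(x), tuple(rank[c] for c in x))


def build_short_fragment_index(text, m):
    """Return per-block identifiers and one answer table per distinct identifier."""
    block_ids, tables = [], {}
    for start in range(0, len(text), m):
        block = text[start:start + 2 * m]       # overlapping block (assumed layout)
        ident = order_id(block)
        block_ids.append(ident)
        if ident not in tables:                  # answers shared by isomorphic blocks
            size = len(block)
            tables[ident] = {
                (a, b): min(range(a, b), key=lambda k: block[k:b])
                for a in range(size) for b in range(a + 1, size + 1)
            }
    return block_ids, tables


def short_minimal_suffix(i, j, m, block_ids, tables):
    """Start of the minimal suffix of text[i:j], assuming 0 < j - i <= m."""
    b = i // m
    start = b * m                                # translate to block coordinates
    return start + tables[block_ids[b]][(i - start, j - start)]
```

The lookup only performs index arithmetic and a table access; the text itself is not consulted at query time, which is the point of storing answers per order-isomorphism class.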

It is a matter of simple arithmetic to extend a given fragment of , , to a block . We use the precomputed answers stored for to determine the minimal suffix of . We only need to translate the indices within to indices within before we return the answer. The following theorem summarizes our contribution for short fragments:

As noted at the beginning, this can be used to speed up queries for arbitrary fragments:

5 Answering Generalized Minimal Suffix Queries

In this section we develop our data structure for Generalized Minimal Suffix Queries. We start with preliminary definitions and then we describe the counterparts of the three data structures presented in . Their query times are , , and , respectively, i.e., there is an overhead compared to Minimal Suffix Queries.

We define a -fragment of a text as a concatenation of fragments of the text . Observe that a -fragment can be stored in space as a sequence of pairs . If a string admits such a decomposition using () substrings, we call it a -substring of . Every -fragment (with ) whose value is equal to is called an occurrence of as a -substring of . Observe that a substring of a -substring of is itself a -substring of . Moreover, given an occurrence of , one can canonically assign each fragment of to a -fragment of (). This can be implemented in time and referring to in our algorithms, we assume that such an operation is performed.
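The representation as a sequence of pairs, and the assignment of a fragment of the concatenation to a shorter sequence of pairs, can be sketched as follows. Half-open, 0-based ranges and the function names are assumptions of this illustration.

```python
def value(text, kfrag):
    """The string represented by a k-fragment given as a list of (start, end) pairs."""
    return "".join(text[i:j] for i, j in kfrag)


def sub_fragment(kfrag, a, b):
    """Pairs representing positions a..b-1 of the k-fragment's value.

    Runs in time proportional to the number of pairs; the result never has
    more pairs than the input.
    """
    out = []
    offset = 0                       # position of the current pair in the value
    for i, j in kfrag:
        lo = max(a, offset)
        hi = min(b, offset + (j - i))
        if lo < hi:                  # the pair overlaps the requested range
            out.append((i + lo - offset, i + hi - offset))
        offset += j - i
    return out
```

For the text `abracadabra` and the 2-fragment `[(0, 4), (7, 11)]` with value `abraabra`, the fragment at positions 2 to 5 is represented by `[(2, 4), (7, 9)]`.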

Basic queries regarding -fragments easily reduce to their counterparts for 1-fragments:

Generalized Minimal Suffix Queries can be reduced to the following auxiliary queries:

Let . By ( ?), or for some , , we have . Hence, we apply Auxiliary Minimal Suffix Queries to determine for each . ( ?) lets us reduce computing to another auxiliary query. Having obtained candidates for , we use the enhanced suffix array to return the smallest among them using comparisons, each performed in time; see .

We apply to determine , and then we compute the smallest string among . These strings are -fragments of and thus a single comparison takes time using the enhanced suffix array.

5.1 -time Auxiliary Minimal Suffix Queries

Our data structure closely follows its counterpart described in . We define distinguished fragments in the same manner and provide a recursive algorithm based on . However, for each distinguished fragment instead of applying , we build the following much stronger data structure. Its implementation is provided in Section ?.

If (), we use to compute in time. Otherwise, we apply to decompose so that is distinguished, , and , where . The characterization of again gives three candidates for : , , and . We determine the first using , the second using , while the third is computed recursively. The application of takes time, since is a -fragment of . We return the best of the three candidates using the enhanced suffix array to choose it in time. Since , the depth of the recursion is . This concludes the proof of the following result:

Rank queries in a collection of fragments

The crucial tool we use in the proof of is a data structure constructed for a collection of fragments of to support queries for arbitrary -fragments of . Since it heavily relies on the compressed trie of fragments in , we start by recalling several related concepts.

A trie is a rooted tree whose nodes correspond to prefixes of strings in a given family of strings . If is a node, the corresponding prefix is called the value of the node. The node whose value is is called the locus of .

The parent-child relation in the trie is defined so that the root is the locus of , while the parent of a node is the locus of the value of with the last character removed. This character is the label of the edge from to . In general, if is an ancestor of , then the label of the path from to is the concatenation of edge labels on the path.

A node is branching if it has at least two children and terminal if its value belongs to . A compressed trie is obtained from the underlying trie by dissolving all nodes except the root, branching nodes, and terminal nodes. Note that this way we compress paths of vertices with single children, and thus the number of remaining nodes becomes bounded by . In general, we refer to all preserved nodes of the trie as explicit (since they are stored explicitly) and to the dissolved ones as implicit. Edges of a compressed trie correspond to paths in the underlying tree and thus their labels are strings in . Typically, these labels are stored as references to fragments of the strings in .

Before we proceed with ranking a -fragment in a collection of fragments, let us prove that fusion trees make it relatively easy to rank a suffix in a collection of suffixes.

Let . We build a fusion tree storing and during a query for , we determine the predecessor and the successor of . We use the table to translate these integers into indices and . Since the order of coincides with the lexicographic order of suffixes , the suffixes and are the predecessor and the successor , respectively. These are the two candidates for maximizing . We perform two longest common prefix queries and return the candidate for which the obtained value is larger.
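The predecessor/successor argument can be sketched as follows. A sorted list with binary search stands in for the fusion tree, the suffix array is built naively, longest common prefixes are computed by direct comparison, and positions are 0-based; these are all simplifications of this illustration, and the function names are chosen here.

```python
from bisect import bisect_left


def build_suffix_collection(text, starts):
    """Suffix array, its inverse, and the sorted ranks of the chosen suffixes."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    isa = [0] * len(sa)
    for r, i in enumerate(sa):
        isa[i] = r
    ranks = sorted(isa[a] for a in starts)   # stand-in for the fusion tree
    return sa, isa, ranks


def lcp(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:] (naive)."""
    k = 0
    while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def closest_suffix(text, sa, isa, ranks, i):
    """A start position from the collection maximizing the lcp with text[i:].

    Only the two lexicographic neighbours of text[i:] need to be examined.
    """
    r = bisect_left(ranks, isa[i])
    candidates = []
    if r > 0:
        candidates.append(sa[ranks[r - 1]])  # predecessor in suffix order
    if r < len(ranks):
        candidates.append(sa[ranks[r]])      # successor in suffix order
    return max(candidates, key=lambda a: lcp(text, a, i))
```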

Let and let be the compressed trie of fragments in . Note that can be easily constructed in time using the enhanced suffix array. For each edge we store a fragment of representing its label and for each terminal node its rank in . Moreover, for each explicit node of we store pointers to the first and last (in pre-order) terminal nodes in its subtree as well as the following two components: a fusion tree containing the children of indexed by the first character of the corresponding edge label, and a data structure of for , where is the (weighted) depth of and contains whenever the locus of is in the subtree of . Finally, for each we store a fusion tree containing (pointers to) all explicit nodes of which represent prefixes of , indexed by their (weighted) node depths. All these components can be constructed in time, with applied to build fusion trees.

Let us proceed to the description of a query algorithm. Let be the decomposition of the given -fragment into -fragments, and let for . We shall scan all consecutively and after processing , store a pointer to the (possibly implicit) node defined as the locus of the longest prefix of present in . We start with whose locus is the root of . Therefore, it suffices to describe how to determine provided that we know .

If is at depth smaller than , there is nothing to do, since . Otherwise, we proceed as follows: Let be the nearest explicit descendant of ( if is explicit), and let be a fragment of representing the label from to . First, we check if is a proper prefix of . If not, is on the same edge of and its depth . Thus, we may assume that is a proper prefix of . Let . We make a query to the data structure of built for with as the query suffix. This lets us determine an index such that