Substring Suffix Selection

Maxim Babenko, Higher School of Economics, Moscow, Russia, maxim.babenko@gmail.com
Paweł Gawrychowski, Max-Planck-Institut für Informatik, Saarbrücken, Germany, gawry@cs.uni.wroc.pl
Tomasz Kociumaka, Institute of Informatics, University of Warsaw, Warsaw, Poland, kociumaka@mimuw.edu.pl
Tatiana Starikovskaya, Higher School of Economics, Moscow, Russia, tat.starikovskaya@gmail.com

Abstract

We study the following substring suffix selection problem: given a substring of a string $T$ of length $n$, compute its $k$-th lexicographically smallest suffix. This is a natural generalization of the well-known question of computing the maximal suffix of a string, which is a basic ingredient in many other problems.

We first revisit two special cases of the problem, introduced by Babenko, Kolesnichenko and Starikovskaya [CPM’13], in which we are asked to compute the minimal non-empty and the maximal suffixes of a substring. For the maximal suffix problem, we give a linear-space structure with $O(1)$ query time and linear preprocessing time, i.e., we manage to achieve optimal construction and optimal query time simultaneously. For the minimal suffix problem, we give a linear-space data structure with $O(\tau)$ query time and $O(n \log n / \tau)$ preprocessing time, where $1 \le \tau \le \log n$ is a parameter of the data structure. As a sample application, we show that this data structure can be used to compute the Lyndon decomposition of any substring of $T$ in $O(k\tau)$ time, where $k$ is the number of distinct factors in the decomposition.

Finally, we move to the general case of the substring suffix selection problem, where using any combinatorial properties seems more difficult. Nevertheless, we develop a linear-space data structure with $O(\log^{2+\varepsilon} n)$ query time.

1 Introduction

Computing the $k$-th lexicographically smallest suffix of a string is both an interesting problem on its own and a crucial ingredient in solutions to many other problems. As an example of the former, a well-known result by Duval [9] is that the maximal suffix of a string can be found in linear time and constant additional space. As an example of the latter, the famous constant-space pattern matching algorithm of Crochemore and Perrin is based on the so-called critical factorizations, which can be found by looking at maximal suffixes [7]. In the more general version, a straightforward way to compute the $k$-th suffix of a string is to construct its suffix array, which results in a linear time and space solution, assuming that we can sort the letters in linear time. Surprisingly, one can achieve linear time complexity even without such an assumption, as shown by Franceschini and Muthukrishnan [10].

We consider a natural generalization of the question of locating the $k$-th suffix of a string. We assume that the string for which we are asked to compute the $k$-th suffix is actually a substring of a longer text $T$ given in advance. Information about $T$ can be preprocessed and then used to significantly speed up the computation of the desired suffixes of a query string. This is a natural setting whenever large collections of text data are stored. Other problems studied in this setting include substring-restricted pattern matching, where we are asked to return occurrences of a given word in some specified interval [4], the factor periodicity problem, where we are asked to compute the period of a given substring [13], and substring compression, where the goal is to output a compressed representation of a given substring [6, 12].

We start with two special cases of the problem, namely, computing the minimal non-empty and the maximal suffixes of a substring of . These two problems were introduced in [2]. The authors proposed two linear-space data structures for . Using the first data structure, one can compute the minimal suffix of any substring of in time. The second data structure allows to compute the maximal suffix of a substring of in time. Here we improve upon both of these results. First, we describe a series of linear-space data structures that allow, for any , to compute the minimal suffix of a substring of in time. Construction time is . Secondly, we describe a linear-space data structure for the maximal suffix problem with query time. The data structure can be constructed in linear time. Computing the maximal or the minimal suffix is a fundamental tool used in more complex algorithms, so our results can hopefully be used to efficiently solve also other problems in such setting, i.e., when we are working with substrings of some long text . As a particular application, we show how to compute the Lyndon decomposition [5] of a substring of in time, where is the number of distinct factors in the decomposition.

We then proceed to the general case of the problem, which is much more interesting from the practical point of view. It is also substantially more difficult, mostly because the -th suffix of a substring does not enjoy the combinatorial properties the minimal and the maximal suffixes have. Nevertheless, we are able to propose a linear-space data structure with query time for the general case.

Our data structures are designed for the standard word-RAM model, see [1] for a definition. We assume that letters in can be sorted in time.

2 Preliminaries

We start by introducing some standard notation and definitions. Let $\Sigma$ be a finite ordered non-empty set (called an alphabet). The elements of $\Sigma$ are letters.

A finite ordered sequence of letters (possibly empty) is called a string. Letters in a string are numbered starting from 1, that is, a string $T$ of length $k$ consists of letters $T[1], T[2], \ldots, T[k]$. The length $k$ of $T$ is denoted by $|T|$. For $i \le j$, $T[i..j]$ denotes the substring of $T$ from position $i$ to position $j$ (inclusive). If $i > j$, $T[i..j]$ is defined to be the empty string. Also, if $i = 1$ or $j = |T|$ then we omit these indices and we write just $T[..j]$ and $T[i..]$. Substring $T[..j]$ is called a prefix of $T$, and $T[i..]$ is called a suffix of $T$. A border of a string $T$ is a string that is both a prefix and a suffix of $T$ but differs from $T$.

A string $T$ is called periodic with period $\rho$ if $T = \rho^k \rho'$ for an integer $k \ge 1$ and a (possibly empty) proper prefix $\rho'$ of $\rho$. When this leads to no confusion, the length of $\rho$ will also be called a period of $T$. Borders and periods are dual notions; namely, if $T$ has period $\rho$ then it has a border of length $|T| - |\rho|$, and vice versa (see, e.g., [8]).
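The duality between borders and periods stated above can be checked on small inputs. The following is a minimal sketch, not taken from the paper: the function names `border_lengths` and `period_lengths` are hypothetical, and it assumes the standard reading that a string $w$ has a period of length $p$ exactly when it has a border of length $|w| - p$ (the empty border corresponds to the trivial period $|w|$).

```python
def border_lengths(w: str) -> list[int]:
    """Lengths of all borders of w (including the empty one), in increasing order.

    Uses the classic prefix function: pi[i] is the length of the longest
    border of w[:i + 1].
    """
    n = len(w)
    if n == 0:
        return []
    pi = [0] * n
    for i in range(1, n):
        k = pi[i - 1]
        while k > 0 and w[i] != w[k]:
            k = pi[k - 1]
        if w[i] == w[k]:
            k += 1
        pi[i] = k
    lengths = []
    k = pi[-1]
    while k > 0:          # follow the chain of borders of borders
        lengths.append(k)
        k = pi[k - 1]
    lengths.append(0)     # the empty string is always a border
    return lengths[::-1]


def period_lengths(w: str) -> list[int]:
    """All p in 1..|w| such that w[i] == w[i + p] wherever both are defined."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]
```

For instance, `"abaab"` has borders of lengths 0 and 2 and therefore periods of lengths 5 and 3.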

Letters are treated as integers in a range ; a pair of letters can be compared in time. This lexicographic order over is linear and can be extended in a standard way to the set of strings in . Namely, if either (i) is a prefix of ; or (ii) there exists such that , and .

3 Suffix Array and Related Data Structures

Consider a fixed string . For let denote . The set of all non-empty suffixes of is also denoted as . The suffix array of a string  is a permutation of defining the lexicographic order on . More precisely, if the rank of in the lexicographic order on is . The inverse permutation is denoted by ; it reduces lexicographic comparison of suffixes and to integer comparison of their ranks and . For a string , both and occupy linear space and can be constructed in linear time (see [14] for a survey). For strings we denote the length of their longest common prefix by , and of their longest common suffix by .
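To make the definitions concrete, here is a small sketch that builds a suffix array and its inverse by plain sorting. It is illustrative only: the names `suffix_array` and `inverse_suffix_array` are hypothetical, positions are 1-based as in the text, and the quadratic-time sort stands in for the linear-time constructions cited above.

```python
def suffix_array(t: str) -> list[int]:
    """1-based starting positions of the non-empty suffixes of t, sorted
    lexicographically (naive construction, for illustration only)."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def inverse_suffix_array(sa: list[int]) -> list[int]:
    """isa[i] is the 1-based rank of the suffix starting at position i.
    Index 0 is unused so that positions stay 1-based."""
    isa = [0] * (len(sa) + 1)
    for rank, pos in enumerate(sa, start=1):
        isa[pos] = rank
    return isa
```

With the inverse array in hand, comparing two suffixes reduces to comparing two integers: the suffix starting at `i` is smaller than the one starting at `j` exactly when `isa[i] < isa[j]`.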

While and its reverse are useful themselves, equipped with additional data structures they are even more powerful. We use several classic applications listed below.

Lemma 1

A string of length can be preprocessed in time so that the following queries can be answered in time:

  1. given substrings , compute and determine if ,

  2. given indices compute the maximal and minimal suffix in ,

Proof

Queries of the first type are a classic application of the LCP array equipped with a data structure for range minimum queries, see [7] for details. Queries of the second type are just range minimum (maximum) queries on the inverse suffix array; it suffices to equip it with the appropriate data structure [3].
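The first query type can be sketched as follows. This is an assumption-laden illustration rather than the construction referenced in the proof: the class name `EnhancedSuffixArray` and its methods are hypothetical, positions are 0-based, substrings are half-open `(start, end)` pairs and must be non-empty, the preprocessing is naive (quadratic), and a sparse table is used for range minima, which costs $O(n \log n)$ space instead of the linear-space structure of [3]. Only the query path matches the constant-time claim.

```python
class EnhancedSuffixArray:
    """Suffix array + adjacent-LCP array + sparse table for range minima."""

    def __init__(self, text: str):
        self.text = text
        n = len(text)
        # Naive construction for brevity; linear-time algorithms exist.
        self.sa = sorted(range(n), key=lambda i: text[i:])
        self.rank = [0] * n
        for r, pos in enumerate(self.sa):
            self.rank[pos] = r
        # adjacent[r] = LCP of the suffixes of ranks r and r + 1.
        adjacent = [self._scan_lcp(self.sa[r], self.sa[r + 1])
                    for r in range(n - 1)]
        # table[h][i] = min(adjacent[i : i + 2**h]).
        self.table = [adjacent]
        width = 1
        while 2 * width <= len(adjacent):
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + width])
                               for i in range(len(adjacent) - 2 * width + 1)])
            width *= 2

    def _scan_lcp(self, i: int, j: int) -> int:
        k, n = 0, len(self.text)
        while i + k < n and j + k < n and self.text[i + k] == self.text[j + k]:
            k += 1
        return k

    def suffix_lcp(self, i: int, j: int) -> int:
        """LCP of the suffixes starting at i and j: two table look-ups."""
        if i == j:
            return len(self.text) - i
        lo, hi = sorted((self.rank[i], self.rank[j]))
        level = (hi - lo).bit_length() - 1
        row = self.table[level]
        return min(row[lo], row[hi - (1 << level)])

    def substring_lcp(self, a, b) -> int:
        (s1, e1), (s2, e2) = a, b
        return min(self.suffix_lcp(s1, s2), e1 - s1, e2 - s2)

    def less_than(self, a, b) -> bool:
        """True when substring a is lexicographically smaller than b."""
        (s1, e1), (s2, e2) = a, b
        common = self.substring_lcp(a, b)
        if common == e1 - s1 or common == e2 - s2:
            return (e1 - s1) < (e2 - s2)  # one is a prefix of the other
        return self.text[s1 + common] < self.text[s2 + common]
```

The design choice here is readability: a sparse table makes the two-look-up query easy to follow, at the price of extra space compared with the structures the paper relies on.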

These simple queries can be used to answer more involved ones.

Lemma 2

The enhanced suffix array can answer the following queries in constant time. Given substrings of  compute the largest integer such that is a prefix of .

Proof

Let . If , then the answer is clearly 0. Otherwise, we claim . Indeed, if , then , i.e. , since by maximality of . On the other hand a simple inductive argument shows that implies that is a prefix of .
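The idea behind Lemma 2 can be illustrated on plain strings. The sketch below is a reconstruction under an assumption, since the formulas of the proof are not legible here: if $x$ is a prefix of $y$ and $\ell$ is the longest common prefix of $y$ and $y$ with its first $|x|$ letters removed, then the largest exponent is $1 + \lfloor \ell / |x| \rfloor$. The names `lcp` and `max_power_prefix` are hypothetical, and the linear-scan `lcp` stands in for the constant-time query of Lemma 1.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (linear scan)."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


def max_power_prefix(x: str, y: str) -> int:
    """Largest alpha such that x repeated alpha times is a prefix of y."""
    if not x:
        raise ValueError("x must be non-empty")
    if lcp(x, y) < len(x):
        return 0  # x itself is not a prefix of y
    # y agrees with its own shift by |x| on `shift_lcp` positions, so the
    # prefix of y of length |x| + shift_lcp has period |x|.
    shift_lcp = lcp(y, y[len(x):])
    return 1 + shift_lcp // len(x)
```

For example, `max_power_prefix("ab", "ababa")` evaluates to 2: the shifted comparison matches three letters, and $1 + \lfloor 3/2 \rfloor = 2$.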

Note that the queries on the enhanced suffix array of , the reverse of , are also meaningful in terms of . In particular for a pair of substrings we can compute and the largest integer such that is a suffix of .

4 Minimal Suffix

Consider a string of length . In this section we first describe a linear-space data structure for that can be constructed in time and allows to compute the minimal non-empty suffix of any substring in time. Then we explain how to modify the data structure to obtain construction time and query time for any .

For each we select substrings , which we call canonical. We denote the -th longest canonical substring ending at position by . The substring is . For we set and define so that

Note that the number of such substrings is logarithmic for each . Moreover, if we split into chunks of size each, then will start at the boundary of one of these chunks. This alignment property will be crucial for the construction algorithm. Below we explain how to use canonical substrings to compute the minimal suffix of . We start with two auxiliary facts.

Fact 4.1

For any and with we have .

Proof

For the statement holds trivially. Consider . Let , as before, denote . If is even, then is odd and we have

while for odd

For a pair of integers , define to be the largest integer such that is a proper suffix of .

Fact 4.2

Given integers , the value can be computed in constant time.

Proof

Let . Observe that

and

Consequently is equal to , or which can be verified in constant time.

Lemma 3

The minimal suffix of is either equal to

  1. , where is the starting position of the minimal suffix in , or

  2. the minimal suffix of .

Proof

By Lemma  in [2] the minimal suffix is either equal to or to its shortest non-empty border. Moreover, in the latter case the length of the minimal suffix is at most . On the other hand, by Fact 4.1 the length of is at least . Thus, in the second case the minimal suffix of is the minimal suffix of .

Recursively applying Lemma 3 we obtain the following

Corollary 1

For let be the minimal suffix in , and let be the minimal suffix in . The minimal suffix of starts at one of the positions in .

With some knowledge about the minimal suffixes of canonical substrings, the set of candidate positions can be reduced.

Observation 4.3

For any such that :

  1. if the minimal suffix of is longer than , then positions do not need to be considered as candidates in Corollary 1,

  2. if the minimal suffix of is not longer than , then positions do not need to be considered as candidates in Corollary 1.

We now explain how this result is used to achieve the announced time and space bounds.

4.1 Data Structure

Apart from the enhanced suffix array, we store, for each , a bit vector of length . Here if and only if or the minimal suffix of is longer than . Since , each vector can be stored in a constant number of machine words, which gives space in total.

4.2 Query

To compute the minimal suffix of , we determine (see Fact 4.2) and locate the highest set bit such that . As and for Observation 4.3 implies that the minimal suffix starts at either or . is the minimum in , and is the minimum in . Hence the enhanced suffix array can be used to compute and as well as find the smaller of and , all in time.

4.3 Construction

It suffices to explain how vectors are computed. At the beginning we set all bits to . For each we compute the minimal suffixes of , where and . To do this, we first divide into chunks of length . Each substring starts at the beginning of one of these chunks and has length smaller than . Hence is a prefix of a substring consisting of at most four consecutive chunks. Recall that a variant of Duval’s algorithm (see Algorithm 3.1 in [9]) takes linear time to compute the lengths of minimal suffixes of all prefixes of a given string. We run this algorithm for every four (or fewer at the end) consecutive chunks and thus obtain the minimal suffixes of the substrings , where and , in time. The value of can now be found directly by comparing the length of the minimal suffix of with . Note that the space usage is . We have thus proved the following theorem.

Theorem 4.4

A string $T$ of length $n$ can be stored in an $O(n)$-space structure that enables computing the minimal suffix of any substring of $T$ in $O(1)$ time. This data structure can be constructed in $O(n \log n)$ time.

To obtain a data structure with construction and query time, we define the bit-vectors in a slightly different way. We set to be of size with if and only if or the minimal suffix of is longer than . This way we need only phases in the construction algorithm, so it takes time.

Again, let denote the starting position of the minimal suffix in . To compute the minimal suffix of , we determine and locate the highest set bit , . Then, by Observation 4.3 and for implies the minimal suffix starts at one of the positions . Each of these positions can be computed in constant time, each two of the suffixes can be compared in constant time as well. That is, the data structure allows to compute the minimal suffix of any substring in time. Summing up,

Theorem 4.5

For any $1 \le \tau \le \log n$, a string $T$ of length $n$ can be stored in an $O(n)$-space data structure that enables computing the minimal suffix of any substring of $T$ in $O(\tau)$ time. This data structure can be constructed in $O(n \log n / \tau)$ time.

4.4 Applications

As a corollary we obtain an efficient data structure for computing Lyndon decompositions of substrings of . We recall the definitions first. A string is said to be a Lyndon word if and only if it is strictly smaller than its proper cyclic rotations. For a nonempty string a decomposition is called a Lyndon decomposition if and only if are Lyndon words, see [5].

Lemma 4 ([9])

If is a Lyndon decomposition, then is the minimal suffix of .

Lemma 5

Let , where is the minimal suffix of and does not end with . Let be the Lyndon decomposition of . Then is the Lyndon decomposition of .

Proof

Any word admits a unique Lyndon decomposition [5]. Let be the Lyndon decomposition of . From Lemma 4 we obtain that , moreover is the minimal suffix of , so . Clearly , which proves equality. From the definition it follows that is the Lyndon decomposition of and hence it coincides with the decomposition . The claim follows.

Corollary 2

For any $1 \le \tau \le \log n$, a string $T$ of length $n$ can be stored in an $O(n)$-space data structure that enables computing the Lyndon decomposition of any substring of $T$ in $O(k\tau)$ time, where $k$ is the number of distinct factors in the decomposition. This data structure can be constructed in $O(n \log n / \tau)$ time.
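The way Lemmas 4 and 5 yield a decomposition can be sketched directly on strings: repeatedly find the minimal suffix, strip all its trailing copies, and continue with what remains. In this illustration the brute-force `minimal_suffix` is a stand-in for the minimal suffix query of the data structure, and Duval's standard factorization algorithm is included only as an independent cross-check. All function names are hypothetical.

```python
def minimal_suffix(w: str) -> str:
    """Lexicographically smallest non-empty suffix of w (brute force)."""
    return min(w[i:] for i in range(len(w)))


def lyndon_by_peeling(w: str) -> list[tuple[str, int]]:
    """Lyndon decomposition of w as (factor, multiplicity) pairs, left to right.

    One minimal suffix computation is spent per distinct factor.
    """
    groups = []
    while w:
        v = minimal_suffix(w)
        count = 0
        while w.endswith(v):      # strip every trailing copy of v
            w = w[:-len(v)]
            count += 1
        groups.append((v, count))
    groups.reverse()
    return groups


def duval(s: str) -> list[str]:
    """Standard Duval factorization, used here only to validate the peeling."""
    n, i, factors = len(s), 0, []
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors
```

On `"banana"` the peeling produces `b`, then `an` twice, then `a`, which agrees with Duval's algorithm.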

5 Maximal Suffix

We now turn to the maximal suffix problem. Our solution is based on the following notion. For we say that a position is -active if there is no position such that . Equivalently, is -active exactly when the suffix is the maximal suffix of some substring of ending at . From the definition it follows that a starting position of the maximal suffix of is the minimal -active position in .

Example 1

If , the -active positions are .
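The notion of active positions can be explored with a brute-force sketch. It assumes the definition reads as follows: a position $i \le j$ is $j$-active when no later position $i' \in (i, j]$ satisfies $T[i'..j] > T[i..j]$. The function names and the sample string are hypothetical choices for illustration, and positions are 1-based.

```python
def active_positions(t: str, j: int) -> list[int]:
    """All j-active positions of t under the assumed definition (1-based)."""
    prefix = t[:j]
    result = []
    for i in range(1, j + 1):
        own = prefix[i - 1:]
        # i is active if no suffix starting later (and ending at j) is larger.
        if all(prefix[q - 1:] <= own for q in range(i + 1, j + 1)):
            result.append(i)
    return result


def maximal_suffix_start(t: str, i: int, j: int) -> int:
    """Starting position of the maximal suffix of t[i..j] (brute force)."""
    return max(range(i, j + 1), key=lambda p: t[p - 1:j])
```

For the string `"dcccabab"` and `j = 8` this returns `[1, 2, 3, 4, 6, 8]`, and the maximal suffix of any `t[i..8]` starts at the smallest listed position that is at least `i`.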

We will not store -active positions for each explicitly because there can be too many of them. Instead we will consider, for each , a partition of an interval into a number of disjoint subintervals. For this partition we will keep a bit vector where set bits correspond to the subintervals containing -active positions. Computing the maximal suffix of will consist of three steps: first, we compute the subinterval belongs to, call it , and, using the bit vector, the leftmost subinterval completely to the right and containing a -active position, call it . Then the minimal -active position must lie in one of these two subintervals. More precisely, it either lies in , or in . We separately compute the maximal suffix of starting in these two subintervals, and return the lexicographically larger one.

5.1 Data Structure

Our data structure for computing maximal suffixes of substrings of consists of two parts. Partitions and bit vectors will be used to locate the first subinterval to the right of that contains a -active suffix, and data structures associated with suffix arrays of and for the reverse of will be used to compute the minimal -active position in this subinterval.

Nice partitions and bit vectors: Nice partitions are defined recursively. The nice partition of consists of disjoint subintervals and satisfies the following properties:

  1. ;

  2. Length of each subinterval is a power of two;

  3. Lengths of each two consecutive subintervals are the same, or differ by a factor of two;

  4. There are no three subintervals of equal length.

The nice partition of an interval consists of the interval itself. Given a nice partition of we can create a nice partition of by adding a new interval . Then it might be the case that there are three intervals of length . In such case we merge the two leftmost ones into one of length and repeat until there are at most 2 intervals of each length. The result is a nice partition of satisfying properties 1-4.
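The incremental construction just described behaves like a binary counter whose digits are 1 or 2. A minimal sketch follows; the function name `extend_nice_partition` is hypothetical, intervals are stored as inclusive `(start, end)` pairs in left-to-right order, and the linear scan for equal-length intervals is chosen for clarity rather than speed.

```python
def extend_nice_partition(parts: list[tuple[int, int]], j: int) -> list[tuple[int, int]]:
    """Turn a nice partition of [1, j - 1] into a nice partition of [1, j].

    `parts` is modified in place and also returned.
    """
    parts.append((j, j))
    size = 1
    while True:
        same = [p for p, (a, b) in enumerate(parts) if b - a + 1 == size]
        if len(same) < 3:
            break
        # Intervals of equal length are adjacent, so merging the two
        # leftmost ones yields one interval of twice the length.
        p = same[0]
        parts[p:p + 2] = [(parts[p][0], parts[p + 1][1])]
        size *= 2
    return parts
```

Starting from an empty list and extending for `j = 1, ..., 7` gives `[(1, 4), (5, 6), (7, 7)]`.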

For each we store a bit vector of length indicating which subintervals of the partition contain -active positions.

We will also make use of two pre-computed tables. For each and for each we store the number of set bits in a prefix of of length and the position of storing the -th set bit. This way we can answer any rank/select query on a bit vector of length by a constant number of table look-ups.
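The rank/select table can be sketched for a small word size. This is an illustration under assumptions: the name `build_rank_select_tables` and the parameter `b` (the number of bits per vector) are hypothetical, bit vectors are encoded as integers with position `p` stored in bit `p`, and the table has $2^b$ rows, so `b` must be small relative to $\log n$ for the table to fit in linear space.

```python
def build_rank_select_tables(b: int):
    """Precompute rank and select answers for every bit vector of length b.

    rank[v][m]   = number of set bits among positions 0..m-1 of v
    select[v][k] = position of the (k + 1)-th set bit of v
    """
    rank, select = [], []
    for v in range(1 << b):
        bits = [(v >> p) & 1 for p in range(b)]
        prefix = [0]
        for bit in bits:
            prefix.append(prefix[-1] + bit)
        rank.append(prefix)
        select.append([p for p, bit in enumerate(bits) if bit])
    return rank, select
```

After this preprocessing each rank or select query is a single list look-up, for example `rank[0b10110][3]` counts the set bits at positions 0, 1 and 2.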

The second table will be used for locating the subinterval of the partition of containing . The partition of is completely determined by specifying such that last subinterval is of length , and one word of length , where the -th bit is set when there are two blocks of length in the partition. We store the answers for each and for each possible position not larger than . Again, we are able to process a query with a constant number of table lookups.

5.2 Query

Suppose that we are asked to find the maximal suffix of a substring . Recall that we want to do this in three steps: first, locate the subinterval belongs to, call it , then find the leftmost interval on its right containing a -active suffix, call it . Using the second table we compute the subinterval of the partition of containing , and then we can use rank/select queries to retrieve the leftmost subinterval to the right containing -active position. Overall, the first step takes constant time.

The second step is to compute the lexicographically maximal suffix of assuming that it starts in , and the third step is to compute the lexicographically maximal suffix of assuming that it starts in . Both these steps are actually very similar: it is enough to show how to find the lexicographically maximal suffix of assuming that it starts in , where (such assumption follows from the definition of a nice partition, where in the worst possible case ). For this we need some combinatorial properties of maximal suffixes which we prove below. Let be the desired lexicographically maximal suffix of . The goal will be to show that knowing the length of up to a factor of two is actually enough to compute in time.

Lemma 6 (Lemma  in [2])

Let be a prefix of and let , where is the maximal suffix in . If is not a prefix of , then . Otherwise, is also a prefix of and moreover .

Let be the maximal suffix in and the maximal suffix in . Assume that starts somewhere in , so that is a prefix of . Define and assume that is a prefix of (if not, the above lemma immediately gives us ). We state two more lemmas which describe the properties of such suffixes and when the length of is smaller than (i.e., when ). These lemmas are essentially Lemmas  and  of [2], but because we use different notation, we repeat their proofs here.

Lemma 7

With the notation above, is the shortest period of , i.e., where and is a proper prefix of , and is the shortest string for which such decomposition exists. Moreover, actually .

Proof

Since is a border of , is a period of . It remains to prove that no shorter period is possible. So, consider the shortest period , and assume that . Then , and by the periodicity lemma substring  has another period . Since is the shortest period, must be a multiple of , i.e., for some .

Suppose that . Then prepending both parts of the latter inequality by copies of gives for any , so from the transitivity of we get that , which contradicts the maximality of in . Therefore , and consequently . But and , so is larger than and belongs to , which is a contradiction.

The final observation that follows from the condition that .

Figure 1: A schematic illustration of Lemma 8.

Lemma 8

Suppose that . If , then is the longest suffix of equal to for some integer , see also Fig. 1.

Proof

Clearly is a border of . Since this implies . Consequently the occurrences of as a prefix and as a suffix of have an overlap of at least positions. As is a period of , this implies that is also a period of . Thus , where is an integer and is a proper suffix of . Moreover is a prefix of , since it is a prefix of , which is a prefix of . Now would imply a non-trivial occurrence of in , which contradicts being primitive. Thus . If , then , so is the longest suffix of equal to for some integer .

Lemma 9

Given a subinterval such that , and assuming that the lexicographically largest suffix of starts there, we can compute in time.

Proof

Let be the starting position of the maximal suffix in ; then is a prefix of . Let be the starting position of the maximal suffix in . and can be computed in constant time using two range maxima queries on . Then we check if is a prefix of . If not, by Lemma 6 . Otherwise we set and define to be the largest integer such that is a suffix of (or, equivalently, is a suffix of ) and set . Lemma 2 allows computing in time using the enhanced suffix array of . The lengths of the suffixes of starting within are within a multiplicative factor of 2 of each other, so Lemmas 7 and 8 imply .

We apply the above lemma twice for the subintervals and found in the first step. Finally, we compare the suffixes of found in the second and third step in constant time, and return the larger one.

5.3 Construction

We start the construction with building the tables, which takes time. In the main phase we scan positions of from the left to the right maintaining the list of active positions and computing the bit vectors.

We start with a lemma describing changes in the list of active suffixes upon a transition from to .

Lemma 10

If the list of all -active positions consists of , the list of -active positions can be created by adding , if or , and repeating the following procedure: if and are two neighbours on the current list, and , remove or from the list, depending on whether or , respectively.

Proof

First we prove that if a position is not -active, then it is not -active either. Indeed, if is not -active, then by the definition there is a position such that . Consequently, and is not -active. Hence, the only possible candidates for -active positions are -active positions and a position .

Secondly, note that if is a -active position and is a prefix of , then is -active too. Suppose the contrary. Then there exists a position , , such that , and it follows that , a contradiction.

A -active position is not -active only if (1) or (2) there exists such that is a prefix of , i.e., is -active, and , or, equivalently, . Both of these cases will be detected by the deletion procedure.

Example 2

If , the -active positions are , and the -active positions are , i.e., we add to the list of -active positions, and then remove .

The list of active positions will be maintained in the following way. After transition from the list of -active positions to the list of -active positions new pairs of neighbouring positions appear. For each such pair we compute and hence the smallest when one of them should be removed from the list, and add a pointer from to the pair .

When we actually reach , we check if and are still neighbours. If they are, we remove the appropriate element from the current list. Otherwise we do nothing. From Lemma 10 it follows that the two possible updates of the list under transition from to are adding or deleting some position from the list. This guarantees that the process of deletion described in Lemma 10 and the process we have just described are actually equivalent.

Suppose that we already know the list of -active positions, the bit vector describing the nice partition of , and the number of -active positions in each subinterval of the partition. First we update the list of -active positions. When a position is deleted from the list, we use the pre-computed table to find the subinterval the position belongs to, and decrement the counter of active positions in this subinterval. If the counter becomes equal to zero, we set the corresponding bit of the bit vector to zero. Then we start updating the partition: first we append a new subinterval to the partition of and initialize the counter of active positions in this subinterval by one. If then we have three intervals of length , we merge the two leftmost ones into one interval of length , add their counters, update the bit vectors, and repeat, if necessary. All these operations will take amortized time.

Theorem 5.1

A string $T$ of length $n$ can be stored in an $O(n)$-space structure that allows computing the maximal suffix of any substring of $T$ in $O(1)$ time. The data structure can be constructed in $O(n)$ time.

6 General Substring Suffix Selection

In the previous sections we considered the problems of computing the minimal and the maximal suffixes of a substring. Here we develop a data structure for the general case of the suffix selection problem. Recall that a query, given a substring of $T$ and an integer $k$, returns (the length of) the $k$-th smallest suffix of that substring.

For strings we define as the number of suffixes of not larger than . Our data structure is based on the following fact.

Fact 6.1

Let be the -th smallest suffix of and let be the minimal suffix of such that . Then is a prefix of , and there are exactly longer prefixes of which are simultaneously suffixes of .

Proof

Let . Observe that and , so , which means that is indeed a prefix of . A similar reasoning shows that is a prefix of for each . Conversely, any suffix of larger than but not larger than must be a prefix of , so are exactly the longer prefixes of simultaneously being suffixes of .

The query algorithm performs a binary search to determine , calling a subroutine to compute . Then for it finds the -th largest prefix of simultaneously being a suffix of .
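The two-step query can be prototyped with brute-force stand-ins. This sketch rests on one reading of Fact 6.1: let $x$ be the smallest suffix of the whole text whose rank among the suffixes of the query substring is at least $k$; then the answer is the $(\mathrm{rank}(x) - k + 1)$-th longest prefix of $x$ that is also a suffix of the substring. The names `rank_in` and `select_suffix_length` are hypothetical, positions are 1-based, and both the rank computation and the prefix-suffix enumeration are naive replacements for Lemmas 11 and 12.

```python
def rank_in(sub: str, x: str) -> int:
    """Number of non-empty suffixes of sub that are not larger than x."""
    return sum(1 for p in range(len(sub)) if sub[p:] <= x)


def select_suffix_length(t: str, i: int, j: int, k: int) -> int:
    """Length of the k-th smallest suffix of t[i..j] (1-based, inclusive)."""
    sub = t[i - 1:j]
    if not 1 <= k <= len(sub):
        raise ValueError("k must be between 1 and the substring length")
    # Step 1: binary search over the suffix array of t; the rank is
    # monotone along the lexicographic order of suffixes.
    sa = sorted(range(len(t)), key=lambda p: t[p:])
    lo, hi = 0, len(sa) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if rank_in(sub, t[sa[mid]:]) >= k:
            hi = mid
        else:
            lo = mid + 1
    x = t[sa[lo]:]
    r = rank_in(sub, x)
    # Step 2: prefixes of x that are suffixes of sub, longest first.
    common = [m for m in range(min(len(x), len(sub)), 0, -1)
              if x[:m] == sub[len(sub) - m:]]
    return common[r - k]
```

For `t = "banana"` and the substring `t[2..5] = "anan"`, the suffixes in sorted order are `an`, `anan`, `n`, `nan`, so the four queries return lengths 2, 4, 1 and 3.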

The second step is performed using Prefix-Suffix Queries, defined as follows. For given substrings , of we are supposed to find (the lengths of) all prefixes of which are simultaneously suffixes of . The lengths are reported as a sequence of sets such that , for each values in form an arithmetic progression and each element of is larger than each element of .

Lemma 11

For any Prefix-Suffix Queries can be answered in time by a data structure of size .

Proof

In [13] Kociumaka et al. considered similar queries, where (the lengths of) all borders of a given substring were reported. Here it suffices to store their data structure for . Given a query with and it is enough to find the borders of and filter out those longer than .

Now it suffices to show how can be efficiently computed.

Lemma 12

For any and a string of length , there is an -size data structure that given integers computes in time.

Proof

Note that if instead of counting suffixes of which are not larger than we were to count suffixes in which are not larger than , our problem could be immediately reduced to 2D orthogonal range counting on a set , the query rectangle would be . While our queries require more attention, we still use the data structure of [11], which stores using space and answers range counting queries in time.

Observe that the number of suffixes of smaller than is equal to the number of suffixes in smaller than plus the number of suffixes in which are bigger than , but trimmed at the position become smaller than (i.e. trimmed suffixes become prefixes of ). The first term is determined using range counting as described above, while computing the second one is a bit trickier. We use Prefix-Suffix Queries to find all suffixes of which are simultaneously prefixes of , and for each arithmetic progression reported we count suffixes that are bigger than .

Consider one of the progressions. Let be the starting positions as suffixes of . Then all substrings are equal to . This means that , , can be represented as , where and is a fixed string which does not start with . Let , where is the maximal exponent possible. If , then the order between and coincides with the order between and , if , then the order coincides with the order between and , and in the case the order is defined by the order between and . It follows that to compute the number of suffixes bigger than we are to determine and , and to compare at most three pairs of substrings of T. This can be done in constant time using the enhanced suffix array.

Theorem 6.2

For any $\varepsilon > 0$ there is a data structure of size $O(n)$, which can answer substring suffix selection queries in $O(\log^{2+\varepsilon} n)$ time.

7 Conclusion

In this paper we studied the substring suffix selection problem. We first revisited two special cases of the problem. For the problem of computing the minimal suffix of a substring we proposed a series of linear-space data structures with query time and