Optimal Cache-Aware Suffix Selection
Abstract.
Given string $S[1..N]$ and integer $k$, the suffix selection problem is to determine the $k$th lexicographically smallest amongst the suffixes $S[i..N]$, $1 \le i \le N$. We study the suffix selection problem in the cache-aware model that captures two-level memory inherent in computing systems, for a cache of limited size $M$ and block size $B$. The complexity of interest is the number of block transfers. We present an optimal suffix selection algorithm in the cache-aware model, requiring $\Theta(N/B)$ block transfers, for any string $S$ over an unbounded alphabet (where characters can only be compared), under the common tall-cache assumption (i.e. $M = \Omega(B^{1+\epsilon})$, where $\epsilon < 1$). Our algorithm beats the bottleneck bound for permuting an input array to the desired output array, which holds for nearly any nontrivial problem in hierarchical memory models.
Gianni Franceschini
Roberto Grossi
S. Muthukrishnan
Introduction
Background: Selection vs Sorting
A collection of $N$ numbers can be sorted using $\Theta(N \log N)$ comparisons. On the other hand, the famous five-author result [2] from the early 70's shows that the problem of selection, that is, choosing the $k$th smallest number, can be solved using $O(N)$ comparisons in the worst case. Thus, selection is provably simpler than sorting in the comparison model.
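As an illustration of linear-time selection, here is a minimal Python sketch of the median-of-medians strategy behind the five-author result. The function name `bfprt_select` and the list-based partitioning are choices made for this sketch; it is not tuned for constant factors or memory use.

```python
def bfprt_select(items, k):
    """Return the k-th smallest element (1-based) using median-of-medians pivoting."""
    items = list(items)
    if not 1 <= k <= len(items):
        raise ValueError("k out of range")
    while True:
        if len(items) <= 5:
            return sorted(items)[k - 1]
        # Median of each group of five, then (recursively) the median of those medians.
        groups = [sorted(items[i:i + 5]) for i in range(0, len(items), 5)]
        medians = [g[len(g) // 2] for g in groups]
        pivot = bfprt_select(medians, (len(medians) + 1) // 2)
        smaller = [x for x in items if x < pivot]
        equal = [x for x in items if x == pivot]
        if k <= len(smaller):
            items = smaller
        elif k <= len(smaller) + len(equal):
            return pivot
        else:
            # The answer is among the elements larger than the pivot.
            k -= len(smaller) + len(equal)
            items = [x for x in items if x > pivot]
```

The pivot always occurs in the current list, so each round discards at least one element and the loop terminates.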
Consider a sorting vs selection question for strings. Say $S = S[1..N]$ is a string. The suffix sorting problem is to sort the suffixes $S[i..N]$, $1 \le i \le N$, in lexicographic order. In the comparison model, we count the number of character comparisons. Suffix sorting can be performed with $O(N \log N)$ comparisons using a combination of character sorting and the classical data structures of suffix arrays or suffix trees [11, 9, 4]. There is a lower bound of $\Omega(N \log N)$ since sorting suffixes ends up sorting the characters. For the related suffix selection problem, where the goal is to output the $k$th lexicographically smallest suffix of $S$, the result in [6] recently gave an optimal $O(N)$ comparison-based algorithm, thereby showing that suffix selection is provably simpler than suffix sorting.
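To make the problem statement concrete, the following Python sketch solves suffix selection by brute force: it sorts all suffixes and picks the $k$th one. It only serves as a reference for checking faster methods, and is far from the bounds discussed here. The name `naive_suffix_select` and the test string are assumptions of this sketch.

```python
def naive_suffix_select(s, k):
    """Return the 1-based start position of the k-th lexicographically smallest suffix of s."""
    # Python orders a proper prefix before any of its extensions, which matches
    # the convention of an implicit endmarker smaller than every symbol.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return order[k - 1] + 1


# The suffixes of "banana" in sorted order start at positions 6, 4, 2, 1, 5, 3.
print([naive_suffix_select("banana", k) for k in range(1, 7)])
```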
The Model
Time-tested architectural approaches to computing systems provide two (or more) levels of memory: the highest one with a limited amount of fast memory; the lowest one with slow but large memory. The CPU can only access input stored on the fastest level. Thus, there is a continuous exchange of data between the levels. For cost and performance reasons, data is exchanged in fixed-size blocks of contiguous locations. These transfers may be triggered automatically, as in internal CPU caches, or explicitly, as in the case of disks; in either case, the number of block transfers required, rather than the number of computing operations executed, is the actual bottleneck.
Formally, we consider the model that has two memory levels. The cache level contains $M$ locations divided into blocks (or cache lines) of $B$ contiguous locations, and the main memory level can be arbitrarily large and is also divided into blocks. The processing unit can address the locations of the main memory but it can process only the data residing in cache. The algorithms that know and exploit the two parameters $M$ and $B$, and optimize the number of block transfers, are cache-aware. This model includes the classical External Memory model [1] as well as the well-known Ideal-Cache model [7].
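The cost measure can be explored with a small simulation. The Python sketch below counts block transfers for a sequence of accessed locations, given a cache of `M` locations split into blocks of `B` locations. The LRU replacement policy is an assumption of this sketch (the model itself does not prescribe it), as is the name `count_block_transfers`.

```python
from collections import OrderedDict


def count_block_transfers(addresses, M, B):
    """Count block loads for an access sequence, assuming LRU replacement."""
    capacity = M // B          # number of blocks that fit in cache
    cache = OrderedDict()      # block id -> True, ordered from least to most recently used
    transfers = 0
    for a in addresses:
        block = a // B
        if block in cache:
            cache.move_to_end(block)       # hit: refresh recency
        else:
            transfers += 1                 # miss: one block transfer
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return transfers


N, M, B = 1024, 64, 8
scan = range(N)                                            # sequential scan
strided = [j * B + i for i in range(B) for j in range(N // B)]  # same locations, block-hostile order
print(count_block_transfers(scan, M, B), count_block_transfers(strided, M, B))
```

Both sequences touch the same `N` locations, but the scan costs `N/B` transfers while the strided order misses on every access, which is the gap that cache-aware algorithms aim to avoid.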
Motivation
Suffix selection as a problem is useful in analyzing the order statistics of suffixes in a string such as the extremes, medians and outliers, with potential applications in bioinformatics and information retrieval. A quick method for finding say the suffixes of rank for each integer , , may be used to partition the space of suffixes for understanding the string better, load balancing and parallelization. But in these applications, such as in bioinformatics, the strings are truly massive and unlikely to fit in the fastest levels of memory. Therefore it is natural to analyze them in a hierarchical memory model.
Our primary motivation, however, is theoretical. Since the inception of the first block-based hierarchical memory model ([1], [10]), it has been difficult to obtain “golden standard” algorithms, i.e., those using just $O(N/B)$ block transfers. Even the simplest permutation problem (perm henceforth), where the output is a specified permutation of the input array, does not have such an algorithm. In the standard RAM model, perm can be solved in $O(N)$ time. In both the Ideal-Cache and External Memory models, the complexity of this problem is denoted $\mathit{perm}(N) = \Theta(\min\{N, (N/B)\log_{M/B}(N/B)\})$. Nearly any nontrivial problem one can imagine, from list ranking to graph problems such as Euler tours, DFS, and connected components, as well as sorting and geometric problems, has the lower bound of $\mathit{perm}(N)$, even if it takes $O(N)$ time in the RAM model, and therefore does not meet the “golden standard”. Thus the lower bound for perm is a terrible bottleneck for block-based hierarchical memory models.
The outstanding question is whether, much as in the comparison model, suffix selection is provably simpler than suffix sorting in the block-based hierarchical memory models. Suffix sorting takes $\Theta((N/B)\log_{M/B}(N/B))$ block transfers [5]. Proving any problem to be simpler than suffix sorting therefore requires one to essentially overcome the perm bottleneck.
Our Contribution
We present a suffix selection algorithm with optimal cache complexity. Our algorithm requires $\Theta(N/B)$ block transfers, for any string $S$ over an unbounded alphabet (where characters can only be compared) and under the common tall-cache assumption, that is, $M = \Omega(B^{1+\epsilon})$ with $\epsilon < 1$. Hence, we meet the “golden standard”; we beat the perm bottleneck and, consequently, prove that suffix selection is easier than suffix sorting in block-based hierarchical memory models.
Overview
Our high-level strategy for achieving an optimal cache-aware suffix selection algorithm consists of two main objectives.
In the first objective, we want to efficiently reduce the number of candidate suffixes from $N$ to $O(N/B)$, where we maintain the invariant that the wanted $k$th smallest suffix is surely one of the candidate suffixes.
In the second objective, we want to achieve a cache-optimal solution for the sparse suffix selection problem, where we are given a subset of $O(N/B)$ suffixes that includes the wanted $k$th suffix. To achieve this objective we first find a simpler approach to suffix selection for the standard comparison model. (The only known linear-time suffix selection algorithm for the comparison model [6] hinges on well-known algorithmic and data structural primitives whose solutions are inherently cache inefficient.) Then, we modify the simpler comparison-based suffix selection algorithm to exploit, in a cache-efficient way, the hypothesis that $O(N/B)$ (known) suffixes are the only plausible candidates.
Map of the paper. We will start by describing the new simple comparison-based suffix selection algorithm in the section “A Simple(r) Linear-Time Suffix Selection Algorithm”. That section is meant to be intuitive. We will use it to derive a cache-aware algorithm for the sparse suffix selection problem in the section “Cache-Aware Sparse Suffix Selection”. We will present our optimal cache-aware algorithm for the general suffix selection problem in the section “Optimal Cache-Aware Suffix Selection”.
A Simple(r) Linear-Time Suffix Selection Algorithm
We now describe a simple algorithm for selecting the $k$th lexicographically smallest suffix of $S$ in main memory. We give some intuition on the central notion of work, and some definitions and notation used in the algorithm. Next, we show how to perform the main iteration, called phase transition. Finally, we present the invariants that are maintained in each phase transition, and discuss the correctness and the complexity of our algorithm.
Notation and intuition
Consider the regular linear-time selection algorithm [2], hereafter called bfprt. Our algorithm for a string $S$ uses bfprt as a black box. (Footnote 1: In the following, we will assume that the last symbol in $S$ is an endmarker, smaller than any other symbol in $S$.) Each run of bfprt lets us discover a longer and longer prefix of the (unknown) $k$th lexicographically smallest suffix of $S$. We need to carefully orchestrate the several runs of bfprt to obtain a total cost of $O(N)$ time. We use , where , as an illustrative example, and show how to find the median suffix (hence, ).
Phases and phase transitions. We organize our computation so that it goes through phases, numbered and so on. In phase , we know that a certain string, denoted , is a prefix of the (unknown) th lexicographically smallest suffix of . Phase is the initial one: we just have the input string and no knowledge, i.e., is the empty string. For , a main iteration of our algorithm goes from phase to phase and is termed phase transition : it is built around the th run of bfprt on a suitable subset of the suffixes of . Note that , since we ensure that the condition holds, namely, each phase transition extends the known prefix by at least one symbol.
Phase transition $0 \rightarrow 1$. We start out with phase 0, where we run bfprt on the individual symbols of , and find the symbol of rank in (seen as a multiset). Hence we know that , and this fact has some implications on the set of suffixes of . Let denote the th suffix of , for , and be a special prefix of called work. We anticipate that the works play a fundamental role in attaining time. To complete the phase transition, we set for , and we call degenerate the works such that . (Note that degenerate works are only created in this phase transition.) We then partition the suffixes of into two disjoint sets:

The set of active suffixes, denoted by —they are those suffixes such that .

The set of inactive suffixes, denoted by and containing the rest of the suffixes: surely, none of them is the th lexicographically smallest suffix in .
In our example (), we have and, for , and . Also, we have for , where and are degenerate works.
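The first phase transition can be sketched in Python as follows. This is only an illustration under stated assumptions: `sorted(s)[k - 1]` stands in for a linear-time selection such as bfprt, positions are 0-based, the function name `phase_zero` is hypothetical, and the string "banana" is a toy input rather than the example used in the text.

```python
def phase_zero(s, k):
    """Find the symbol of rank k in s and split the suffixes by their first symbol.

    Returns (symbol, active, rank_in_active), where active lists the 0-based start
    positions of the suffixes whose first symbol equals the rank-k symbol, and
    rank_in_active is the rank of the wanted suffix among the active ones.
    """
    symbol = sorted(s)[k - 1]            # stand-in for a linear-time selection
    active = [i for i, c in enumerate(s) if c == symbol]
    smaller = sum(1 for c in s if c < symbol)   # suffixes already known to be smaller
    return symbol, active, k - smaller


print(phase_zero("banana", 5))
```

Suffixes starting with a smaller symbol are certainly smaller than the wanted suffix, so only their count needs to be carried forward.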
A comment is in order at this point. We can compare any two works in constant time, where the outcome of the comparison is ternary ($<$, $=$, $>$). While this observation is straightforward for this phase transition, we will be able to extend it to longer works in the subsequent transitions. Let us discuss the transition from phase 1 to phase 2 to introduce the reader to the main point of the algorithm.
Phase transition $1 \rightarrow 2$. If , we are done since there is only one active suffix and this should be the th smallest suffix in . Otherwise, we exploit the total order on the current works. Letting be the number of works smaller than the current prefix , our goal becomes finding the th smallest suffix in . In particular, we want a longer prefix and the new set .
To this end, we need to extend some of the works of the active suffixes in . Consider a suffix . In order to extend its work , we introduce its prospective work. Recall that . If (hence, is inactive in our terminology), the prospective work for is the concatenation , where . Otherwise, since (and so ), we consider , and so on, until we find the first such that (and so ). In the latter case, the prospective work for is the concatenation , where and their corresponding suffixes are active, while is different and corresponds to an inactive suffix.
In any case, each prospective work is a sequence of works of the form , where and . The reader should convince herself that any two prospective works can be compared in $O(1)$ time. We exploit this fact by running bfprt on the set of active suffixes and, whenever bfprt needs to compare any two , we compare their prospective works. Running time is therefore if we note that prospective works can be easily identified by a scan of : if is the prospective work for , then is the prospective work for , and so on. In other words, a consecutive run of prospective works forms a collision, which is informally a maximal concatenated sequence of works equal to terminated by a work different from (this notion will be described formally below, in the generic phase transition).
After bfprt completes its execution, we know the prospective work that is a prefix of the (unknown) th suffix in . That prospective work becomes and is made up of the suffixes in such that their prospective work equals (and we also set ).
In our example, , and so we look for the third smallest suffix in . We have the following prospective works: one collision is made up of , , and ; another collision is made up of , , , , and . Algorithm bfprt discovers that is the third prospective work among them, and so and (and ).
How to maintain the works. Now comes the key point in our algorithm. For each suffix , we update its work to be (whereas it was in the previous phase transition, so it is now longer). For each suffix , instead, we leave its work unchanged. Note this is the key point: although can share a longer prefix with , the algorithm bfprt has indirectly established that cannot have as a prefix, and we just need to record a Boolean value for , indicating if is either lexicographically smaller or larger than . We can stick to unchanged, and discard its prospective work, since becomes inactive and is added to . In our example, , while the other works are unchanged (i.e., while , while , and so on).
In this way, we can maintain a total order on the works. If two works are of equal length, we declare that they are equal according to the symbol comparisons that we have performed so far, unless they are degenerate, in which case they can be easily compared as single symbols. If two works are of different length, say , then has been discarded by bfprt in favor of in a certain phase, so we surely know which one is smaller or larger. In other words, when we declare two works to be equal, we have not yet gathered enough symbol comparisons to distinguish among their corresponding suffixes. Otherwise, we have been able to implicitly distinguish among their corresponding suffixes. In our example, because they are of different length and bfprt has established this inequality, while we declare that since they have the same length. Recall that the total order on the works is needed for comparing any two prospective works in $O(1)$ time as we proceed in the phase transitions. The works exhibit some other strong properties that we point out in the invariants for the phases, described below.
Time complexity. From the above discussion, we spend time for phase transition . We present a charging scheme to pay for that. Works come into play again for an amortized cost analysis. Suppose that, in phase , we initially assign each suffix two kinds of credits to be charged as follows: credits of the first kind when becomes inactive, and further credits of the second kind when is already inactive but its work becomes the terminator of the prospective work of an active suffix. Note that is encapsulated by the prospective work of that suffix (which survives and becomes part of ).
Now, when executing bfprt on as mentioned above, we have that at most one prospective work survives in each collision and the corresponding suffix becomes part of . We therefore charge the cost as follows. We take credits of the first kind from the active suffixes that become inactive at the end of the phase transition. We also take credits from the inactive suffixes whose work terminates the prospective work of the survivors. In our example, the credits are taken from , and , while credits are taken from and .
At this point, it should be clear that, in our example, the next phase transition looks for the th smallest suffix in by executing bfprt in time on the prospective works built with the runs of consecutive occurrences of the work in . We thus identify (with ) as the median suffix in .
Phase transition for
We are now ready to describe the generic phase transition more formally in terms of the active suffixes in and the inactive ones in , where .
The input for the phase transition is the following: (a) the current prefix of the (unknown) th lexicographically smallest suffix in ; (b) the set of currently active suffixes; (c) the number of suffixes in whose work is smaller than that of the suffixes in (hence, we have to find the th smallest suffix in ); and (d) a Boolean vector whose th element is false (resp., true) iff, for suffix , the algorithm bfprt has determined that its work is smaller (resp., larger) than . The output of the phase transition consists of data (a)–(d) above, updated for phase .
We now define collisions and prospective works in a formal way. We say that two suffixes collide if their works and are adjacent as substrings in , namely, . A collision is the maximal subsequence , such that , where the active suffixes and collide for any . For our algorithm, a collision can also be a degenerate sequence of just one active suffix (since its work does not collide with that of any other active suffix).
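The definition of collision translates into a single left-to-right scan. The Python sketch below assumes that, within a phase, all active works have the same length `work_len` (they all equal the current prefix) and that the active positions are given in increasing order. The names `find_collisions` and `prospective_shapes` are hypothetical. The second function records, for each active suffix, how many copies of the current prefix its prospective work contains and where the terminating work starts.

```python
def find_collisions(active, work_len):
    """Group sorted active positions into maximal runs of adjacent works."""
    collisions = []
    for pos in active:
        if collisions and pos == collisions[-1][-1] + work_len:
            collisions[-1].append(pos)   # this work starts right where the previous one ends
        else:
            collisions.append([pos])     # start a new (possibly degenerate) collision
    return collisions


def prospective_shapes(collisions, work_len):
    """Map each active position to (number of equal works, start of the terminating work)."""
    shapes = {}
    for run in collisions:
        terminator = run[-1] + work_len  # first position after the last work in the run
        for t, pos in enumerate(run):
            shapes[pos] = (len(run) - t, terminator)
    return shapes


runs = find_collisions([0, 2, 4, 7, 10, 12], work_len=2)
print(runs)
print(prospective_shapes(runs, work_len=2))
```

Representing a prospective work by this pair, rather than by its symbols, is what allows two prospective works to be compared without rescanning the string.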
The prospective work of a suffix , denoted by , is defined as follows. Consider the collision to which belongs. Suppose that is the th active suffix (from the left) in , that is, . Consider the suffix adjacent to (because of the definition of collision, must be an inactive suffix following ). We define the prospective work of , to be the string . Note that since their corresponding suffixes are all active, while is shorter. In other words, , with .
Lemma 0.1.
For any two suffixes , we can compare their prospective works and in $O(1)$ time.
We now give the steps for the phase transition. Note that we can maintain in monotone order of suffix position (i.e., implies that comes before in ).

Scan the active set and identify its collisions and the set containing all the suffixes such that immediately follows a collision. For any suffix in , determine its prospective work using the collisions and .

Apply algorithm bfprt to the set using the constant-time comparison as stated in Lemma 0.1. In this way, find the th lexicographically smallest prospective work , and the corresponding set of active suffixes whose prospective works match .

If , stop the computation and return the singleton as the th smallest suffix in .

If , set (and update accordingly).


For each : Let be its prospective work, where . Set its new work to be .

For each , leave its work unchanged and, as a byproduct of running bfprt in step 2, update position of the Boolean vector (d) given in input, so as to record the fact that is lexicographically smaller or larger than .
Lemma 0.2.
Executing phase transition with , requires time in the worst case.
Invariants for phase
Before proving the correctness and the complexity of our algorithm, we need to establish some invariants that are maintained through the phase transitions. We say that is maximal if there does not exist another suffix such that contains , namely, such that and . For any , the following invariants hold (where is trivially the set of all the suffixes):

(i) [prefixes]: and are prefixes of the (unknown) th smallest suffix of , and .

(ii) [works]: For any suffix , its work is either degenerate (a single mismatching symbol) or for a phase . Moreover, iff .

(iii) [comparing]: For any and , implies that we know whether or .

(iv) [nesting]: For any two suffixes and , their works and do not overlap (either they are disjoint or one is contained within the other). Namely, implies or .

(v) [covering]: The works of the active suffixes are all maximal and, together with the maximal works generated by the inactive suffixes, form a nonoverlapping covering of (i.e. , where and either , or and is maximal, for ).
Theorem 0.4.
The algorithm terminates in a phase , and returns the th lexicographically smallest suffix.
Theorem 0.5.
Our suffix selection algorithm requires time in the worst case.
This simpler suffix selection algorithm is still cache “unfriendly”. For example, it requires block transfers with a string with period length (if is a prefix of for some integer , then is a period of ).
Cache-Aware Sparse Suffix Selection
In the sparse suffix selection problem, along with the string and the rank of the suffix to retrieve, we are also given a set of suffixes such that and the th smallest suffix belongs to . We want to find the wanted suffix in block transfers using the ideas of the algorithm described in the section “A Simple(r) Linear-Time Suffix Selection Algorithm”.
Consider first a particular situation in which the suffixes are equally spaced, positions apart from each other. We can split into blocks of size , so that is conceptually a string of metacharacters and each suffix starts with a metacharacter. This is a fortunate situation since we can apply the algorithm described in the section “A Simple(r) Linear-Time Suffix Selection Algorithm” as is, and solve the problem in the claimed bound. The nontrivial case is when the suffixes can be in arbitrary positions.
Hence, we revisit the algorithm described in the section “A Simple(r) Linear-Time Suffix Selection Algorithm” to make it more cache efficient. Instead of trying to extend the work of an active suffix by just using the works of the following inactive suffixes, we try to batch these works in a sufficiently long segment, called reach. Intuitively, in a step similar to step 2 of that algorithm, we could first apply the bfprt algorithm to the set of reaches. Then, after we select a subset of equal reaches, and the corresponding subset of active suffixes, we could extend their works using their reaches. This could cause collisions between the suffixes, and they could be managed in a way similar to what we did there. This yields the notion of superphase transition.
Superphase transition
The purpose of a superphase is to group consecutive phases together, so that we maintain the same invariants as those defined in the section “A Simple(r) Linear-Time Suffix Selection Algorithm”. However, we need further concepts to describe the transition between superphases. We number the superphases according to the numbering of phases. We call a superphase if the first phase in it is (in the overall numbering of phases).
Reaches, pseudocollisions and prospective reaches. Consider a generic superphase . Recall that, by invariant (v) in the section “A Simple(r) Linear-Time Suffix Selection Algorithm”, the phase transitions maintain the string partitioned into maximal works. We need to define a way to access enough (but not too many) consecutive “lookahead” works following each active suffix, before running the superphase. Since some of these active suffixes will become inactive during the phases that form the superphase, we cannot prefetch too many such works (and we cannot predict which ones will be effectively needed). This idea of prefetching leads to the following notion.
For any active suffix , the reach of , denoted by , is the maximal sequence of consecutive works such that

(i) and ;

(ii) and are adjacent and, for , and are adjacent in ;

(iii) if is the leftmost active suffix in , then .
We call a reach full if in condition (iii), namely, we do not meet an active suffix while loading the reach. Since we know how to compare two works, we also know how to compare any two reaches , seen as sequences of works. We have the following.
Lemma 0.6.
For any two reaches and , such that , we have that cannot be a prefix of .
Using reaches, we may have to handle the collisions that occur in an arbitrary phase that is internal to the current superphase. We therefore introduce a notion of collision for reaches that is called pseudocollision because it does not necessarily imply a collision.
For any two reaches such that , we say that and pseudocollide if and the last work of is itself (not just equal to ). Thus, the last work of is active and equal to and . Certainly, the fact that and pseudocollide during a superphase does not necessarily imply that the works and collide in one of its phases. A pseudocollision is a maximal sequence such that and pseudocollide, for any . For our algorithm, a degenerate pseudocollision is a sequence of just one reach.
Let us consider an active suffix and the pseudocollision to which belongs. Let us suppose that the pseudocollision is (i.e. is the th reach). Also, let us consider the reach of the last work that appears in (by the definition of pseudocollision, we know that the last work of is equal to its first work, so is active and has a reach). The prospective reach of an active work , denoted by , is the sequence , where is the tail of and denotes the longest initial sequence of works that is common to both and . Analogously to prospective works, we can define a total order on the prospective reaches. The multiplicity of , denoted by , is (that is the number of reaches following in the pseudocollision plus ).
Lemma 0.7.
If the invariants for the phases hold for the current superphase then, for any two reaches and such that , we have that their prospective reaches and can be compared in time, provided we know the lengths of and .
Superphase transition . The transition from a superphase to the next superphase emulates what happens with phases in the algorithm of the section “A Simple(r) Linear-Time Suffix Selection Algorithm”, but using block transfers.

For each active suffix , we create a pointer to its reach .

We find the th lexicographically smallest reach using bfprt on the pointers to reaches created in the previous step. The sets is active and , is active and , and is active and are thus identified, and, for any , the length of . (Footnote 2: Given strings and , their longest common prefix is the longest string such that both and start with .) If , we stop and return , such that , as the th smallest suffix in .

For any , we compute its prospective reach .

We find the th lexicographically smallest prospective reach among the ones in , thus obtaining , , , and, for any , the length of . If , we stop and return , such that , as the th smallest suffix in .
Theorem 0.8.
The sparse suffix selection problem can be solved using block transfers in the worst case.
Optimal Cache-Aware Suffix Selection
The approach in the section “Cache-Aware Sparse Suffix Selection” does not work if the number of input active suffixes is . The process would cost block transfers (since it would take transitions to finally have active suffixes left). However, if we were able to find a set of suffixes such that one of them is the th smallest, we could solve the problem with block transfers using the algorithm in that section. In this section we show how to compute such a set .
Basically, we consider all the substrings of length of and we select a suitable set of pivot substrings that are roughly evenly spaced. Then, we find the pivot that is lexicographically “closest” to the wanted th and one of the following two situations arises:
We are able to infer that the th smallest suffix is strictly between two consecutive pivots (that is, its corresponding substring of characters is strictly greater than one pivot and strictly smaller than the next). In this case, we return all the suffixes that are contained between the two pivots.
We can identify the suffixes that have the first characters equal to those of the th smallest suffix. We show that, in case they are still in number, they must satisfy some periodicity property, so that we can reduce them to just with additional block transfers.
Finding pivots and the key suffixes
Let , for a suitable constant . We proceed with the following steps.
First. We sort the first substrings of length of (that is substrings , ,…, , ). Then we sort the second substrings of length and so forth until all the positions in have been considered. The product of this step is an array of pointers to the substrings of length of .
Second. We scan and we collect in an array of positions the pointers .
Third. We (multi)select from the pointers to the substrings (of length ) such that has rank among the substrings (pointed by the pointers) in . These are the pivots we were looking for. We store the (pointers to the) pivots in an array .
Fourth. We need to find the rightmost pivot such that the number of substrings (of length of ) lexicographically smaller than is less than (the rank of the wanted suffix). We cannot simply distribute all the substrings of length according to all the pivots in , because it would be too costly. Instead, we proceed with the following refining strategy.

From the pivots in we extract the group of equidistant pivots, where is a suitable constant, (i.e. the pivots , where ). Then, for any , we find out how many substrings of size are lexicographically smaller than . After that we find the rightmost pivot such that the number of substrings (of length ) smaller than is less than .

From the pivots in following we extract the group of equidistant pivots. Then, for any , we find out how many substrings of size are smaller than . After that we find the rightmost pivot such that the number of substrings smaller than is less than .
More generally:

Let be the pivots in following . Then, for any , we find out how many substrings of size are smaller than . After that we find the rightmost pivot such that the number of substrings smaller than is less than .
The pivot found in the last iteration is the pivot we are looking for in this step.
Fifth. We scan and compute the following two numbers: the number of substrings of length lexicographically smaller than ; the number of substrings equal to .
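The counting scan of the fifth step can be sketched in Python as below. This is only an illustration: `ell` stands for the substring length, `pivot` is passed as a string rather than a pointer, and substrings that run past the end of the input are simply truncated (which, in Python's ordering, behaves like padding with an endmarker smaller than every symbol). The function name `count_smaller_and_equal` is hypothetical.

```python
def count_smaller_and_equal(s, ell, pivot):
    """Scan s once and count the length-ell substrings smaller than, and equal to, pivot."""
    smaller = equal = 0
    for i in range(len(s)):
        sub = s[i:i + ell]      # truncated near the end of s
        if sub < pivot:
            smaller += 1
        elif sub == pivot:
            equal += 1
    return smaller, equal


# Substrings of length 2 of "banana": ba, an, na, an, na, a
print(count_smaller_and_equal("banana", 2, "an"))
```

The two counts are exactly what the sixth and seventh steps compare against the wanted rank to decide which case applies.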
Sixth. In this step we treat the following case: . More specifically, this implies that the wanted th smallest suffix has its prefix of characters equal to . We proceed as follows. We scan and gather in a contiguous zone (the indexes of) the suffixes of having their prefixes of characters equal to . In this case we have already found the key suffixes (whose indexes reside in ). Therefore the computation in this section ends here and we proceed to discard some of them (see the subsection “Discarding key suffixes”).
Seventh. In this step we treat the following remaining case: . In other words, in this case we know that the prefix of characters of the wanted th smallest suffix is (lexicographically) greater than and smaller than . Therefore, we scan and gather in a contiguous zone (the indexes of) the suffixes of having their prefix of characters greater than and smaller than . Since there are fewer than such suffixes (see Lemma 0.9 below), we have already found the set of sparse active suffixes (whose indexes reside in ) that will be processed in the section “Cache-Aware Sparse Suffix Selection”.
Lemma 0.9.
For any and , either the number of key suffixes found is , or their prefixes of characters are all the same.
Lemma 0.10.
Under the tallcache assumption, finding the key suffixes needs block transfers in the worst case.
Discarding key suffixes
Finally, let us show how to reduce the number of key suffixes gathered in the subsection “Finding pivots and the key suffixes” to so that we can pass them to the sparse suffix selection algorithm (section “Cache-Aware Sparse Suffix Selection”). Let us assume that the number of key suffixes is greater than .
The indexes of the key suffixes have been previously stored in an array . Clearly, the th smallest suffix is among the ones in . We also know the number of suffixes of that are lexicographically smaller than each suffix in . Finally, we know that there exists a string of length such that contains all and only the suffixes such that the prefix of length of is equal to (i.e. contains the indexes of all the occurrences of in ).
To achieve our goal we exploit the possible periodicity of the string . A string is a period of a string () if is a prefix of for some integer . The period of is the smallest of its periods. We exploit the following:
Property 1 ([8]).
If occurs in two positions and of and then has a period of length .
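As a small illustration of Property 1, the following sketch checks the property on a concrete string. It assumes the usual reading of the statement (two occurrences of a pattern whose starting positions differ by less than the pattern length force a period equal to that difference); the helper names `occurrences`, `has_period` and `smallest_period` are hypothetical.

```python
def occurrences(text, pattern):
    """Starting positions of `pattern` in `text`, in increasing order."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def has_period(s, p):
    """True if `s` has a period of length `p`, i.e. s[k] == s[k + p] wherever defined."""
    return 0 < p <= len(s) and all(s[k] == s[k + p] for k in range(len(s) - p))


def smallest_period(s):
    """Length of the shortest period of a non-empty string `s`."""
    return next(p for p in range(1, len(s) + 1) if has_period(s, p))
```

With the text `abababab` and the pattern `ababa`, the pattern occurs at positions 0 and 2; the two occurrences overlap, and the pattern indeed has a period of length 2.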
Let be the period of . Since the number of suffixes in is greater than , some of the occurrences of in must overlap. Therefore, by Property 1, we can conclude that . For the sake of presentation let us assume that is not a multiple of (the other case is analogous).
From how has been built (by left to right scanning of ) we know that the indexes in it are in increasing order, that is , for any (i.e. the indexes in follow the order, from left to right, in which the corresponding suffixes may be found in ). Let us consider a maximal subsequence of such that, for any , (i.e. the occurrence of in starting in position overlaps the one starting in position by at least positions). Clearly, any two of these subsequences of do not overlap and hence can be seen as the concatenation of these subsequences. From the definition of the partitioning of and from the periodicity of we have:
Lemma 0.11.
The following statements hold:

There are fewer than such subsequences.

For any , the substring (the substring of spanned by the substrings whose indexes are in ) has period .

The substring of length of starting in position (the substring starting one period length past the rightmost member of ) is not equal to .
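The partitioning of the sorted occurrence array into maximal subsequences of overlapping occurrences can be sketched as follows. The overlap threshold is expressed here as a hypothetical parameter `max_gap` (the largest allowed distance between consecutive indexes of the same subsequence); its exact value is the one fixed in the text, and this sketch does not try to reproduce it. The helper `boundary_index` is also an illustrative name: it returns the position one period length past the rightmost member of a subsequence, as used in the third statement of Lemma 0.11.

```python
def split_into_runs(positions, max_gap):
    """Split increasing `positions` into maximal runs in which
    consecutive entries differ by at most `max_gap`."""
    runs = []
    for pos in positions:
        if runs and pos - runs[-1][-1] <= max_gap:
            runs[-1].append(pos)   # still overlapping: extend the current run
        else:
            runs.append([pos])     # gap too large: start a new run
    return runs


def boundary_index(run, period_len):
    """Position one period length past the rightmost member of `run`."""
    return run[-1] + period_len
```

Because the input is scanned once from left to right and each index is appended to exactly one run, the runs are disjoint and their concatenation gives back the original array.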
For any key suffix , let us consider the following prefix: , where is the subsequence of to which (the index of) belongs. By Lemma 0.11, we know two things about : the prefix of length of has period ; the suffix of length of is not equal to .
In light of this, we associate with any key suffix a pair of integers defined as follows: is equal to the number of complete periods in the prefix of length of ; is equal to (that is, the index of the substring of length starting one period length past the rightmost member of ).
There is a natural total order that can be defined over the key suffixes. It is based on the pairs of integers and it is defined as follows. For any two key suffixes :

If then and are equal (according to ).

If then iff is lexicographically smaller than .
By Lemma 0.11, we know that the suffix of length of (which is the substring ) is not equal to . Therefore the total order is well defined.
We are now ready to describe the process for reducing the number of key suffixes. We proceed with the following steps.
First. By scanning and , we compute the pair for any key suffix . The pairs are stored in an array (of pairs of integers) .
Second. We scan and compute the array of positions defined as follows: for any , is equal to , or if is less than, equal to or greater than , respectively (the array tells us what is the result of the comparison of with any substring of size different from it).
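The Second step can be pictured with the sketch below, which compares every window of the text with the period string. The three result values are assumed here to be -1, 0 and 1 for "less than", "equal to" and "greater than"; the function name `compare_with_period` is hypothetical, and the sketch works in memory without modelling block transfers.

```python
def compare_with_period(text, period):
    """For each position i, compare the window of len(period) characters
    starting at i with `period`: -1 if smaller, 0 if equal, 1 if greater."""
    p = len(period)
    result = []
    for i in range(len(text) - p + 1):
        window = text[i:i + p]
        # Boolean arithmetic yields -1, 0 or 1.
        result.append((window > period) - (window < period))
    return result
```

Once this array is available, the outcome of comparing the period with the window at any given position is obtained by a single lookup, which is what the Third step relies on.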
Third. By scanning and at the same time, we compute the array of size , such that, for any , (where is the second member of the pair of integers in position of ).
Fourth. Using and , we select the th smallest key suffix and all the key suffixes equal to according to the total order (where is the number of suffixes of that are lexicographically smaller than each suffix in , known from the previous steps). The set of the selected key suffixes is the output of the process.
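The selection with ties performed in the Fourth step can be sketched as follows. The sketch assumes that each key suffix comes with a comparable key standing in for the total order defined above, and it uses sorting for brevity rather than a linear-time selection procedure; `select_with_ties`, `items`, `keys` and the 1-based `rank` are hypothetical names.

```python
def select_with_ties(items, keys, rank):
    """Return every item whose key equals the key of the `rank`-th smallest
    item (1-based), preserving the original left-to-right order."""
    order = sorted(range(len(items)), key=lambda j: keys[j])
    pivot = keys[order[rank - 1]]   # key of the rank-th smallest item
    return [items[j] for j in range(len(items)) if keys[j] == pivot]
```

Items whose keys differ from the selected one are discarded, so the output consists of a single equivalence class of the order.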
Lemma 0.12.
At the end of the discarding process, the selected key suffixes are less than in number and the th lexicographically smallest suffix is among them.
Lemma 0.13.
The discarding process requires block transfers in the worst case.
Theorem 0.14.
The suffix selection problem for a string defined over a general alphabet can be solved using block transfers in the worst case.
References
 [1] A. Aggarwal and J. Vitter. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988.
 [2] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest, and R. E. Tarjan. Time bounds for selection. J. Comput. System Sci., 7(4):448–461, 1973.
 [3] David Dobkin and J. Ian Munro. Optimal time minimal space selection algorithms. Journal of the ACM, 28(3):454–461, July 1981.
 [4] M. Farach. Optimal suffix tree construction with large alphabets. In Proc. 38th Annual Symp. on Foundations of Computer Science (FOCS), pages 137–143. IEEE, 1997.
 [5] M. Farach, P. Ferragina, and S. Muthukrishnan. Overcoming the memory bottleneck in suffix tree construction. In Proc. 39th Annual Symp. on Foundations of Computer Science (FOCS). IEEE, 1998.
 [6] G. Franceschini and S. Muthukrishnan. Optimal suffix selection. In Proceedings of the 39th ACM Symposium on Theory of Computing (STOC), 2007.
 [7] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. 40th Annual Symp. on Foundations of Computer Science (FOCS), pages 285–297. IEEE, 1999.
 [8] Z. Galil. Optimal parallel algorithms for string matching. Inf. Control, 67(1–3):144–157, 1985.
 [9] E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23(2):262–272, 1976.
 [10] J. Vitter. Algorithms and Data Structures for External Memory, 2007.
 [11] P. Weiner. Linear pattern matching algorithms. In Foundations of Computer Science (FOCS), 1973.