On the number of gapped repeats with arbitrary gap
Abstract
For any functions , from to we call repeats such that as gapped repeats. We study the possible number of gapped repeats in words of fixed length . For quite weak conditions on , we obtain an upper bound on this number which is linear in .
1 Introduction
Let $w = w[1]w[2]\ldots w[n]$ be an arbitrary word of length $|w| = n$. A fragment $w[i]\cdots w[j]$ of $w$, where $1 \le i \le j \le n$, is called a factor of $w$ and is denoted by $w[i..j]$. Note that this factor can be considered either as a word itself or as a fragment of $w$. So for factors we have two different notions of equality: factors can be equal as the same fragment of the word or as the same word. To avoid this ambiguity, we use two different notations: if two factors $u$ and $v$ of $w$ are the same word (the same fragment of $w$) we write $u = v$ ($u \equiv v$). For any $i = 1, \ldots, n$ the factor $w[1..i]$ ($w[i..n]$) is called a prefix (a suffix) of $w$. By positions in $w$ we mean the indices of the letters of the word $w$. For any factor $v \equiv w[i..j]$ of $w$ the positions $i$ and $j$ are called the start position and the end position of $v$ and are denoted by $\mathrm{beg}(v)$ and $\mathrm{end}(v)$ respectively. For any two factors $u$, $v$ of $w$ the factor $u$ is contained in $v$ if $\mathrm{beg}(v) \le \mathrm{beg}(u)$ and $\mathrm{end}(u) \le \mathrm{end}(v)$. If some word $u$ is equal to a factor $v$ of $w$ then $v$ is called an occurrence of $u$ in $w$.
We denote by $p(w)$ the minimal period of a word $w$ and by $e(w)$ the ratio $|w|/p(w)$, which is called the exponent of $w$. A word is called primitive if its exponent is not an integer greater than 1. A word is called periodic if its exponent is greater than or equal to 2. Occurrences of periodic words are called repetitions. Repetitions are fundamental objects, due to their primary importance in word combinatorics [21] as well as in various applications, such as string matching algorithms [12, 5], molecular biology [13], or text compression [22]. The simplest and best known examples of repetitions are factors of the form $uu$, where $u$ is a nonempty word. Such repetitions are called squares. A square $uu$ is called primitive if $u$ is primitive. The questions on the number of squares and on efficient searching for squares in words are well studied in the literature (see, e.g., [5, 4, 14]).
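The definitions of minimal period, exponent, primitive word and periodic word can be checked on small examples with a short script. The following is only an illustrative sketch: the function names are hypothetical and not taken from the paper, and the minimal period is computed through the classical border (prefix-function) array, using the fact that the minimal period of a word equals its length minus the length of its longest proper border.

```python
from fractions import Fraction


def minimal_period(word: str) -> int:
    """Return the smallest p >= 1 such that word[i] == word[i + p] for all valid i."""
    n = len(word)
    if n == 0:
        raise ValueError("the minimal period is defined for nonempty words only")
    # border[i] = length of the longest proper border of word[: i + 1]
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and word[i] != word[k]:
            k = border[k - 1]
        if word[i] == word[k]:
            k += 1
        border[i] = k
    # minimal period = length minus the longest proper border
    return n - border[-1]


def exponent(word: str) -> Fraction:
    """Exponent of a word: its length divided by its minimal period."""
    return Fraction(len(word), minimal_period(word))


def is_primitive(word: str) -> bool:
    """A word is primitive if its exponent is not an integer greater than 1."""
    e = exponent(word)
    return not (e.denominator == 1 and e > 1)


def is_periodic(word: str) -> bool:
    """A word is periodic if its exponent is at least 2."""
    return exponent(word) >= 2
```

For instance, under these definitions the word `abcabcab` has minimal period 3 and exponent 8/3, so it is both periodic and primitive, while the square `abab` is periodic but not primitive.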
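A repetition in a word is called maximal if this repetition cannot be extended to the left or to the right in the word by at least one letter while preserving its minimal period. More precisely, a repetition in $w$ that starts at position $i$, ends at position $j$ and has minimal period $p$ is called maximal if it satisfies the following conditions (here $w[k]$ is the $k$-th letter of $w$ and $n$ is the length of $w$):

if $i > 1$, then $w[i-1] \ne w[i-1+p]$,

if $j < n$, then $w[j+1-p] \ne w[j+1]$.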
Maximal repetitions are usually called runs in the literature. Since runs contain all the other repetitions in a word, the set of all runs can be considered as a compact encoding of all repetitions in the word which has many useful applications (see, for example, [7]). For any word we will denote by the sum of exponents of all maximal repetitions in . The following bound for is proved in [15].
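To make the notions of a run and of the sum of exponents of all runs concrete, here is a naive quadratic-time sketch. It is not an algorithm from the paper or from the cited literature. The function names are hypothetical, positions are 0-based (unlike the 1-based positions of the text), and each run is reported as a triple (start, end, minimal period) with inclusive end.

```python
from fractions import Fraction


def minimal_period(word: str) -> int:
    """Naive minimal period: smallest p with word[i] == word[i + p] for all valid i."""
    n = len(word)
    if n == 0:
        raise ValueError("the minimal period is defined for nonempty words only")
    for p in range(1, n + 1):
        if all(word[i] == word[i + p] for i in range(n - p)):
            return p


def maximal_repetitions(word: str) -> list:
    """List all runs of word as (start, end, minimal_period), 0-based, end inclusive."""
    n = len(word)
    runs = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if word[i] != word[i + p]:
                i += 1
                continue
            start = i
            # extend the block of matches word[i] == word[i + p] as far as possible
            while i + p < n and word[i] == word[i + p]:
                i += 1
            end = i + p - 1
            length = end - start + 1
            # keep the factor only if it is periodic and p is its minimal period
            if length >= 2 * p and minimal_period(word[start:end + 1]) == p:
                runs.append((start, end, p))
    return runs


def sum_of_exponents(word: str) -> Fraction:
    """Sum of the exponents of all maximal repetitions in word."""
    return sum(
        (Fraction(end - start + 1, p) for start, end, p in maximal_repetitions(word)),
        Fraction(0),
    )
```

On the word `mississippi` this sketch finds the three runs `ss`, `ss`, `pp` with minimal period 1 and the run `ississi` with minimal period 3, so the sum of exponents is 2 + 2 + 2 + 7/3 = 25/3.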
Theorem 1
for any .
A natural generalization of squares is factors of the form $uvu$, where $u$ and $v$ are nonempty words. We call such factors gapped repeats. In the gapped repeat $uvu$ the first (second) factor $u$ is called the left (right) copy, and $v$ is called the gap. By the period of this gapped repeat we mean the value $|u| + |v|$. For a gapped repeat $\sigma$ we denote the length of the copies of $\sigma$ by $c(\sigma)$ and the period of $\sigma$ by $p(\sigma)$ (see Fig. 1).
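By $(u', u'')$ we will denote the gapped repeat with the left copy $u'$ and the right copy $u''$. Note that gapped repeats may form the same segment but have different periods. Such repeats are considered as distinct, i.e. a gapped repeat is not specified by its start and end positions in the word because these positions are not sufficient for determining both copies and the gap of the repeat. Analogously to repetitions, a gapped repeat $(u', u'')$ in $w$, where $u'$ occupies positions $i', \ldots, j'$ and $u''$ occupies positions $i'', \ldots, j''$ of $w$, is called maximal if it satisfies the following conditions (here $w[k]$ is the $k$-th letter of $w$ and $n$ is the length of $w$):

if $i' > 1$, then $w[i'-1] \ne w[i''-1]$,

if $j'' < n$, then $w[j'+1] \ne w[j''+1]$.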
In other words, a gapped repeat in a word is maximal if its copies cannot be extended to the left or to the right in the word by at least one letter while preserving its period (see Fig. 2).
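The maximality condition can be illustrated by a brute-force enumeration. The sketch below is a naive quadratic-time illustration under the following assumptions: a gapped repeat is encoded by a hypothetical triple (start of the left copy, copy length, period) with 0-based positions, and the gap is required to be nonempty, i.e. the copy length is smaller than the period. For every period it scans maximal blocks of positions where letters at distance equal to the period coincide; such a block is, by construction, not extendable to the left or to the right.

```python
def maximal_gapped_repeats(word: str) -> list:
    """List maximal gapped repeats as (left_start, copy_length, period), 0-based."""
    n = len(word)
    repeats = []
    # a nonempty gap forces period > copy_length >= 1, hence period >= 2
    for period in range(2, n):
        i = 0
        while i + period < n:
            if word[i] != word[i + period]:
                i += 1
                continue
            start = i
            # maximal block of matches: cannot be extended left or right
            while i + period < n and word[i] == word[i + period]:
                i += 1
            copy_length = i - start
            # blocks with copy_length >= period have overlapping or adjacent copies
            if copy_length < period:
                repeats.append((start, copy_length, period))
    return repeats


def split_repeat(word: str, repeat: tuple) -> tuple:
    """Return (left copy, gap, right copy) of a repeat produced above."""
    start, copy_length, period = repeat
    left = word[start:start + copy_length]
    gap = word[start + copy_length:start + period]
    right = word[start + period:start + period + copy_length]
    return left, gap, right
```

For example, in the word `abcab` the only maximal gapped repeat has left copy `ab`, gap `c` and right copy `ab`, whereas the square `abab` yields no gapped repeat because its copies are adjacent.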
Let , be functions from to such that for any . We call a gapped repeat gapped repeat if . To our knowledge, maximal gapped repeats were first investigated in [3], where it was shown that for functions , computable in constant time, all maximal gapped repeats can be found in a word of length with time complexity where is the size of the output. An algorithm for finding in a word all gapped repeats with a fixed gap length in time where is the gap length, is the word length, and is the size of the output was proposed in [16]. gapped repeats are a natural generalization of gapped repeats such that for some . Such gapped repeats, which can be considered as a particular case of gapped repeats for and , are called gapped repeats. The notion of gapped repeats was introduced in [19], where it was proved that the number of maximal gapped repeats in a word of length is bounded by and all maximal gapped repeats can be found in time for the case of an integer alphabet. A new approach to computing gapped repeats was proposed in [10], where it was shown that the longest gapped repeat in a word of length over an integer alphabet can be found in time. In [23] an algorithm using an approach previously introduced in [1] is proposed for finding all maximal gapped repeats in time where is the output size, for a constant-size alphabet. Finally, in [9, 11] an asymptotically tight bound on the number of maximal gapped repeats in a word of length was independently proved and, moreover, algorithms for finding all maximal gapped repeats in time were proposed.
For any real denote
Let be a function from to . For each denote and . Denote also () by () if this supremum exists. Let , be two functions from to such that for any . If both values and exist, denote by . If both values and exist, denote by . Let either or exist. Then we define as if both values , exist; otherwise we define as whichever of the values , exists. Denote also for each and if this supremum exists. In this paper we prove the bound on the number of maximal gapped repeats in a word of length .
2 Auxiliary definitions and results
In what follows we consider an arbitrary word of length . We will use the following quite evident fact on maximal repetitions (see, e.g., [17, Lemma 8.1.3]).
Lemma 1
Two distinct maximal repetitions with the same minimal period cannot have an overlap of length greater than or equal to .
The following fact is also not difficult to prove (see, e.g., [20, Proposition 1]).
Proposition 1
If a square is primitive, for any two distinct occurrences and of in the inequality holds.
Since any repetition contains as a prefix a primitive square with the period , Proposition 1 easily implies
Corollary 1
For any two distinct occurrences and of the same repetition in the inequality holds.
To obtain our bound on the number of considered repeats, we use the following classification of maximal gapped repeats introduced in [20]. We say that a maximal gapped repeat is periodic if the copies of this repeat are repetitions. The set of all periodic gapped repeats in the word is denoted by . A maximal gapped repeat is called prefix (suffix) semiperiodic if the copies of this repeat are not repetitions, but these copies have a periodic prefix (suffix) whose length is not less than half of the copy length. The longest periodic prefix in a copy of a prefix semiperiodic repeat is called the periodic prefix of this copy. The set of all prefix (suffix) semiperiodic gapped repeats in the word is denoted by (). A maximal gapped repeat is called semiperiodic if it is either prefix or suffix semiperiodic. The set of all semiperiodic gapped repeats in the word is denoted by . Maximal gapped repeats which are neither periodic nor semiperiodic are called ordinary. The set of all ordinary gapped repeats in the word is denoted by .
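Since this classification depends only on the word that forms the copies, it can be sketched as a function of a single copy. The code below is a brute-force illustration, not part of the paper: the function names and the returned labels are hypothetical, and a copy that has both a sufficiently long periodic prefix and a sufficiently long periodic suffix receives both semiperiodic labels.

```python
def minimal_period(word: str) -> int:
    """Naive minimal period: smallest p with word[i] == word[i + p] for all valid i."""
    n = len(word)
    if n == 0:
        raise ValueError("the minimal period is defined for nonempty words only")
    for p in range(1, n + 1):
        if all(word[i] == word[i + p] for i in range(n - p)):
            return p


def is_periodic(word: str) -> bool:
    """A word is periodic if its exponent (length / minimal period) is at least 2."""
    return len(word) >= 2 * minimal_period(word)


def classify_copy(copy: str) -> set:
    """Classify a maximal gapped repeat by the word forming its copies."""
    n = len(copy)
    if is_periodic(copy):
        return {"periodic"}
    half = (n + 1) // 2  # smallest integer length that is at least n / 2
    labels = set()
    # proper prefixes and suffixes of length at least half of the copy length
    if any(is_periodic(copy[:k]) for k in range(half, n)):
        labels.add("prefix semiperiodic")
    if any(is_periodic(copy[n - k:]) for k in range(half, n)):
        labels.add("suffix semiperiodic")
    return labels or {"ordinary"}
```

For example, a repeat with copies `aaab` is prefix semiperiodic because the prefix `aaa` is periodic and covers more than half of the copy, while a repeat with copies `abc` is ordinary.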
3 Estimation of maximal gapped repeats
In what follows we assume that both values , exist. First we estimate the number of periodic maximal gapped repeats in . Let be a repeat from . Then both copies , of are repetitions in which are extended respectively to some maximal repetitions , with the same minimal period in . If and are the same repetition then we call a private repeat. Otherwise is called nonprivate. The following bound on the number of private repeats is proved in [20, Corollary 4].
Proposition 2
The number of private repeats in is .
Let be a nonprivate repeat from , i.e. and are distinct repetitions. We will say that is generated from left by (generated from right by ) if (). We will say also that is generated by a repetition if is generated from left or from right by . Let be the minimal period of and . Note that if and then
which contradicts the fact that is maximal. So either or . In an analogous way we have that either or . Let be generated by the repetition (). If and (or and ) then, using the above observations, we have (), so () which contradicts the fact that is generated by (). Thus only the following three cases are possible.

and ( and );

and ( and );

().
We will say that is prefixly generated by () in case 1, suffixly generated by () in case 2, and totally generated by () in case 3. We denote the sets of all prefixly generated, suffixly generated and totally generated repeats from by , and respectively. Thus, any nonprivate repeat from belongs to one of the sets , , . We estimate separately the numbers of repeats in these sets.
Lemma 2
Any maximal repetition in generates repeats from .
Proof. Note that for any two repeats , from which are totally generated by from right the restrictions
hold. Moreover, by Corollary 1, we have . Thus, the number of such repeats cannot be greater than
In an analogous way we obtain that the number of repeats from which are totally generated by from left is also not greater than , so generates no more than repeats from .
Corollary 2
.
Now we estimate . Let be a repetition in which prefixly generates some repeat from , i.e. one of the copies of is contained in , and the other copy is contained in another repetition with the same minimal period. Further we will say that is generated by with the repetition .
Proposition 3
Let be two repetitions in . Then prefixly generates with less than repeats.
Proof. We consider the case , i.e. the case of repeats generated by with from left (the case is considered analogously). Let , be two such repeats, i.e. , are prefixes of and , are suffixes of . Note that in this case repeats , are uniquely defined by the respective positions , . Since , are repetitions with the minimal period , i.e. , we note that
i.e. and are equal primitive squares with the period . Hence, by Proposition 1, we have . Moreover, since , are prefixes of , the restrictions
hold. Thus the number of considered repeats is bounded by
Let be a repeat from prefixly generated from left by a maximal repetition , and be the gap of . Note that
So, since , we have
Thus, taking into account , we obtain
We summarize this fact, denoting by and by .
Proposition 4
Let be a repeat from prefixly generated from left by a maximal repetition . Then
For any maximal repetition in denote by the set of all maximal repetitions such that prefixly generates from left with at least one repeat from . Note that all repetitions from have the minimal period .
Proposition 5
Let be a maximal repetition in . Then for any repetition from except, perhaps, one repetition the conditions
hold.
Proof. Let be a repetition from , and be a repeat prefixly generated from left by with . By Proposition 4 we obtain , so since is contained in . Thus for any repetition from we have . Now let , be two different repetitions from such that , and , be repeats prefixly generated from left by with and respectively. Without loss of generality we assume that , so . On the other hand, by Proposition 4 we have , so . Thus is contained in , so is contained in the overlap of and . Recall that is a repetition with the minimal period , so . Therefore, and have the overlap of length greater than which contradicts Lemma 1. Thus no more than one repetition from can satisfy the inequality .
Proposition 6
For any maximal repetition in the bound
holds.
Proof. Let be such that and , and be such that and . Then we have
Without loss of generality we will assume that (the case of is considered analogously). We consider separately two possible cases for the value .
a) Let . Then
where
and
Thus , so
b) Let . Then
where
and
Thus , so
Corollary 3
For any maximal repetition in the bound is valid.
Proof. Let be two maximal repetitions from such that . Since and have the same minimal period and by the definition of , by Lemma 1 we note that the overlap of and is less than , so
Therefore, we conclude that the number of repetitions from such that
is not greater than which is less than
Lemma 3
The number of repeats from generated from left is .
Proof. It follows immediately from Proposition 3 and Corollary 3 that any maximal repetition in generates from left repeats from , so the bound in the lemma statement follows from Theorem 1.
In an analogous way we can prove that the number of repeats from generated from right is also . So we obtain the following bound on .
Corollary 4
.
In an analogous way we can also prove that . Thus, using Proposition 2 and Corollaries 2 and 4, we obtain the following bound on .
Corollary 5
.
Now we estimate the number of semiperiodic maximal gapped repeats in . Let be a repeat from . Denote by () the periodic prefixes of (). Note that these prefixes are extended respectively to some distinct maximal repetitions , with the same minimal period such that , . We will say that is generated from left by with (generated from right by with ) if (). Let be generated from left by with . Note that
where and . Note also that . Therefore,
and
Denoting by and by , we obtain the following fact.
Proposition 7
Let be a repeat from generated from left by a maximal repetition . Then
For any maximal repetition in denote by the set of all maximal repetitions such that generates from left with at least one repeat from . Note that all repetitions from have the minimal period . Analogously to Proposition 5, we can prove the following statement.
Proposition 8
Let be a maximal repetition in . Then for any repetition from except, perhaps, one repetition the conditions
hold.
Proposition 9
For any maximal repetition in the bound
holds.
Proof. Let , , , be such that and . Then we have
Without loss of generality we will assume that (the case of is considered analogously). We consider separately two possible cases for the value .
a) Let . Then
where
and
Thus