# On the number of gapped repeats with arbitrary gap

Roman Kolpakov
Lomonosov Moscow State University,
Dorodnicyn Computing Centre FRC CSC RAS,
Moscow, Russia
Email: foroman@mail.ru
###### Abstract

For any functions $f(x)$, $g(x)$ from $\mathbb{N}$ to $\mathbb{R}$ we call repeats $uvu$ such that $g(|u|)\le |v|\le f(|u|)$ $f,g$-gapped repeats. We study the possible number of $f,g$-gapped repeats in words of fixed length $n$. Under quite weak conditions on $f(x)$, $g(x)$, we obtain an upper bound on this number which is linear in $n$.

## 1 Introduction

Let $w=w[1]w[2]\ldots w[n]$ be an arbitrary word of length $|w|=n$. A fragment $w[i]\cdots w[j]$ of $w$, where $1\le i\le j\le n$, is called a factor of $w$ and is denoted by $w[i..j]$. Note that this factor can be considered either as a word itself or as a fragment of $w$. So for factors we have two different notions of equality: factors can be equal as the same fragment of the word $w$ or as the same word. To avoid this ambiguity, we use two different notations: if two factors $u$ and $v$ of $w$ are the same word (the same fragment of $w$) we will write $u=v$ ($u\equiv v$). For any $i=1,\ldots,n$ the factor $w[1..i]$ ($w[i..n]$) is called a prefix (a suffix) of $w$. By positions in $w$ we mean the order numbers of letters of the word $w$. For any factor $v\equiv w[i..j]$ of $w$ the positions $i$ and $j$ are called the start position of $v$ and the end position of $v$ and are denoted by $\mathrm{beg}(v)$ and $\mathrm{end}(v)$ respectively. For any two factors $u$, $v$ of $w$ the factor $u$ is contained in $v$ if $\mathrm{beg}(v)\le \mathrm{beg}(u)$ and $\mathrm{end}(u)\le \mathrm{end}(v)$. If some word $u$ is equal to a factor $v$ of $w$ then $v$ is called an occurrence of $u$ in $w$.

We denote by $p(w)$ the minimal period of a word $w$ and by $e(w)$ the ratio $|w|/p(w)$, which is called the exponent of $w$. A word is called primitive if its exponent is not an integer greater than 1. A word is called periodic if its exponent is greater than or equal to 2. Occurrences of periodic words are called repetitions. Repetitions are fundamental objects, due to their primary importance in word combinatorics [21] as well as in various applications, such as string matching algorithms [12, 5], molecular biology [13], or text compression [22]. The simplest and best known examples of repetitions are factors of the form $uu$, where $u$ is a nonempty word. Such repetitions are called squares. A square $uu$ is called primitive if $u$ is primitive. The questions on the number of squares and on efficient searching for squares in words are well studied in the literature (see, e.g., [5, 4, 14]).
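As a small illustration of these definitions, the following Python sketch computes the minimal period and the exponent of a word and tests primitivity and periodicity. The function names are ours and do not come from the paper; the minimal period is obtained from the longest border via the standard prefix-function computation, which is only one of several possible ways to do it.

```python
from fractions import Fraction


def minimal_period(word: str) -> int:
    """Smallest p >= 1 such that word[i] == word[i + p] for all valid i."""
    n = len(word)
    if n == 0:
        raise ValueError("the empty word has no period")
    # Prefix function: border[i] is the length of the longest proper border
    # of word[:i + 1]. The minimal period equals n minus the longest border.
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and word[i] != word[k]:
            k = border[k - 1]
        if word[i] == word[k]:
            k += 1
        border[i] = k
    return n - border[-1]


def exponent(word: str) -> Fraction:
    """Ratio of the word length to its minimal period, kept exact."""
    return Fraction(len(word), minimal_period(word))


def is_primitive(word: str) -> bool:
    """Primitive: the exponent is not an integer greater than 1."""
    e = exponent(word)
    return not (e.denominator == 1 and e > 1)


def is_periodic(word: str) -> bool:
    """Periodic: the exponent is at least 2."""
    return exponent(word) >= 2
```

For instance, `exponent("abaab")` is `Fraction(5, 3)`, so `abaab` is primitive but not periodic, while `abab` has exponent 2 and is periodic but not primitive.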

A repetition in a word is called maximal if it cannot be extended to the left or to the right in the word by at least one letter while preserving its minimal period. More precisely, a repetition $r\equiv w[i..j]$ in $w$ is called maximal if it satisfies the following conditions:

1. if $i>1$, then $w[i-1]\ne w[i-1+p(r)]$,

2. if $j<n$, then $w[j+1-p(r)]\ne w[j+1]$.

Maximal repetitions are usually called runs in the literature. Since runs contain all the other repetitions in a word, the set of all runs can be considered as a compact encoding of all repetitions in the word, which has many useful applications (see, for example, [7]). For any word $w$ we will denote by $E(w)$ the sum of exponents of all maximal repetitions in $w$. The following bound for $E(w)$ is proved in [15].

###### Theorem 1

$E(w)=O(n)$ for any $w$.

More precise upper bounds on $E(w)$ were obtained in [6, 8, 2].
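To make the notions of a maximal repetition and of the sum of exponents concrete, here is a brute-force Python sketch. It is not the linear-time machinery behind the cited bounds: it simply scans, for every candidate period, the maximal blocks of positions that match at that shift. The function names and the 0-based `(start, end, period)` encoding are our own assumptions.

```python
from fractions import Fraction


def naive_min_period(s: str) -> int:
    """Minimal period of a nonempty string by direct checking."""
    return next(p for p in range(1, len(s) + 1)
                if all(s[k] == s[k + p] for k in range(len(s) - p)))


def maximal_repetitions(w: str) -> set:
    """All runs of w as (start, end, period), 0-based and inclusive."""
    n = len(w)
    runs = set()
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if w[i] != w[i + p]:
                i += 1
                continue
            # Extend the block of positions k with w[k] == w[k + p].
            j = i
            while j + p < n and w[j] == w[j + p]:
                j += 1
            start, end = i, j - 1 + p
            segment = w[start:end + 1]
            # Keep the segment only if it is periodic (length >= 2p)
            # and p really is its minimal period.
            if len(segment) >= 2 * p and naive_min_period(segment) == p:
                runs.add((start, end, p))
            i = j + 1
    return runs


def sum_of_exponents(w: str) -> Fraction:
    """Sum of exponents of all maximal repetitions in w."""
    return sum((Fraction(end - start + 1, p)
                for start, end, p in maximal_repetitions(w)), Fraction(0))
```

On `mississippi` this finds four runs, `ss` twice, `pp`, and `ississi` with minimal period 3, and their exponents sum to 25/3.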

A natural generalization of squares is factors of the form $uvu$ where $u$ and $v$ are nonempty words. We call such factors gapped repeats. In the gapped repeat $uvu$ the first (second) factor $u$ is called the left (right) copy, and $v$ is called the gap. By the period of this gapped repeat we will mean the value $|u|+|v|$. For a gapped repeat $\sigma$ we denote the length of copies of $\sigma$ by $c(\sigma)$ and the period of $\sigma$ by $p(\sigma)$ (see Fig. 1).

By $(u',u'')$ we will denote the gapped repeat with the left copy $u'$ and the right copy $u''$. Note that gapped repeats may form the same segment but have different periods. Such repeats are considered as distinct, i.e. a gapped repeat is not specified by its start and end positions in the word, because these positions are not sufficient for determining both copies and the gap of the repeat. Analogously to repetitions, a gapped repeat $(u',u'')$ in $w$ is called maximal if it satisfies the following conditions:

1. if $\mathrm{beg}(u')>1$, then $w[\mathrm{beg}(u')-1]\ne w[\mathrm{beg}(u'')-1]$,

2. if $\mathrm{end}(u'')<n$, then $w[\mathrm{end}(u')+1]\ne w[\mathrm{end}(u'')+1]$.

In other words, a gapped repeat in a word is maximal if its copies cannot be extended to the left or to the right in the word by at least one letter while preserving its period (see Fig. 2).
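The maximality condition can be illustrated by a brute-force enumeration: for every candidate period, each maximal block of positions that match at that shift yields a maximal gapped repeat, provided the block is shorter than the period so that the gap is nonempty. The sketch below is only an illustration with quadratic running time, not one of the algorithms cited in this paper. The function name, the 0-based `(start of left copy, copy length, period)` encoding, and the optional filter functions `f` and `g` (bounding the gap length from above and below as functions of the copy length) are our own assumptions.

```python
from typing import Callable, List, Optional, Tuple


def maximal_gapped_repeats(
    w: str,
    f: Optional[Callable[[int], float]] = None,
    g: Optional[Callable[[int], float]] = None,
) -> List[Tuple[int, int, int]]:
    """Maximal gapped repeats of w as (start of left copy, copy length, period).

    If both f and g are given, only repeats whose gap length lies between
    g(copy length) and f(copy length) are returned.
    """
    n = len(w)
    found = []
    for q in range(2, n):  # q is the period: copy length plus gap length
        i = 0
        while i + q < n:
            if w[i] != w[i + q]:
                i += 1
                continue
            # Maximal block of positions k with w[k] == w[k + q].
            j = i
            while j + q < n and w[j] == w[j + q]:
                j += 1
            copy_len = j - i
            gap = q - copy_len
            if gap >= 1:  # the gap must be nonempty
                if f is None or g is None or g(copy_len) <= gap <= f(copy_len):
                    found.append((i, copy_len, q))
            i = j + 1
    return found
```

For the word `axaya` the sketch reports three maximal gapped repeats with copy `a`; with `f = lambda c: c` and `g = lambda c: 1` only the two whose gap has length 1 remain.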

Let , be functions from to such that for any . We call a gapped repeat -gapped repeat if . To our knowledge, maximal -gapped repeats were first investigated in [3], where it was shown that for functions , computable in constant time, all maximal -gapped repeats can be found in a word of length  with time complexity where is the size of output. An algorithm for finding in a word all gapped repeats with a fixed gap length in time where is the gap length, is the word length, and is the size of output was proposed in [16]. -gapped repeats are a natural generalization of gapped repeats  such that for some . Such gapped repeats, which can be considered as a particular case of -gapped repeats for and , are called -gapped repeats. The notion of -gapped repeats was introduced in [19], where it was proved that the number of maximal -gapped repeats in a word of length  is bounded by and all maximal -gapped repeats can be found in time for the case of integer alphabet. A new approach to computing -gapped repeats was proposed in [10], where it was shown that the longest -gapped repeat in a word of length  over an integer alphabet can be found in time. In [23] an algorithm using an approach previously introduced in [1] is proposed for finding all maximal -gapped repeats in time where is the output size, for a constant-size alphabet. Finally, in [9, 11] an asymptotically tight bound on the number of maximal -gapped repeats in a word of length  was independently proved and, moreover, algorithms for finding all maximal -gapped repeats in time were proposed.

For any real  denote

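$$|x|^{+}=\begin{cases}x, & \text{if } x>0;\\ 0, & \text{otherwise;}\end{cases}\qquad |x|^{-}=\begin{cases}-x, & \text{if } x<0;\\ 0, & \text{otherwise.}\end{cases}$$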
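The two quantities just defined are the positive and the negative part of a real number. A minimal Python sketch, with function names of our own choosing, makes the definition and two elementary identities explicit.

```python
def pos_part(x: float) -> float:
    """Positive part: x if x > 0, otherwise 0."""
    return x if x > 0 else 0


def neg_part(x: float) -> float:
    """Negative part: -x if x < 0, otherwise 0."""
    return -x if x < 0 else 0
```

Every real `x` satisfies `x == pos_part(x) - neg_part(x)` and `abs(x) == pos_part(x) + neg_part(x)`.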

Let be a function from to . For each denote and . Denote also () by () if this supremum exists. Let , be two functions from to such that for any . If both values and exist, denote by . If both values and exist, denote by . Let either or exist. Then we define as if both values , exist; otherwise we define as the existing one of the values , . Denote also for each and if this supremum exists. In the paper we prove bound on the number of maximal -gapped repeats in a word of length .

## 2 Auxiliary definitions and results

Further we will consider an arbitrary word $w$ of length $n$. We will use the following quite evident fact on maximal repetitions (see, e.g., [17, Lemma 8.1.3]).

###### Lemma 1

Two distinct maximal repetitions with the same minimal period $p$ cannot have an overlap of length greater than or equal to $p$.

It is also not difficult to prove the following fact (see, e.g., [20, Proposition 1]).

###### Proposition 1

If a square $uu$ is primitive, then for any two distinct occurrences $v'$ and $v''$ of $uu$ in $w$ the inequality $|\mathrm{beg}(v')-\mathrm{beg}(v'')|\ge |u|$ holds.

Since any repetition $r$ contains as a prefix a primitive square with the period $p(r)$, Proposition 1 easily implies

###### Corollary 1

For any two distinct occurrences $r'$ and $r''$ of the same repetition $r$ in $w$ the inequality $|\mathrm{beg}(r')-\mathrm{beg}(r'')|\ge p(r)$ holds.
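The spacing property behind Proposition 1 and Corollary 1 can be checked empirically. The sketch below assumes the reading that the start positions of two distinct occurrences of a primitive square $uu$ differ by at least $|u|$; this reading is our interpretation of the statement, and the helper names are hypothetical. It also shows that the bound can fail when $u$ is not primitive.

```python
def is_primitive_word(u: str) -> bool:
    """u is primitive iff it is not a proper power of a shorter word."""
    return (u + u).find(u, 1) == len(u)


def occurrences(pattern: str, word: str) -> list:
    """Start positions (0-based) of all, possibly overlapping, occurrences."""
    return [i for i in range(len(word) - len(pattern) + 1)
            if word.startswith(pattern, i)]


def min_spacing(positions: list) -> int:
    """Smallest distance between consecutive positions of a sorted list."""
    return min(b - a for a, b in zip(positions, positions[1:]))
```

In `abababab` the primitive square `abab` occurs at positions 0, 2 and 4, so consecutive occurrences are exactly `len("ab")` apart. In `aaaaaa` the non-primitive square `aaaa` occurs at 0, 1 and 2, closer than `len("aa")`.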

To obtain our bound on the number of considered repeats, we use the following classification of maximal gapped repeats introduced in [20]. We say that a maximal gapped repeat is periodic if the copies of this repeat are repetitions. The set of all periodic -gapped repeats in the word  is denoted by . A maximal gapped repeat is called prefix (suffix) semiperiodic if the copies of this repeat are not repetitions, but these copies have a periodic prefix (suffix) whose length is not less than half of the copy length. The longest periodic prefix in a copy of a prefix semiperiodic repeat is called the periodic prefix of this copy. The set of all prefix (suffix) semiperiodic -gapped repeats in the word  is denoted by  (). A maximal gapped repeat is called semiperiodic if it is either prefix or suffix semiperiodic. The set of all semiperiodic -gapped repeats in the word  is denoted by . Maximal gapped repeats which are neither periodic nor semiperiodic are called ordinary. The set of all ordinary -gapped repeats in the word  is denoted by .
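Since both copies of a gapped repeat are the same word, the classification above depends only on that word. The following Python sketch classifies a copy accordingly; it is a direct, unoptimized reading of the definitions, with function names and string labels of our own choosing. A copy may be prefix and suffix semiperiodic at once, so a set of labels is returned.

```python
def min_period(s: str) -> int:
    """Minimal period of a nonempty string by direct checking."""
    return next(p for p in range(1, len(s) + 1)
                if all(s[k] == s[k + p] for k in range(len(s) - p)))


def is_repetition(s: str) -> bool:
    """A word is periodic if its length is at least twice its minimal period."""
    return len(s) >= 2 * min_period(s)


def classify_copy(u: str) -> set:
    """Labels for a nonempty copy u of a maximal gapped repeat."""
    n = len(u)
    if is_repetition(u):
        return {"periodic"}
    half = (n + 1) // 2  # smallest integer that is at least n / 2
    kinds = set()
    # Proper prefixes and suffixes of length at least half of the copy.
    if any(is_repetition(u[:k]) for k in range(half, n)):
        kinds.add("prefix semiperiodic")
    if any(is_repetition(u[n - k:]) for k in range(half, n)):
        kinds.add("suffix semiperiodic")
    return kinds or {"ordinary"}
```

For example, `classify_copy("ababc")` returns `{"prefix semiperiodic"}` because the prefix `abab` is periodic and covers at least half of the copy, while `classify_copy("abc")` returns `{"ordinary"}`.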

## 3 Estimation of maximal f,g-gapped repeats

Further we assume that both values , exist. First we estimate the number of periodic maximal -gapped repeats in . Let be a repeat from . Then both copies , of are repetitions in  which are extended respectively to some maximal repetitions , with the same minimal period in . If and are the same repetition  then we call a private repeat. Otherwise is called non-private. The following bound on the number of private repeats is proved in [20, Corollary 4].

###### Proposition 2

The number of private repeats in is .

Let be a non-private repeat from , i.e. and are distinct repetitions. We will say that is generated from left by  (generated from right by ) if (). We will say also that is generated by a repetition  if is generated from left or from right by . Let be the minimal period of and . Note that if and then

$$w[\mathrm{beg}(u')-1]=w[\mathrm{beg}(u')+p-1]=w[\mathrm{beg}(u'')+p-1]=w[\mathrm{beg}(u'')-1]$$

which contradicts that is maximal. So either or . In an analogous way we have that either or . Let be generated by the repetition (). If and (or and ) then, using the above observations, we have (), so () which contradicts that is generated by (). Thus only the following three cases are possible.

1. and ( and );

2. and ( and );

3. ().

We will say that is prefixly generated by () in case 1, suffixly generated by () in case 2, and totally generated by () in case 3. We denote the sets of all prefixly generated, suffixly generated and totally generated repeats from by , and respectively. Thus, any non-private repeat from belongs to one of the sets , , . We estimate separately the numbers of repeats in these sets.

###### Lemma 2

Any maximal repetition  in  generates repeats from .

Proof. Note that for any two repeats , from which are totally generated by  from right the restrictions

$$\mathrm{end}(r)+g(|r|)+1\le \mathrm{beg}(u''_1),\ \mathrm{beg}(u''_2)\le \mathrm{end}(r)+f(|r|)+1$$

hold. Moreover, by Corollary 1, we have . Thus, the number of such repeats cannot be greater than

$$1+\frac{f(|r|)-g(|r|)}{p(r)}\le 1+\frac{|r|\,\Delta_{f,g}(|r|)}{p(r)}=1+e(r)\Delta_{f,g}(|r|)\le 1+e(r)\Delta_{f,g}.$$

In an analogous way we obtain that the number of repeats from which are totally generated by  from left is also not greater than $1+e(r)\Delta_{f,g}$, so $r$ generates no more than $2(1+e(r)\Delta_{f,g})$ repeats from .

From Lemma 2, using Theorem 1, we obtain the following bound on .

###### Corollary 2

.

Now we estimate . Let be a repetition in  which prefixly generates some repeat  from , i.e. one of the copies of  is contained in , and the other copy is contained in another repetition  with the same minimal period. Further we will say that is generated by  with the repetition .

###### Proposition 3

Let be two repetitions in . Then prefixly generates with  less than repeats.

Proof. We consider the case , i.e. the case of repeats generated by  with  from left (the case is considered analogously). Let , be two such repeats, i.e. , are prefixes of  and , are suffixes of . Note that in this case repeats , are uniquely defined by the respective positions , . Since , are repetitions with the minimal period , i.e. , we note that

$$w[\mathrm{end}(u_1)-2p(r)+1\,..\,\mathrm{end}(u_1)] = w[\mathrm{end}(r')-2p(r)+1\,..\,\mathrm{end}(r')] = w[\mathrm{end}(u_2)-2p(r)+1\,..\,\mathrm{end}(u_2)],$$

i.e. and are equal primitive squares with the period . Hence, by Proposition 1, we have . Moreover, since , are prefixes of , the restrictions

$$\mathrm{beg}(r)+2p(r)-1\le \mathrm{end}(u_1),\ \mathrm{end}(u_2)\le \mathrm{end}(r)$$

hold. Thus the number of considered repeats is bounded by

$$1+\frac{|r|-2p(r)}{p(r)}<\frac{|r|}{p(r)}=e(r).$$

Let be a repeat from prefixly generated from left by a maximal repetition , and be the gap of . Note that

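$$\mathrm{beg}(u'')=\mathrm{beg}(u')+|u'|+|v|=\mathrm{beg}(r)+|u'|+|v|.$$

So, since $g(|u'|)\le |v|\le f(|u'|)$, we have

$$\mathrm{beg}(r)+|u'|+g(|u'|)\le \mathrm{beg}(u'')\le \mathrm{beg}(r)+|u'|+f(|u'|).$$

Thus, taking into account $2p(r)\le |u'|<|r|$, we obtain

$$\mathrm{beg}(r)+\min_{2p(r)\le x<|r|}(x+g(x))\le \mathrm{beg}(u'')\le \mathrm{beg}(r)+\max_{2p(r)\le x<|r|}(x+f(x)).$$

We conclude the following fact, denoting $\mathrm{beg}(r)+\min_{2p(r)\le x<|r|}(x+g(x))$ by $lb_g(r)$ and $\mathrm{beg}(r)+\max_{2p(r)\le x<|r|}(x+f(x))$ by $ub_f(r)$.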

###### Proposition 4

Let be a repeat from  prefixly generated from left by a maximal repetition . Then

$$lb_g(r)\le \mathrm{beg}(u'')\le ub_f(r).$$

For any maximal repetition  in  denote by the set of all maximal repetitions  such that prefixly generates from left with  at least one repeat from . Note that all repetitions from have the minimal period .

###### Proposition 5

Let be a maximal repetition in . Then for any repetition  from except, perhaps, one repetition the conditions

$$lb_g(r)\le \mathrm{beg}(r')\le ub_f(r)$$

hold.

Proof. Let be a repetition from , and be a repeat prefixly generated from left by  with . By Proposition 4 we obtain , so since is contained in . Thus for any repetition  from we have . Now let , be two different repetitions from such that , and , be repeats prefixly generated from left by  with  and  respectively. Without loss of generality we assume that , so . On the other hand, by Proposition 4 we have , so . Thus is contained in , so is contained in the overlap of and . Recall that is a repetition with the minimal period , so . Therefore, and have the overlap of length greater than which contradicts Lemma 1. Thus no more than one repetition  from can satisfy the inequality .

###### Proposition 6

For any maximal repetition  in  the bound

$$ub_f(r)-lb_g(r)<|r|(1+\Delta_{f,g}+\partial_{f,g})$$

holds.

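Proof. Let $x^*$ be such that $2p(r)\le x^*<|r|$ and $ub_f(r)=\mathrm{beg}(r)+x^*+f(x^*)$, and $x_*$ be such that $2p(r)\le x_*<|r|$ and $lb_g(r)=\mathrm{beg}(r)+x_*+g(x_*)$. Then we have

$$ub_f(r)-lb_g(r)=(x^*-x_*)+(f(x^*)-g(x_*))<|r|+(f(x^*)-g(x_*)).$$

Without loss of generality we will assume that $x^*\ge x_*$ (the case of $x^*<x_*$ is considered analogously). We consider separately two possible cases for the value $\partial_{f,g}$.

a) Let $\partial_{f,g}=\partial^a_{f,g}$. Then

$$f(x^*)-g(x_*)\le (f(x_*)-g(x_*))+|f(x^*)-f(x_*)|^{+}$$

where

$$f(x_*)-g(x_*)=x_*\cdot\Delta_{f,g}(x_*)<|r|\Delta_{f,g}$$

and

$$|f(x^*)-f(x_*)|^{+}\le \partial^{+}_f(x^*-x_*)\le \partial^a_{f,g}(x^*-x_*)<|r|\partial^a_{f,g}=|r|\partial_{f,g}.$$

Thus $f(x^*)-g(x_*)<|r|(\Delta_{f,g}+\partial_{f,g})$, so

$$ub_f(r)-lb_g(r)<|r|(1+\Delta_{f,g}+\partial_{f,g}).$$

b) Let $\partial_{f,g}=\partial^b_{f,g}$. Then

$$f(x^*)-g(x_*)\le (f(x^*)-g(x^*))+|g(x^*)-g(x_*)|^{+}$$

where

$$f(x^*)-g(x^*)=x^*\cdot\Delta_{f,g}(x^*)<|r|\Delta_{f,g}$$

and

$$|g(x^*)-g(x_*)|^{+}\le \partial^{+}_g(x^*-x_*)\le \partial^b_{f,g}(x^*-x_*)<|r|\partial^b_{f,g}=|r|\partial_{f,g}.$$

Thus $f(x^*)-g(x_*)<|r|(\Delta_{f,g}+\partial_{f,g})$, so

$$ub_f(r)-lb_g(r)<|r|(1+\Delta_{f,g}+\partial_{f,g}).$$
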
###### Corollary 3

For any maximal repetition  in  the bound is valid.

Proof. Let be two maximal repetitions from such that . Since and have the same minimal period and by the definition of , by Lemma 1 we note that the overlap of and is less than , so

$$\mathrm{beg}(r'')-\mathrm{beg}(r')>|r'|-p(r)\ge |r|-p(r)\ge |r|/2.$$

Therefore, we conclude that the number of repetitions from such that

$$lb_g(r)\le \mathrm{beg}(r')\le ub_f(r)$$

is not greater than which is less than

$$1+\frac{|r|(1+\Delta_{f,g}+\partial_{f,g})}{|r|/2}=O(1+\Delta_{f,g}+\partial_{f,g})$$

by Proposition 6. Thus we obtain that by Proposition 5.

###### Lemma 3

The number of generated from left repeats from is .

Proof. It follows immediately from Proposition 3 and Corollary 3 that any maximal repetition  in  generates from left repeats from , so the bound in the lemma statement follows from Theorem 1.

In an analogous way we can prove that the number of generated from right repeats from is also . So we obtain the following bound on .

###### Corollary 4

.

In an analogous way we can also prove that . Thus, using Proposition 2 and Corollaries 2 and 4, we obtain the following bound on .

###### Corollary 5

.

Now we estimate the number of semiperiodic maximal -gapped repeats in . Let be a repeat from . Denote by () the periodic prefixes of (). Note that these prefixes are extended respectively to some distinct maximal repetitions , with the same minimal period such that , . We will say that is generated from left by  with  (generated from right by  with ) if (). Let be generated from left by  with . Note that

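$$\mathrm{beg}(u'')=\mathrm{end}(\pi')+1+(|u'|-|\pi'|)+|v|=\mathrm{end}(r')+1+(|u'|-|\pi'|)+|v|,$$

where and . Note also that . Therefore,

$$1+(|u'|-|\pi'|)+|v|\ge 2+g(|u'|)\ge 2+\min_{1\le x\le 2|r'|}g(x)$$

and

$$1+(|u'|-|\pi'|)+|v|\le 1+|r'|+f(|u'|)\le 1+|r'|+\max_{1\le x\le 2|r'|}f(x).$$

Denoting $\mathrm{end}(r)+2+\min_{1\le x\le 2|r|}g(x)$ by $lp_g(r)$ and $\mathrm{end}(r)+1+|r|+\max_{1\le x\le 2|r|}f(x)$ by $up_f(r)$, we obtain the following fact.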

###### Proposition 7

Let be a repeat from generated from left by a maximal repetition . Then

$$lp_g(r)\le \mathrm{beg}(u'')\le up_f(r).$$

For any maximal repetition  in  denote by the set of all maximal repetitions  such that generates from left with  at least one repeat from . Note that all repetitions from have the minimal period . Analogously to Proposition 5, we can prove the following statement.

###### Proposition 8

Let be a maximal repetition in . Then for any repetition  from except, perhaps, one repetition the conditions

$$lp_g(r)\le \mathrm{beg}(r')\le up_f(r)$$

hold.

###### Proposition 9

For any maximal repetition  in  the bound

$$up_f(r)-lp_g(r)<|r|(1+2\Delta_{f,g}+2\partial_{f,g})$$

holds.

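Proof. Let $x^*$, $x_*$, where $1\le x^*\le 2|r|$ and $1\le x_*\le 2|r|$, be such that $up_f(r)=\mathrm{end}(r)+1+|r|+f(x^*)$ and $lp_g(r)=\mathrm{end}(r)+2+g(x_*)$. Then we have

$$up_f(r)-lp_g(r)=|r|-1+(f(x^*)-g(x_*))<|r|+(f(x^*)-g(x_*)).$$

Without loss of generality we will assume that $x^*\ge x_*$ (the case of $x^*<x_*$ is considered analogously). We consider separately two possible cases for the value $\partial_{f,g}$.

a) Let $\partial_{f,g}=\partial^a_{f,g}$. Then

$$f(x^*)-g(x_*)\le (f(x_*)-g(x_*))+(f(x^*)-f(x_*))$$

where

$$f(x_*)-g(x_*)=x_*\cdot\Delta_{f,g}(x_*)\le 2|r|\Delta_{f,g}$$

and

$$f(x^*)-f(x_*)\le \partial^{+}_f(x^*-x_*)\le \partial^a_{f,g}(x^*-x_*)<2|r|\partial^a_{f,g}=2|r|\partial_{f,g}.$$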

Thus