Online Detection of Repetitions with Backtracking

Dmitry Kosolobov
Ural Federal University, Ekaterinburg, Russia
Email: dkosolobov@mail.ru

Abstract

In this paper we present two algorithms for the following problem: given a string and a rational $e > 1$, detect in the online fashion the earliest occurrence of a repetition of exponent $\ge e$ in the string.

1. The first algorithm supports the backtrack operation removing the last letter of the input string. This solution runs in $O(n\log m)$ time and $O(m)$ space, where $m$ is the maximal length of a string generated during the execution of a given sequence of $n$ read and backtrack operations.

2. The second algorithm works in $O(n\log\sigma)$ time and $O(n)$ space, where $n$ is the length of the input string and $\sigma$ is the number of distinct letters. This algorithm is relatively simple and requires much less memory than the previously known solution with the same working time and space.

Keywords:
repetition-free, square-free, online algorithm, backtracking

1 Introduction

The study of algorithms analyzing different kinds of string periodicities forms an important branch of stringology. Repetitions of a given fixed order often play a central role in such investigations. We say that an integer $p$ is a period of $w$ if $w = (uv)^k u$ for some integer $k \ge 1$ and strings $u$ and $v$ such that $|uv| = p$. Given a rational $e > 1$, a string $w$ such that $|w| \ge ep$ for a period $p$ of $w$ is called an $e$-repetition. A string is $e$-repetition-free if it does not contain an $e$-repetition as a substring. We consider algorithms recognizing $e$-repetition-free strings for any fixed $e > 1$. To be more precise, we say that an algorithm detects $e$-repetitions if it decides whether the input string is $e$-repetition-free. Further, we say that this algorithm detects $e$-repetitions online if it processes the input string sequentially from left to right and decides whether each prefix is $e$-repetition-free after reading the rightmost letter of that prefix.
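
The definitions above can be checked by brute force. The following Python sketch is only an illustration and is not one of the algorithms of this paper. It assumes the standard reading of the definitions: an integer $p$ with $1 \le p \le |w|$ is a period of $w$ exactly when $w[i] = w[i+p]$ for every valid $i$, and $w$ is an $e$-repetition when $|w| \ge ep$ for some period $p$. The function names are hypothetical.

```python
from fractions import Fraction


def is_period(w, p):
    """True if p is a period of w, i.e. w[i] == w[i + p] wherever both exist."""
    return 1 <= p <= len(w) and all(w[i] == w[i + p] for i in range(len(w) - p))


def is_e_repetition(w, e):
    """True if |w| >= e * p for some period p of w.

    It suffices to test the smallest period: if the inequality holds for
    any period, it also holds for the smallest one.
    """
    for p in range(1, len(w) + 1):
        if is_period(w, p):
            return len(w) >= e * p
    return False  # only reached for the empty string


def is_e_repetition_free(w, e):
    """Brute force over all substrings; polynomial time, for illustration only."""
    n = len(w)
    return not any(
        is_e_repetition(w[i:j], e) for i in range(n) for j in range(i + 1, n + 1)
    )


print(is_e_repetition("abab", 2))                      # a square
print(is_e_repetition_free("abcacb", 2))               # square-free
print(is_e_repetition_free("abcacb", Fraction(3, 2)))  # contains "cac"
```

Passing the exponent as a `Fraction` keeps the comparison exact for rational $e$.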

In this paper we give two algorithms that detect $e$-repetitions online for a given fixed $e > 1$. The first one, which uses the ideas of the Apostolico-Breslauer algorithm [1], works on unordered alphabet and supports backtracking, the operation removing the last letter of the processed string. This solution requires $O(n\log m)$ time and $O(m)$ space, where $m$ is the maximal length of a string generated during the execution of $n$ given backtrack and read operations. Slightly modifying the proof from [10], one can show that this time is the best possible in the case of unordered alphabet. The second algorithm works on ordered alphabet and requires $O(n\log\sigma)$ time and linear space, where $\sigma$ is the number of distinct letters in the input string and $n$ is the length of this string. Although this result does not theoretically outperform the previously known solution [6], it is significantly less complicated and can be used in practice. Both algorithms report the position of the leftmost $e$-repetition.

Let us point out some previous results on the problem. Recall that a repetition of the form $xx$ is called a square. A string is square-free if it is $2$-repetition-free. Squares are, perhaps, the most extensively studied repetitions. The classical result of Thue [12] states that on a three-letter alphabet there are infinitely many square-free strings. How fast can one decide whether a string is square-free? It turns out that the orderedness of alphabet plays a crucial role here: while any algorithm detecting squares on unordered alphabet requires $\Omega(n\log n)$ time [10], it is unlikely that any superlinear lower bound exists in the case of ordered alphabet, in view of the recent result of the author [8]. So, we always emphasize whether an algorithm under discussion relies on order or not.

The best known offline (not online) results are the algorithm of Main and Lorentz [10], which detects $e$-repetitions in $O(n\log n)$ time and linear space on unordered alphabet, and Crochemore’s algorithm [4], which detects $e$-repetitions in $O(n\log\sigma)$ time and linear space on ordered alphabet. Our interest in online algorithms detecting repetitions was partially motivated by problems in artificial intelligence research (see [9]), where some algorithms use online square detection. Apostolico and Breslauer [1] presented a parallel algorithm for this problem on an unordered alphabet. As a by-product, they obtained an online algorithm detecting squares in $O(n\log n)$ time and linear space, the best possible bounds, as noted above. Later, online algorithms detecting squares in $O(n\log^2 n)$ [9] and $O(n(\log n + \sigma))$ [7] time were proposed. Apparently, their authors were unaware of the result of [1]. For ordered alphabet, Jansson and Peng [7] found an online algorithm detecting squares in $O(n\log n)$ time, and Hong and Chen [6] presented an online algorithm detecting $e$-repetitions in $O(n\log\sigma)$ time and linear space.

An online algorithm for square detection with backtracking is at the core of the generator of random square-free strings described in [11]. Using our algorithm with backtracking, one can in a similar way construct a generator of random $e$-repetition-free strings for any fixed $e$. This result might be useful in further studies in combinatorics on words.
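
To make the role of backtracking in such a generator concrete, here is a minimal Python sketch. It is not the generator of [11] and not the algorithm of this paper. It rests on two assumptions: squares are detected by a naive comparison of the two halves of each even-length suffix, and the recovery rule is that once a square $xx$ appears as a suffix, the last $|x|$ letters are removed by backtracking. The names `square_suffix_half` and `random_square_free` are hypothetical.

```python
import random


def square_suffix_half(t):
    """Return h > 0 if the last 2*h letters of t form a square, else 0."""
    n = len(t)
    for h in range(1, n // 2 + 1):
        if t[n - 2 * h:n - h] == t[n - h:]:
            return h
    return 0


def random_square_free(n, alphabet="abc", seed=None):
    """Grow a square-free string by random reads and backtracking.

    Invariant: t is square-free at the start of every iteration, so any
    square created by the appended letter must be a suffix of t.
    Note: over an alphabet with fewer than three letters the loop does
    not terminate for n > 3, since no such square-free string exists.
    """
    rng = random.Random(seed)
    t = []
    while len(t) < n:
        t.append(rng.choice(alphabet))  # read operation
        h = square_suffix_half(t)
        if h:
            del t[-h:]                  # h backtrack operations
    return "".join(t)


print(random_square_free(30, seed=1))
```

Replacing the naive suffix test by an online detector that supports backtracking removes the quadratic cost per appended letter, which is the use case described above.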

The paper is organized as follows. In Section 2 we present some basic definitions and the key data structure, called catcher, which helps to detect repetitions. Section 3 contains an algorithm with backtracking. In Section 4 we describe a simpler solution without backtracking.

2 Catcher

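A string of length $n$ over the alphabet $\Sigma$ is a map $\{1,2,\ldots,n\} \to \Sigma$, where $n$ is referred to as the length of $w$, denoted by $|w|$. We write $w[i]$ for the $i$th letter of $w$ and $w[i..j]$ for $w[i]w[i{+}1]\cdots w[j]$. Let $w[i..j]$ be the empty string for any $i > j$. A string $u$ is a substring of $w$ if $u = w[i..j]$ for some $i$ and $j$. The pair $(i,j)$ is not necessarily unique; we say that $i$ specifies an occurrence of $u$ in $w$. A string can have many occurrences in another string. A substring $w[1..j]$ [resp., $w[i..n]$] is a prefix [resp. suffix] of $w$. For any $i, j$, the set $\{k \in \mathbb{Z} : i \le k \le j\}$ (possibly empty) is denoted by $[i..j]$; $(i..j]$ and $[i..j)$ denote $[i..j] \setminus \{i\}$ and $[i..j] \setminus \{j\}$ respectively.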

We fix a rational constant $e > 1$ and use it throughout the paper. The input string is denoted by $text$ and $n = |text|$. Initially, $text$ is the empty string. We refer to the operation appending a letter to the right of $text$ as read operation and to the operation that cuts off the last letter of $text$ as backtrack operation.

Let us briefly outline the ideas behind our results. Both our algorithms utilize an auxiliary data structure based on a scheme proposed by Apostolico and Breslauer [1]. This data structure is called a catcher. Once a letter is appended to the end of , the catcher checks whether has a suffix that is an -repetition of length such that for some segment specific for this catcher. The segment cannot be arbitrary, so we cannot, for example, create a catcher with and . But, as it is shown in Section 3, we can maintain catchers such that the union of their segments covers the whole range from to and hence these catchers “catch” each -repetition in . This construction leads to an algorithm with backtracking. In Section 4 we further reduce the number of catchers to a constant but this solution does not support backtracking.

In what follows we first describe an inefficient version of the read operation for catcher and show how to implement the backtrack operation; then, we improve the read operation and provide time and space bounds for the constructed catcher.

Let and be integers such that . Observe that if for some , the string is an -repetition and , then the string occurs in (see Fig. 1). Given this fact, the read operation works as follows. The catcher searches online occurrences of the string in . If we have , then the number is a period of . The catcher “extends” the repetition to the left with the same period . Then, the catcher online “extends” the repetition to the right with the same period until an -repetition is found. We say that the catcher is defined by and .

Figure 1: An -repetition , where , . Here , , and .
Example 1

Consider . Denote . Suppose . Let a catcher be defined by and (see Fig. 1). We consecutively perform the read operations that append the letters to the right of . The catcher online searches occurrences of the string (e.g., using the standard Boyer-Moore or Knuth-Morris-Pratt algorithm). Once we have , the catcher has found an occurrence of : . Hence, the string has a period . The catcher “extends” this repetition to the left and thus obtains the repetition with period . Then the catcher online “extends” the found repetition to the right: after the next read operation, the catcher obtains the repetition that is an -repetition.
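
The online search mentioned in Example 1 can be sketched with the Knuth-Morris-Pratt automaton fed one letter at a time. The Python class below is an illustrative assumption rather than the matcher used by the catcher: KMP spends amortized constant time per letter, not worst-case constant time, and it stores a failure table proportional to the pattern length. The class name `OnlineMatcher` and its method `feed` are hypothetical.

```python
class OnlineMatcher:
    """Knuth-Morris-Pratt matcher that consumes the text letter by letter."""

    def __init__(self, pattern):
        if not pattern:
            raise ValueError("pattern must be non-empty")
        self.pattern = pattern
        # fail[i] = length of the longest proper border of pattern[:i + 1]
        self.fail = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k and pattern[i] != pattern[k]:
                k = self.fail[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            self.fail[i] = k
        self.state = 0  # number of pattern letters currently matched

    def feed(self, letter):
        """Consume one letter; return True if an occurrence ends here."""
        p = self.pattern
        if self.state == len(p):  # continue after a previous full match
            self.state = self.fail[self.state - 1]
        while self.state and letter != p[self.state]:
            self.state = self.fail[self.state - 1]
        if letter == p[self.state]:
            self.state += 1
        return self.state == len(p)


matcher = OnlineMatcher("aba")
ends = [k for k, c in enumerate("ababa", start=1) if matcher.feed(c)]
print(ends)  # 1-based end positions of the occurrences of "aba"
```

When the pattern is cut from an earlier part of the same string, the distance between that earlier occurrence and a newly reported one is a period of the stretch the two occurrences span, which is the fact the catcher exploits.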

To support the backtrack operation, we store the states of the catcher in an array of states and when the backtracking is performed, we restore the previous state. For the described read operation, this approach has two drawbacks. First, the state does not necessarily require a fixed space, so the array of states may take a large amount of memory. Second, the catcher can spend a lot of time at some text locations (alternating backtracking with reading) and therefore the complexity of the whole algorithm can greatly increase. To solve these problems, our improved read operation performs the “extensions” of found repetitions and the searching of simultaneously.
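
The idea of restoring a previous state on backtrack can be illustrated in isolation. The Python sketch below keeps one stored state per prefix, so a backtrack operation is a single pop. It is a deliberately naive stand-in for a catcher: each read operation rescans the suffixes of the current string, and the stored state is just one Boolean. The class name `NaiveBacktrackingDetector` and its methods are hypothetical.

```python
class NaiveBacktrackingDetector:
    """Read/backtrack interface with one stored state per prefix."""

    def __init__(self, e):
        self.e = e        # exponent threshold, e > 1 (int or Fraction)
        self.text = []    # letters read so far
        self.states = []  # states[k]: is text[:k + 1] e-repetition-free?

    def _suffix_is_repetition(self):
        """True if some suffix of the current text is an e-repetition."""
        t, n = self.text, len(self.text)
        for p in range(1, n):
            # grow the longest suffix that has period p
            length = p
            while length < n and t[n - 1 - length] == t[n - 1 - length + p]:
                length += 1
            if length >= self.e * p:
                return True
        return False

    def is_free(self):
        return self.states[-1] if self.states else True

    def read(self, letter):
        """Append a letter; return whether the new text is e-repetition-free."""
        prev = self.is_free()
        self.text.append(letter)
        # if the old text was free, a new repetition must be a suffix
        free = prev and not self._suffix_is_repetition()
        self.states.append(free)
        return free

    def backtrack(self):
        """Remove the last letter and restore the previous state."""
        self.text.pop()
        self.states.pop()


d = NaiveBacktrackingDetector(2)
for c in "abcab":
    d.read(c)
print(d.read("c"))  # "abcabc" is a square
d.backtrack()
print(d.read("a"))  # "abcaba" is square-free
```

The two drawbacks named above show up here in miniature: nothing bounds the work of a single read operation, so alternating backtracking and reading at the same position repeats that work every time.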

This approach relies on a real-time constant-space string matching algorithm, i.e., a constant-space algorithm that processes the input string online, spending constant time per letter; once the searched pattern occurs, the algorithm reports this occurrence. For unordered alphabet, we can use the algorithm of Galil and Seiferas [5] though in the case of ordered alphabet, it is more practical to use the algorithm of Breslauer, Grossi, and Mignosi [2].

The improved read operation works as follows. Denote . The real-time string matching algorithm searches for . It is easy to see that if we have , then the number is a period of . The catcher maintains a linked list of pairs , where is found in the described way and is such that is a period of (initially ). Each read operation tries to extend with the same period to the right and to the left. If , then the catcher removes from . To extend to the left, we could assign but the calculation of this value requires time while we want to keep within the constant time on each read operation.

In order to achieve this goal, we will extend symbols to the left after reading a letter. We choose . Then one of two situations occurs at the moment when (i.e., an occurrence of is found). Either we have ( cannot be “extended” to the left) or is an -repetition. Suppose and . Since at this moment we have performed operations decreasing by , we have and hence . Thus, if we put , then and therefore, is an -repetition. The following pseudocode clarifies this description.

1:read a letter and append it to (thereby incrementing )
2:feed the letter to the algorithm searching for
3:if  then found an occurrence
4:      is a period of
5:     
6:for all  do
7:     if  then
8:          cannot be “extended” to the right
9:     else
10:          maximal number of left “extensions”
11:         while  do
12:               “extend” to the left          
13:         if  then if is an -repetition
14:              detected -repetition               

A state of the catcher consists of the list and the state of the string matching algorithm, integers in total. To support the backtracking, we simply store the states of the catcher in an array of states.

Lemma 1

Suppose that and define a catcher on , is the current length of , and . If the conditions (i) is -repetition-free and (ii) hold, then each read or backtrack operation takes time and the catcher occupies space.

Proof

Clearly, at any time of the work, the array of states contains states. Each state occupies integers. Hence, to estimate the required space, it suffices to show that . Denote . It follows from the pseudocode that each corresponds to a unique occurrence of in . Thus, to prove that , it suffices to show that the string has at most occurrences in at any time of the work of the catcher. Suppose occurs at positions and such that . Hence, the number is a period of . Since is -repetition-free during the work of the catcher, we have . Therefore the string always has at most occurrences in the string . Finally, the inequalities and imply .

Obviously, each backtrack operation takes time. Any read operation takes at least constant time for each . But for some , the algorithm can perform iterations of the loop in lines 11–12 (see the value of in line 10). Since for each , we have and therefore, the loop performs at most iterations. The loop is executed iff . But since for each , the value of is chosen in such a way that only if is a proper prefix of (see the discussion above), there are at most periods for which the algorithm executes the loop. Finally, we have time for each read operation. ∎

Lemma 2

If for some , the string is an -repetition and , then a catcher defined by and detects this repetition.

Proof

Let be the minimal period of . Since is a substring of and , the string occurs at position . Thus, the catcher detects this -repetition when processes this occurrence (see Fig. 1). ∎

We say that a catcher covers if the catcher is defined by integers and such that ; by Lemma 2, this condition implies that if for some , the suffix is an -repetition, then the catcher detects this repetition. We also say that the catcher covers a segment of length . Note that if we append a letter to the end of , the catcher still covers . We say that a set of catchers covers if , where is a segment covered by catcher .

3 Unordered Alphabet and Backtracking

Theorem 3.1

For unordered alphabet, there is an online algorithm with backtracking that detects $e$-repetitions in $O(n\log m)$ time and $O(m)$ space, where $m$ is the length of a longest string generated during the execution of a given sequence of $n$ backtrack and read operations.

Proof

As above, denote . If is not -repetition-free, our algorithm skips all read operations until backtrack operations make -repetition-free. Therefore, in what follows we can assume that is -repetition-free and thus, all -repetitions of are suffixes. In our proof we first give an algorithm without backtracking and then improve it to support the backtrack operation.

The algorithm without backtracking. Our algorithm maintains catchers that cover and therefore “catch” almost all -repetitions. For each , we have a constant number of catchers covering adjacent segments of length . These segments are of the form for some integers precisely defined below. Let us fix an integer constant for which it is possible to create a catcher covering . To show that such exists, consider a catcher defined by . By Lemma 2, this catcher covers iff or, equivalently, . As it will be clear below, to make our catchers fast, we must assume that . Note that since , and implies .

Now we precisely describe the segments covered by our catchers. Denote . For any integer , is a nonnegative multiple of . Let . The algorithm maintains catchers covering the following segments: (see Fig. 2). Thus, there are at most catchers for each such . Obviously, the constructed segments cover .

Figure 2: A system of catchers covering .

To maintain this system of catchers, the algorithm loops through all such that and, if is a multiple of , creates a new catcher covering ; if, in addition, is a multiple of , the algorithm removes two catchers covering and . To prove that the derived system covers , it suffices to note that if an iteration of the loop removes two catchers covering and , for some , then the next iteration creates a catcher covering . We detect -repetitions of lengths by a simple naive algorithm. In the following pseudocode we use the three-operand loop like in the C language.

1:read a letter and append it to (thereby incrementing )
2:check for -repetitions of length
3:for  do
4:     create a catcher covering
5:     if  then
6:         remove the catcher covering
7:         remove the catcher covering      

When the algorithm creates a catcher covering , it has some freedom choosing integers and that define this catcher. We put and . Indeed, in the case we have and, by Lemma 2, the catcher covers ; the case was considered above when we discussed the value of .

Clearly, the proposed algorithm is correct. Now it remains to estimate the consumed time and space. Consider a catcher defined by integers and and covering a segment of length . Let us show that for a constant depending only on and . We have . The inequality implies (here we use the fact that is strictly greater than ). Hence, we can put .

Denote by the value of at the moment of creation of the catcher. The algorithm removes this catcher when either or . Thus, since for some , it follows from Lemma 1 that the catcher requires time at each read operation and occupies space. Hence, all catchers take space and the algorithm requires time at each read operation if we don’t count the time for creation of catchers. We don’t estimate this time in this first version of our algorithm.

The algorithm with backtracking. Now we modify the proposed algorithm to support the backtracking. Denote . The backtrack operation is simply a reversed read operation: we loop through all such that and, if is a multiple of , remove the catcher covering ; if, in addition, is a multiple of , the algorithm creates two catchers covering and . Clearly, this solution is slow: if for some integer , then consecutive backtrack and read operations require time.

To solve this problem, we make the life of catchers longer. In the modified algorithm, the read and backtrack operations don’t remove catchers but mark them as “removed” and the marked catchers still work some number of steps. If a backtrack or read operation tries to create a catcher that already exists but is marked as “removed”, the algorithm just deletes the mark.

How long is the life of marked catcher? Consider a catcher defined by and , where is the value of at the moment of creation of the catcher in the corresponding read operation. The read operation marks the catcher as “removed” when either or ; our modified algorithm removes this marked catcher when or respectively, i.e., the catcher “lives” additional steps. The backtrack operation marks the catcher as “removed” when ; we remove this catcher when (recall that the catcher cannot exist if ), i.e., the catcher “lives” additional steps.

Let us analyze the time and space consumed by the algorithm. It is easy to see that for any , there are at most catchers covering segments of length . The worst case is achieved when we have working catchers and two marked catchers. Now it is obvious that the modified algorithm, as the original one, takes space and requires time in each read or backtrack operation if we don’t count the time for creation of catchers. The key property that helps us to estimate this time is that once a catcher covering a segment of length is created, it cannot be removed during any sequence of backtrack and read operations. To create this catcher, the algorithm requires time and hence, this time for creation is amortized over the sequence of backtrack and read operations. Thus, the algorithm takes overall time, where is the number of read and backtrack operations. ∎

4 Ordered Alphabet

It turns out that in some natural cases we can narrow the area of -repetition search. More precisely, if is -repetition-free, then the length of any -repetition of is close to the length of the shortest suffix of such that does not occur in . In the sequel, is referred to as the shortest unioccurrent suffix of . Denote . Suppose is a suffix of such that is an -repetition. Let us first consider some specific values of .

Example 2

Let . We prove that . Denote by a period of such that . Since the suffix of length occurs in and is -repetition-free, we have . Suppose, to the contrary, . Then and by periodicity of (see Fig. 3 a), a contradiction to the definition of .

Example 3

Let . We show that . As above, we have . Denote by a period of such that . Suppose (or ); then and (see Fig. 3 b), which contradicts the definition of .

Figure 3: Panels (a) and (b) illustrate Example 2 and Example 3, respectively.

Lemma 3

Let be the length of the shortest unioccurrent suffix of , and be an -repetition of . If is -repetition-free, then .

Proof

Clearly, is a suffix. We have since the suffix of length occurs in and is -repetition-free. Suppose, to the contrary, (or ). Denote by the minimal period of . We have . Further, we obtain , i.e., . Finally, since is a period of , we have (see Fig. 3 a,b). This contradicts the definition of . ∎

Lemma 3 describes the segment in which our algorithm must search -repetitions. To cover this segment by catchers, we use the following technical lemma.

Lemma 4

Let and be integers such that and for a constant . Then there is a set of catchers covering such that is a constant depending on and each is defined by integers and  such that .

Proof

Let us choose a number such that . Denote . Consider the following set of catchers : is defined by integers and (see Fig. 4). Denote and . By Lemma 2, covers . Thus, for any , the catcher covers and therefore, the set covers the following segment:

Hence, if and , the set covers . Thus to cover , we can, for example, put and . Finally for , we have . ∎

Figure 4: The system with ( is depicted for clarity), , .

For each integer , denote by the length of the shortest unioccurrent suffix of . We say that there is an online access to the sequence if any algorithm that reads the string sequentially from left to right can read immediately after reading . The following lemma describes an online algorithm for -repetition detection based on an online access to . Note that the alphabet is not necessarily ordered.
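
For experimentation, the sequence of shortest unioccurrent suffix lengths can be computed by brute force. The Python sketch below assumes the following reading of the definition: the shortest unioccurrent suffix of a string is its shortest suffix that occurs in the string exactly once. It takes quadratic time or worse and only illustrates the definition; it is not an online-access mechanism with the guarantees the lemma below requires. Both function names are hypothetical.

```python
def shortest_unioccurrent_suffix_length(text):
    """Length of the shortest suffix of text occurring exactly once in text.

    Returns 0 for the empty string.
    """
    n = len(text)
    for length in range(1, n + 1):
        start = n - length
        # the suffix is unioccurrent iff its first occurrence is the suffix itself
        if text.find(text[start:]) == start:
            return length
    return 0


def unioccurrent_suffix_lengths(text):
    """The sequence of these lengths for the prefixes of lengths 1, 2, ..., |text|."""
    return [
        shortest_unioccurrent_suffix_length(text[:k])
        for k in range(1, len(text) + 1)
    ]


print(unioccurrent_suffix_lengths("abcab"))
```

The loop always succeeds for a non-empty string, because the whole string is a suffix of itself and occurs exactly once.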

Lemma 5

If there is an online access to the sequence , then there exists an algorithm that online detects -repetitions in linear time and space.

Proof

Our algorithm online reads the string while is -repetition-free. Let . Denote and . By Lemma 3, to detect -repetitions, it suffices to have a set of catchers covering . But if the set covers only , then we will have to update the catchers in each step such that or . To reduce the number of updates, we cover with significantly long left and right margins. Thus, some changes of and can be made without rebuilding of catchers.

We maintain two variables and such that . Initially . To achieve linear time, we also require . The following pseudocode explains how we choose and :

1:read a letter and append it to (thereby we increment and read )
2:
3:if  then
4:     
5:     update catchers to cover

The correctness is clear. Consider the space requirements. Since and , it follows that for any . Therefore, by Lemma 4, the algorithm uses a constant number of catchers and hence requires at most linear space. Denote by the number of catchers.

Let us estimate the running time. Observe that never decreases. In our analysis, we assume that to increase , the algorithm performs increments. Obviously, our assumption does not affect the overall running time: to process any string of length , the algorithm executes at most increments. Also the algorithm performs increments of . We prove that the time required to maintain catchers is amortized over the sequence of increments of and .

Suppose the algorithm creates a set of catchers at some point. Denote by the value of at this moment. Let us prove that it takes time to create this set. For , let be defined by and . By Lemma 4, for each , we have . Since , we obtain for any . Hence, by Lemma 1, it takes time to create the catcher . Note that and , i.e., . Therefore, to build the set , the algorithm requires time.

Let us prove that to update the set , the algorithm must execute increments of or . Consider the conditions of line 3:

  1. To satisfy (clearly in this case), since we have for any , we must perform at least increments of .

  2. To satisfy , we must execute increments of .

  3. To satisfy , since and , we must increase by at least .

The third condition forces us to update catchers after increments of . Indeed, we have . Recall that for each , we have and . Hence, by Lemma 1, the catchers take overall time. Thus the time required to maintain all catchers is amortized over the sequence of increments of and . ∎

Theorem 4.1

For ordered alphabet, there exists an algorithm that online detects $e$-repetitions in $O(n\log\sigma)$ time and linear space, where $\sigma$ is the number of distinct letters in the input string.

Proof

To compute the sequence , we can use, for example, Weiner’s online algorithm [13] (or its slightly optimized version [3]), which works in time and linear space. Thus, the theorem follows from Lemma 5. ∎

Corollary

For constant alphabet, there exists an algorithm that online detects $e$-repetitions in linear time and space.

Acknowledgement. The author would like to thank Arseny M. Shur for the help in the preparation of this paper and Gregory Kucherov for stimulating discussions.

References

  • [1] Apostolico, A., Breslauer, D.: An optimal $O(\log\log n)$-time parallel algorithm for detecting all squares in a string. SIAM Journal on Computing 25(6), 1318–1331 (1996)
  • [2] Breslauer, D., Grossi, R., Mignosi, F.: Simple real-time constant-space string matching. In: Combinatorial Pattern Matching. pp. 173–183. Springer (2011)
  • [3] Breslauer, D., Italiano, G.F.: Near real-time suffix tree construction via the fringe marked ancestor problem. Journal of Discrete Algorithms 18, 32–48 (2013)
  • [4] Crochemore, M.: Transducers and repetitions. Theoretical Computer Science 45, 63–86 (1986)
  • [5] Galil, Z., Seiferas, J.: Time-space-optimal string matching. Journal of Computer and System Sciences 26(3), 280–294 (1983)
  • [6] Hong, J.J., Chen, G.H.: Efficient on-line repetition detection. Theoretical Computer Science 407(1), 554–563 (2008)
  • [7] Jansson, J., Peng, Z.: Online and dynamic recognition of squarefree strings. In: Mathematical Foundations of Computer Science 2005, pp. 520–531. Springer (2005)
  • [8] Kosolobov, D.: Lempel-Ziv factorization may be harder than computing all runs. In: 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015). Leibniz International Proceedings in Informatics (LIPIcs), vol. 30, pp. 582–593. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik (2015)
  • [9] Leung, H.F., Peng, Z., Ting, H.F.: An efficient online algorithm for square detection. In: Computing and Combinatorics, pp. 432–439. Springer (2004)
  • [10] Main, M.G., Lorentz, R.J.: Linear time recognition of squarefree strings. In: Combinatorial Algorithms on Words, pp. 271–278. Springer (1985)
  • [11] Shur, A.M.: Generating square-free words efficiently. accepted to WORDS’2013 special issue of Theoretical Computer Science (2014)
  • [12] Thue, A.: Über unendliche Zeichenreihen (1906). In: Selected mathematical papers of Axel Thue. Universitetsforlaget (1977)
  • [13] Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory (SWAT 1973). pp. 1–11. IEEE (1973)