Online Square Detection
The online square detection problem is to detect the first occurrence of a square in a string whose characters are provided as input one at a time. Recall that a square is a string that is a concatenation of two identical strings. In this paper we present an algorithm solving this problem in $O(n\log\sigma)$ time and linear space on an ordered alphabet, where $\sigma$ is the number of different letters in the input string. Our solution is relatively simple and, unlike the previously known online algorithm with the same running time, does not require much memory. We also present an algorithm working in $O(n\log n)$ time and linear space on an unordered alphabet, though this solution does not outperform the previously known result with the same time bound.
The study of algorithms for analysis of different kinds of periodicities in strings constitutes an important branch of stringology, and squares often play a central role in such research. Recall that a string $s$ is a square if $s = xx$ for some nonempty string $x$. A string is squarefree if it does not have a substring that is a square. We consider algorithms recognizing squarefree strings. To be more precise, let $f$ be a positive function of integer domain; we say that an algorithm detects squares in $O(f(n))$ time if for any integer $n$ and any string of length $n$, the algorithm decides whether the string is squarefree in $O(f(n))$ operations. We say that an algorithm detects squares online if the algorithm processes the input string sequentially from left to right and decides whether each prefix is squarefree after reading the rightmost letter of that prefix.
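To make the definitions concrete, here is a minimal brute-force sketch in Python. It is only an illustration of what "online square detection" asks for, not one of the algorithms of this paper; the function names are hypothetical and the running time is far from the bounds discussed here.

```python
def is_square(s: str) -> bool:
    """True if s = xx for some nonempty string x."""
    half, rem = divmod(len(s), 2)
    return len(s) > 0 and rem == 0 and s[:half] == s[half:]


def has_square_suffix(s: str) -> bool:
    """True if some suffix of s (of length at least 2) is a square."""
    return any(is_square(s[k:]) for k in range(len(s) - 1))


def first_square_online(letters):
    """Read letters one at a time; return the length of the shortest
    prefix that is not squarefree, or None if every prefix is squarefree."""
    prefix = ""
    for c in letters:
        prefix += c
        # All shorter prefixes were squarefree, so a new square
        # must end at the letter just read.
        if has_square_suffix(prefix):
            return len(prefix)
    return None
```

The key observation used by every online detector is visible in the loop: once all shorter prefixes are known to be squarefree, only square suffixes of the current prefix need to be checked.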
In this paper we give two algorithms for online square detection. The first one works on an ordered alphabet and requires $O(n\log\sigma)$ time and linear space, where $\sigma$ is the number of different letters in the input string. Though the proposed result is not new (see [HC08]), it is considerably less complicated than the previously known solution and can be used in practice. The second algorithm works on an unordered alphabet and takes $O(n\log n)$ time and linear space. This algorithm is substantially different from the algorithm of [AB96], which has the same time and space bounds. These two algorithms are equally applicable for practical use. Let us point out some previous results on the problem of square detection.
One can easily show that on a two-letter alphabet any string of length at least four is not squarefree. A classical result of Thue [Th1906] states that on a three-letter alphabet there are infinitely many squarefree strings. Main and Lorentz [ML85] presented an algorithm that detects squares in $O(n\log n)$ time and linear space for an unordered alphabet. They proved an $\Omega(n\log n)$ lower bound for the problem of square detection on unordered alphabets, so their result is the best possible in this case. For ordered alphabets, Crochemore [Cr86] described an algorithm that detects squares in $O(n\log\sigma)$ time and linear space.
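Both combinatorial facts can be checked experimentally. The sketch below is an illustration only: `has_square` is a brute-force test, and `thue_word` iterates the morphism a → abc, b → ac, c → b, one classical morphism whose fixed point is known to be squarefree (the choice of this particular morphism is an assumption of the example, not something taken from the text above).

```python
from itertools import product


def has_square(s: str) -> bool:
    """Brute-force test: does s contain a substring of the form xx?"""
    n = len(s)
    return any(s[i:i + h] == s[i + h:i + 2 * h]
               for h in range(1, n // 2 + 1)
               for i in range(n - 2 * h + 1))


def thue_word(length: int) -> str:
    """Prefix of the fixed point of a -> abc, b -> ac, c -> b."""
    morphism = {"a": "abc", "b": "ac", "c": "b"}
    word = "a"
    while len(word) < length:
        word = "".join(morphism[c] for c in word)
    return word[:length]


# Every binary string of length four contains a square.
binary_ok = all(has_square("".join(p)) for p in product("ab", repeat=4))
# A long ternary prefix generated by the morphism contains none.
ternary_ok = not has_square(thue_word(300))
```

Checking a finite prefix is of course not a proof; it only illustrates the statement.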
The interest in algorithms for online square detection was initially motivated by problems in artificial intelligence research (see [LPT04]). Leung, Peng, and Ting [LPT04] obtained an online algorithm that detects squares in $O(n\log^2 n)$ time and linear space on an unordered alphabet. Jansson and Peng [JP05] found an online algorithm that detects squares in $O(n(\log n + \sigma))$ time; for ordered alphabets, their algorithm requires $O(n\log n)$ time. Hong and Chen [HC08] presented an online algorithm that detects squares in $O(n\log\sigma)$ time and linear space on an ordered alphabet. Their algorithm relies heavily on string indexing structures, whose space consumption makes it rather impractical; even the most careful implementations appear to need a large amount of memory in the worst case. Apostolico and Breslauer [AB96], as a byproduct of their parallel algorithm for square detection, obtained an online algorithm that detects squares in $O(n\log n)$ time and linear space on an unordered alphabet (apparently, the authors of [LPT04] and [JP05] did not know about this result).
The present paper is inspired by Shur’s work [Sh14] on random generation of square-free strings.
A string of length $n$ over an alphabet $\Sigma$ is a map $\{1,2,\ldots,n\} \to \Sigma$. The length of $w$ is denoted by $|w|$. We write $w[i]$ for the $i$th letter of $w$ and $w[i..j]$ for $w[i]w[i{+}1]\cdots w[j]$. Let $w[i..j]$ be the empty string for any $i > j$. A string $u$ is a substring of $w$ if $u = w[i..j]$ for some $i$ and $j$. The pair $(i,j)$ is not necessarily unique; we say that $i$ specifies an occurrence of $u$ in $w$. A string can have many occurrences in another string. A substring $w[1..j]$ (resp., $w[i..n]$) is a prefix (resp., suffix) of $w$. A string which is both a proper prefix and a suffix of $w$ is a boundary of $w$. A square suffix is a suffix that is a square. For any integers $i, j$, the set $\{k \in \mathbb{Z} : i \le k \le j\}$ (possibly empty) is denoted by $[i..j]$.
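The notions of boundary and occurrence can be illustrated with a short Python sketch. The function names are hypothetical; `boundary_lengths` uses the standard failure-function recurrence, and `occurrences` reports 1-based starting positions to match the indexing convention above.

```python
def boundary_lengths(w: str) -> list:
    """b[m] is the length of the longest boundary (a proper prefix that
    is also a suffix) of the prefix w[0..m] (0-based, inclusive)."""
    b = [0] * len(w)
    for m in range(1, len(w)):
        k = b[m - 1]
        # Fall back through shorter boundaries until one can be extended.
        while k > 0 and w[m] != w[k]:
            k = b[k - 1]
        if w[m] == w[k]:
            k += 1
        b[m] = k
    return b


def occurrences(u: str, w: str) -> list:
    """1-based starting positions of all occurrences of u in w."""
    return [i + 1 for i in range(len(w) - len(u) + 1)
            if w[i:i + len(u)] == u]
```

For example, the longest boundary of "abacaba" is "aba", and "ab" has two occurrences in "abcab".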
Suppose is a string, . To detect squares, we use an auxiliary data structure, called catcher. The catcher works with the string . For correct work, the string must be squarefree. The catcher contains integer variables , such that and and once a letter is appended to the right of , the catcher detects square suffixes beginning inside , i.e., if for some , is a square, the catcher detects this square. The segment is called the trap.
Suppose for some nonempty string and (see fig. 1). Since , we have for some strings and such that .
If is squarefree, then is the longest boundary of .
Suppose is the longest boundary of and ; then is a suffix of and the occurrence of that starts at position overlaps the occurrence of that starts at position . Thus, is not squarefree. This is a contradiction. ∎
Denote . To obtain the longest boundary of , the catcher maintains an integer array (for convenience, we use indices ) such that for any , is equal to the length of the longest boundary of . Further, the catcher contains a variable such that if , equals the length of the longest suffix of that is a suffix of and otherwise, equals zero. Thus by Lemma 1, the catcher detects a square iff .
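The detection rule can be illustrated by a simplified Python sketch. It is not the pseudo-code of this paper, and it rests on several assumptions: indices are 0-based, the trap is the inclusive range `[i..j]`, the catcher is created when the trap ends at the last letter read, and the text read before each new letter is squarefree. The sketch uses the following reading of Lemma 1: if a square $uvuv$ is a suffix of the text, where $u$ is the part of the first half lying inside the trap, then the longest boundary of the part after the trap, $vuv$, is $v$. Instead of an incrementally maintained suffix-length variable, the sketch compares $u$ directly, so it does not achieve the catcher's time bound. The names `Catcher` and `feed` are hypothetical.

```python
class Catcher:
    """Simplified catcher for the trap text[i..j] (0-based, inclusive)."""

    def __init__(self, text: str, i: int, j: int):
        if not (0 <= i <= j == len(text) - 1):
            raise ValueError("the trap must end at the last letter read")
        self.trap = list(text[i:j + 1])  # letters inside the trap
        self.tail = []                   # letters read after the trap
        self.border = []                 # border[m]: longest boundary of tail[:m+1]

    def feed(self, letter: str) -> bool:
        """Append a letter; report a square suffix xx starting at some k
        in the trap with |x| >= j - k + 1, if one exists."""
        tail, border = self.tail, self.border
        b = border[-1] if border else 0
        while b > 0 and tail[b] != letter:
            b = border[b - 1]
        if tail and tail[b] == letter:
            b += 1
        tail.append(letter)
        border.append(b)
        # tail = v u v with |v| = b; u must be nonempty and fit in the trap.
        u_len = len(tail) - 2 * b
        if u_len < 1 or u_len > len(self.trap):
            return False
        return self.trap[len(self.trap) - u_len:] == tail[b:b + u_len]
```

For instance, with the text "a" and the trap `[0..0]`, feeding the letters of "bcabc" one by one yields `False` four times and then `True`, because "abcabc" is a square starting inside the trap. Once at least as many letters follow the trap as the trap contains, the condition on $|x|$ holds for every starting position in the trap.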
Let us describe how to compute and . There is a well-known algorithm that efficiently calculates (see [CR02]). Once is found, we process . If , then remains unchanged (see the definition); otherwise we put . Next, if and , we compute by a naive algorithm. The following pseudo-code summarizes the description (for convenience, we define ).
The catcher requires time and space.
The algorithm that fills takes time (see [CR02]).
Suppose there exists a positive integer such that our algorithm computed a nonzero value of when had the length . Let be the set of all such integers. For each , denote by the value of that was computed when had the length . It suffices to prove that . Recall that the string is squarefree. Therefore by Lemma 1, the condition in line 6 implies for all . Denote for . It is straightforward that is an increasing sequence. Let be the maximal integer such that and . Let us first prove that .
It follows from the definition of catcher that for any , . Therefore for each , is a boundary of . Let us show that for all . Suppose for some (see fig. 2). Denote . Then by definition of . But is a prefix of because is a boundary of . So, we obtain a square and this is a contradiction. Thus, .
To estimate the sum , we first prove the following statement.
Suppose, to the contrary, for some , and there are such that and (see fig. 3). Denote . By definition of , we have . Now it is easy to see that . Indeed, if , then because is a boundary of ; but this implies that is a square. Thus . Since, by definition of and , and is squarefree, we have . Recall that by definition of . Finally, we obtain . But the condition in line 6 suggests that . This is a contradiction.
Now we can estimate the sum . If for each , , then . Let be the set of all such that . Denote . It follows from (1) that for at most one , . So, . In the same way we obtain that for at most one , . So, . Further, for at most one , . So, . This process leads to the inequality . Finally, we have . ∎
3 Unordered Alphabet
Theorem 1. For an unordered alphabet, there exists an online algorithm that detects squares in $O(n\log n)$ time and linear space.
Our algorithm maintains catchers and traps of these catchers cover the string . Let and be the maximal integer such that . We have one or two traps of the length : the first trap is equal to and if is even, we have another trap that is equal to (see fig. 4). If has became a multiple of after extension of , we add a new trap of the length and destroy two previous traps of the length if the new is odd and these traps exist. In the following pseudo-code we use the three-operand loop like in the C language.
To prove that the described system of traps covers the string , it suffices to note that if some iteration of the loop removes two catchers with the traps of the length , then the next iteration creates a catcher with the trap of the length on their place.
It is easy to see that if for some , the proposed algorithm maintains a trap of length , then . Hence it follows from Lemma 2 that all catchers of the algorithm use space. Since traps of the same length do not intersect, to maintain traps of the length , the algorithm takes, by Lemma 2, time. Thus, the algorithm requires overall time. ∎
4 Ordered Alphabet
In the case of ordered alphabet the following lemma narrows the area of square suffix search.
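Lemma 3. Let $text$ be a string of length $n$. Denote by $t$ the length of the longest suffix of $text$ that occurs at least twice in $text$. If $text[1..n{-}1]$ is squarefree and for some positive $k$, $text[n{-}k{+}1..n]$ is a square, then $t < k \le 2t$.
Proof. Suppose $k \le t$. Since $text[n{-}t{+}1..n]$ has two occurrences in $text$, the square $text[n{-}k{+}1..n]$ occurs twice and the string $text[1..n{-}1]$ is not squarefree. This is a contradiction.
Suppose $k > 2t$. Note that $k$ is even. Then the suffix $text[n{-}k/2{+}1..n]$ has at least two occurrences in $text$. This contradicts the definition of $t$. ∎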
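The lemma can be checked on small inputs. The sketch below assumes the following reading of its statement: if everything before the last letter is squarefree and the string has a square suffix of length $k$, then $t < k \le 2t$, where $t$ is the length of the longest suffix occurring at least twice. Under that assumption it suffices to test the even lengths in this window. Both function names are hypothetical, and the quadratic-time computation of $t$ is for illustration only.

```python
def longest_repeated_suffix(w: str) -> int:
    """Length of the longest suffix of w that occurs at least twice in w
    (occurrences may overlap)."""
    n = len(w)
    for t in range(n - 1, 0, -1):
        # The suffix itself starts at n - t; an earlier start means
        # a second occurrence.
        if w.find(w[n - t:]) < n - t:
            return t
    return 0


def square_suffix_length(w: str) -> int:
    """Length of a square suffix of w, testing only lengths k with
    t < k <= 2t; returns 0 if none is found. Assumes w[:-1] is squarefree."""
    n = len(w)
    t = longest_repeated_suffix(w)
    for k in range(t + 1, min(2 * t, n) + 1):
        if k % 2 == 0 and w[n - k:n - k // 2] == w[n - k // 2:]:
            return k
    return 0
```

For "abcabc" the longest repeated suffix is "abc", so $t = 3$ and the only even candidate length in the window is 6, which is indeed the square suffix.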
For each integer $i \in [1..n]$, denote by $t_i$ the length of the longest suffix of $text[1..i]$ that has at least two occurrences in $text[1..i]$. We say that there is an online access to the sequence $\{t_i\}$ if any algorithm that reads the string $text$ sequentially from left to right can read $t_i$ immediately after reading $text[i]$.
The following lemma describes an online algorithm for square detection based on an online access to $\{t_i\}$. Note that the alphabet is not necessarily ordered.
Lemma 4. If there is an online access to the sequence $\{t_i\}$, then there exists an algorithm that online detects squares in linear time and space.
Our algorithm online reads the string (as above, denotes the number of letters read) and maintains an integer variable such that (initially ). To detect a square, we use three catchers; the traps of these catchers are denoted by , , and . The traps satisfy the following conditions (any of these traps can be empty; see fig. 5):
Thus, the traps cover the block and therefore, by Lemma 3, the algorithm detects squares. Consider the following pseudo-code that maintains the catchers and the variable (“update” command replaces the corresponding catcher with a new one):
Clearly, the proposed algorithm preserves (2) and thus by Lemma 3, works correctly. Further, it follows from Lemma 2 that the algorithm uses linear space. To end the proof, it suffices to estimate the working time.
Suppose the algorithm processed a string of length . Let for any , the term th step refers to the set of instructions performed by the algorithm when it read and processed the letter . For any , denote by the value of that was calculated on th step. Let be the set of all steps on that the algorithm performed the line 4. At first we estimate the time required by steps ; then we estimate the time required by other steps: the updates of the first and second catchers and finally the updates of the third catcher.
Consider the steps . It follows from Lemma 2 that for any , the updates of the first, second, and third catchers on th step require , , and time respectively. Since , we have . Therefore, the steps require time.
Let . Consider the steps . Let be the set of all steps on that the algorithm updated the first and second catchers. Clearly . For any , the recalculation of the first and second catchers on th step takes time. The condition in line 6 implies that for any , . Hence the recalculation is performed in time.
Let be the set of all steps on that the algorithm updated the third catcher. Clearly . For any , this recalculation takes time on th step. It follows from the lines 9–10 that for any , . Then the recalculation takes time.
Thus, all recalculations of the catchers are performed in $O(n)$ time, which completes the proof. ∎
Theorem 2. For an ordered alphabet, there exists an algorithm that online detects squares in $O(n\log\sigma)$ time and linear space, where $\sigma$ is the number of different letters in the input string.
Corollary. For a constant alphabet, there exists an algorithm that online detects squares in linear time and space.
Some important problems remain open. To date, there is no nontrivial lower bound for the problem of square detection in the case of an ordered alphabet. It follows from [KK99] that such a lower bound immediately implies the same lower bound for the problem of Lempel-Ziv factorization (the latter is a widely used tool in stringology and data compression).
It is also interesting to construct an efficient online algorithm for square detection that allows a “rollback” operation, i.e., the operation that cuts off a suffix of arbitrary length from the read string. One such algorithm is presented in [Sh14].
- [AB96] A. Apostolico, D. Breslauer. An Optimal $O(\log\log N)$-Time Parallel Algorithm for Detecting All Squares in a String, SIAM Journal on Computing 25(6) (1996) 1318–1331.
- [Cr86] M. Crochemore. Transducers and repetitions, Theoretical Computer Science 45 (1986) 63–86.
- [CR02] M. Crochemore, W. Rytter. Jewels of stringology, World Scientific Publ. (2002).
- [HC08] J-J. Hong, G-H. Chen. Efficient on-line repetition detection, Theoretical Computer Science 407 (2008) 554–563.
- [JP05] J. Jansson, Z. Peng. Online and dynamic recognition of squarefree strings, MFCS (2005) 520–531.
- [KK99] R. Kolpakov, G. Kucherov. Finding maximal repetitions in a word in linear time, FOCS 40 (1999) 596–604.
- [LPT04] H.F. Leung, Z. Peng, H. F. Ting. An efficient online algorithm for square detection, Computing and Combinatorics (2004) 432–439.
- [ML85] M.G. Main, R.J. Lorentz. Linear time recognition of squarefree strings, Combinatorial Algorithms on Words (1985) 271–278.
- [OS08] D. Okanohara, K. Sadakane. An online algorithm for finding the longest previous factors, Algorithms-ESA 2008. Springer Berlin Heidelberg (2008) 696–707.
- [Sh14] A.M. Shur. Generating square-free words efficiently, WORDS’2013 special issue of Theoret. Comput. Sci. (submitted, 2014).
- [Th1906] A. Thue. Über unendliche Zeichenreihen, Norske Videnskabers Selskabs Skrifter, Mat.-Nat. Kl. 7 (1906) 1–22.
- [Uk95] E. Ukkonen. On-line suffix tree construction. Algorithmica 14.3 (1995) 249–260.