The sequence of open and closed prefixes of a Sturmian word

Some of the results contained in this paper were presented at the 9th International Conference on Words, WORDS 2013 DelFi13 ().
A finite word is closed if it contains a factor that occurs both as a prefix and as a suffix but does not have internal occurrences, otherwise it is open. We are interested in the oc-sequence of a word, which is the binary sequence whose $n$-th element is $0$ if the prefix of length $n$ of the word is open, or $1$ if it is closed. We exhibit results showing that this sequence is deeply related to the combinatorial and periodic structure of a word. In the case of Sturmian words, we show that these are uniquely determined (up to renaming letters) by their oc-sequence. Moreover, we prove that the class of finite Sturmian words is a maximal element with this property in the class of binary factorial languages. We then discuss several aspects of Sturmian words that can be expressed through this sequence. Finally, we provide a linear-time algorithm that computes the oc-sequence of a finite word, and a linear-time algorithm that reconstructs a finite Sturmian word from its oc-sequence.
Keywords: Sturmian word, closed word.
In a recent paper with M. Bucci BuDelFi13 (), the first two authors dealt with trapezoidal words (a generalization of finite Sturmian words), also with respect to the property of being closed or open. Let be a finite nonempty set (the alphabet). A (finite) word with is closed (also known as periodic-like CaDel01a ()) if it contains a factor that occurs both as a prefix and as a suffix but does not have internal occurrences, otherwise it is open. For example, the words , and are closed (any word of length $1$ is closed, the empty word being a factor that occurs both as a prefix and as a suffix but does not have internal occurrences); the words , and , instead, are open.
Given a finite or infinite word $w$, the sequence of open/closed prefixes of $w$, that we refer to as the oc-sequence of $w$, is the binary sequence $oc(w)$ whose $n$-th element is $1$ if the prefix of $w$ of length $n$ is closed, $0$ if it is open. For example, if , then .
A question that arises naturally is whether it is possible to reconstruct a word (up to renaming letters) from its oc-sequence. This is not true in general, even when the alphabet is binary. For example, the words and are not isomorphic (i.e., one cannot be obtained from the other by renaming letters), yet they have the same oc-sequence . As a first result of this paper, we show that if a word is known to be Sturmian, then it can be reconstructed (up to renaming letters) from its oc-sequence. That is, Sturmian words are characterized by their oc-sequences. Moreover, we prove that the class of finite Sturmian words is a maximal element with this property in the class of binary factorial languages.
In BuDelFi13 (), the authors investigated the structure of the sequence of the Fibonacci word . They proved that the lengths of the runs (maximal subsequences of consecutive equal elements) in form the doubled Fibonacci sequence. We prove in this paper that this doubling property holds for every standard Sturmian word, and describe the sequence of a standard Sturmian word in terms of the semicentral prefixes of , which are the prefixes of the form , where are letters and is an element of the standard sequence of . As a consequence, we show that the word , obtained from a standard Sturmian word starting with letter by replacing the first letter with a , can be written as the infinite product of the words , . Since the words are reversals of standard words, this induces an infinite factorization of in squares of reversed standard words.
We then show how the oc-sequence of a standard Sturmian word of slope is related to the continued fraction expansion of , both in terms of the convergents and of the continuants of .
Finally, we provide a linear-time algorithm that computes the oc-sequence of a finite word, and a linear-time algorithm that reconstructs a finite Sturmian word from its oc-sequence.
2 Open and closed words
Let be a finite alphabet. Let and stand respectively for the free monoid and the free group generated by . Their elements are called words over . The length of a word is denoted by . The empty word, denoted by , is the unique word of length zero and is the neutral element of and . If and , we let denote the number of occurrences of in .
A prefix (resp. a suffix) of a word is any word such that (resp. ) for some word . A factor of is a prefix of a suffix (or, equivalently, a suffix of a prefix) of . A prefix/suffix/factor of a word is proper if it is nonempty and does not coincide with the word itself. The sets of prefixes, suffixes and factors of the word are denoted by , and , respectively. From the definitions, we have that is a prefix, a suffix and a factor of any word. A border of a word is any word in different from . An occurrence of a factor in is a factorization . An occurrence of is internal if both and are nonempty.
A period of a nonempty word $w$ is an integer of the form $|w| - |v|$, where $v$ is a border of $w$. We call the period of $w$ the least of its periods, that is, the difference between the length of $w$ and the length of its longest border. Conventionally, the period of $\varepsilon$ is 1. The ratio between the length and the period of a word $w$ is called the exponent of $w$.
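The notions of border, period and exponent can be made concrete with a short sketch. The function names `borders`, `period` and `exponent` are hypothetical helpers introduced here for illustration, and the quadratic-time search is chosen for readability only.

```python
from fractions import Fraction


def borders(w: str) -> list[str]:
    """All borders of w, by increasing length: prefixes of w that are
    also suffixes of w, excluding w itself (the empty word is included)."""
    return [w[:k] for k in range(len(w)) if w.endswith(w[:k])]


def period(w: str) -> int:
    """Least period of w: |w| minus the length of its longest border.
    By convention the period of the empty word is 1."""
    if not w:
        return 1
    return len(w) - len(borders(w)[-1])


def exponent(w: str) -> Fraction:
    """Ratio between the length and the period of w."""
    return Fraction(len(w), period(w))


print(borders("abaaba"))   # ['', 'a', 'aba']
print(period("abaab"))     # 3
print(exponent("abaaba"))  # 2
```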
A factor of a word is left special in (resp. right special in ) if there exist such that and are factors of (resp. and are factors of ). A bispecial factor of is a factor that is both left and right special.
The word obtained by reading from right to left is called the reversal (or mirror image) of . A palindrome is a word such that . In particular, the empty word is a palindrome.
An infinite word over is a sequence , written as . Prefixes and factors of infinite words are naturally defined, as is the product of a finite word and an infinite word . Let denote the set of infinite words over . If is a finite nonempty word, denotes the periodic word . An infinite word is said to be ultimately periodic if there exist two finite words and such that ; an aperiodic word is an infinite word that is not ultimately periodic. An infinite word is recurrent if every factor of occurs infinitely often; equivalently, is recurrent if and only if every prefix of has a second occurrence in .
We recall the definitions of open and closed words given in Fi11 ():
A finite word $w$ is closed if it is empty or has a factor $v$ occurring exactly twice in $w$, as a prefix and as a suffix of $w$ (with no internal occurrences). A word that is not closed is called open.
For any letter $a$ and for any $n > 0$, the word $a^n$ is closed, $a^{n-1}$ being a factor occurring only as a prefix and as a suffix in it (this includes the special case of single letters, for which $n = 1$ and $a^{n-1} = \varepsilon$). More generally, every word whose exponent is at least $2$ is closed (BaFiLi15, Proposition 4).
The notion of closed word is equivalent to that of periodic-like word CaDel01a (). A word is periodic-like if its longest repeated prefix is not right special.
The notion of closed word is also closely related to the concept of complete return to a factor, as considered in GlJuWiZa09 (). A complete return to the factor in a word is any factor of having exactly two occurrences of , one as a prefix and one as a suffix. Hence, is closed if and only if it is a complete return to one of its factors; such a factor is clearly both the longest repeated prefix and the longest repeated suffix of (i.e., the longest border of ).
Let be a nonempty word over . The following characterizations of closed words follow easily from the definition:
the longest repeated prefix (resp. suffix) of does not have internal occurrences in , i.e., occurs in only as a prefix and as a suffix;
the longest repeated prefix (resp. suffix) of is not a right (resp. left) special factor of ;
has a border that does not have internal occurrences in ;
the longest border of does not have internal occurrences in .
Obviously, the negations of the previous properties characterize open words. In the rest of the paper we will use these characterizations freely and without explicit reference to this remark.
We conclude this section with some results on right extensions.
Let be a nonempty word over , and be such that is closed. Then has the same period as .
Let be the longest border of , and be the longest border of . By contradiction, suppose . Then is a prefix of , and therefore has an internal occurrence in , contradicting the hypothesis that is closed. Hence, is the longest border of , so that and have the same period . ∎
For all nonempty , there exists at most one letter such that is closed.
Straightforward after Lemma 4. ∎
If $w$ is closed, then exactly one such extension is closed. More precisely, we have the following result (see also CaDel01a, Prop. 4).
Let be a closed word. Then , , is closed if and only if has the same period as .
The case is trivially verified, so let be a nonempty closed word and be its longest border. Let be the letter such that has the same period as , i.e., such that is a prefix of . Then is closed, as its border cannot have internal occurrences. The converse follows from Lemma 4. ∎
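The results on right extensions can be checked experimentally on small words. The following is a minimal sketch, assuming a binary alphabet `"ab"` and a brute-force closedness test based on the longest border; all function names are hypothetical.

```python
def longest_border_length(w: str) -> int:
    """Length of the longest border of a nonempty word w."""
    return max(k for k in range(len(w)) if w.endswith(w[:k]))


def period(w: str) -> int:
    """Least period of a nonempty word w."""
    return len(w) - longest_border_length(w)


def is_closed(w: str) -> bool:
    """A word of length at least 2 is closed iff its longest border is
    nonempty and has no internal occurrence."""
    if len(w) <= 1:
        return True
    k = longest_border_length(w)
    # The first occurrence starting after position 0 must be the suffix one.
    return k > 0 and w.find(w[:k], 1) == len(w) - k


def closed_extensions(w: str, alphabet: str = "ab") -> list[str]:
    """Letters x of the alphabet such that wx is closed."""
    return [x for x in alphabet if is_closed(w + x)]


print(closed_extensions("aba"))  # ['b']
print(closed_extensions("aab"))  # []
```

In the first example the closed word aba has period 2, and its only closed extension abab keeps that period; the open word aab has no closed extension at all.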
3 The oc-sequence of a word
We now define the oc-sequence of a word.
Let $w$ be a finite or infinite word over . We define $oc(w)$, called the oc-sequence of $w$, as the binary sequence whose $n$-th element is $0$ if the prefix of length $n$ of $w$ is open, or $1$ if it is closed.
For example, if , then .
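As an illustration of the definition, the oc-sequence of a finite word can be computed directly by testing each prefix. This is a naive sketch with hypothetical function names, not the linear-time algorithm announced in the introduction; the sample word is a prefix of the Fibonacci word, chosen here only as test data.

```python
def is_closed(w: str) -> bool:
    """Closedness test via the longest border (brute force)."""
    n = len(w)
    if n <= 1:
        return True
    k = max(j for j in range(n) if w.endswith(w[:j]))  # longest border length
    # Closed iff that border is nonempty and has no internal occurrence.
    return k > 0 and w.find(w[:k], 1) == n - k


def oc_sequence(w: str) -> str:
    """n-th symbol is '1' if the prefix of length n is closed, else '0'."""
    return "".join("1" if is_closed(w[:n]) else "0"
                   for n in range(1, len(w) + 1))


print(oc_sequence("abaababaabaab"))  # 1010110011100
```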
By definition of closed word, for each integer $k \geq 0$, the $(k+1)$-st occurrence of 1 in $oc(w)$ is at the position corresponding to the end of the second occurrence of the prefix of length $k$ in $w$. Hence, if a finite word $w$ admits a border of length $k$, then
$$|oc(w)|_1 \geq k + 1.$$
In particular, a closed word $w$ is a complete return to its prefix of length $|oc(w)|_1 - 1$; equivalently, the period of a closed word $w$ is equal to $|w| - |oc(w)|_1 + 1$.
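The observation above suggests one possible way to obtain the oc-sequence in linear time: locate, for each length k, the end of the second occurrence of the prefix of length k. The sketch below does this with the standard Z-function; it is an illustrative assumption and not necessarily the algorithm given later in the paper. The names `z_function` and `oc_sequence_linear` are hypothetical.

```python
def z_function(w: str) -> list[int]:
    """z[i] = length of the longest common prefix of w and w[i:]."""
    n = len(w)
    z = [0] * n
    if n:
        z[0] = n
    left, right = 0, 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and w[z[i]] == w[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def oc_sequence_linear(w: str) -> str:
    n = len(w)
    if n == 0:
        return ""
    oc = [0] * n
    oc[0] = 1  # every word of length 1 is closed
    z = z_function(w)
    located = 0  # prefixes of length <= located already have a second occurrence
    for i in range(1, n):
        # Prefixes of length located+1 .. z[i] occur for the second time at i.
        for k in range(located + 1, z[i] + 1):
            oc[i + k - 1] = 1  # the prefix of length i + k is closed
        located = max(located, z[i])
    return "".join(map(str, oc))


print(oc_sequence_linear("abaababaabaab"))  # 1010110011100
```

Each prefix length k is handled at most once in the inner loop, so the total work beyond the Z-function is linear in the length of the word.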
In the following two propositions we relate recurrence and periodicity of an infinite word with analogous properties of its oc-sequence.
Let . The following are equivalent:
for a letter ;
Clearly, . To complete the proof, we show that . Let then be recurrent, and suppose by contradiction that occurs in it. Thus, there exists a positive integer such that occurs infinitely often in . Hence, for every , there exists such that is a prefix of and . Let be the prefix of length of ; by Remark 8, we obtain that the prefixes of of length and both have as a suffix. We have found two occurrences of at distance from each other, so that must have as a period. Since is arbitrary and , it follows that has period , so that ends in as a consequence of Lemma 6. This contradicts the hypothesis that is recurrent and contains . ∎
Let . The sequence is ultimately periodic if and only if is either periodic or not recurrent. In the first case, ends in , while in the latter case it ends in .
The “if” part is immediate. Let us then prove the “only if” part; let . Suppose first that does not occur in . Then ends in , so that has prefixes that have no other occurrences in ; hence, is not recurrent. If does not occur in , then ends in so that is periodic as a consequence of Lemma 6. Finally, suppose that both and occur in . Then there exists a positive integer such that occurs infinitely often in ; as we have seen in the proof of Proposition 9, this leads to a contradiction. ∎
The following lemma shows that in the sequence any run of s is at least as long as the previous run of s. It will be useful in what follows.
Given positive integers and , if is a factor of then .
Let with , and let with for all integers . Let be the letter such that . The result is clear in the case when is a prefix of , for this implies that begins in , where is a letter in different from . Since the longest border of is the empty word, it follows that the next occurrence of must occur within the suffix of , so that whence .
We may now assume that occurs in at some later position. Fix a positive integer such that is a suffix of . Let and be the prefix of of length . We note that since occurs in and not just as a prefix, we have and (hence is nonempty). It follows that there exist distinct letters such that begins in and terminates in . Hence, the second occurrence of in terminates in position , while the second occurrence of in terminates in position . If the second occurrence of in does not overlap the second occurrence of in , then . If the second occurrence of in overlaps the second occurrence of in by an amount , then we have that and has a border of length . Let denote the longest border of . Thus . First suppose that either or but and do not belong to the same run. Then, since by Remark 8, we deduce that
Finally, suppose with and belonging to the same run. In this case, and are both closed, so that is a prefix of . Therefore , since is a prefix of as well, and hence has a border of length . Now, let be the prefix of (and of ) that terminates with the first occurrence of ; then is necessarily open, and by Remark 8. It follows that if is a suffix of , hence . Thus, . ∎
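The lemma just proved can be observed on small examples. The following sketch computes the runs of an oc-sequence and checks that every complete run of 0s is at least as long as the run of 1s preceding it; it uses a brute-force closedness test, and all names are hypothetical.

```python
from itertools import groupby


def is_closed(w: str) -> bool:
    n = len(w)
    if n <= 1:
        return True
    k = max(j for j in range(n) if w.endswith(w[:j]))  # longest border length
    return k > 0 and w.find(w[:k], 1) == n - k


def oc_sequence(w: str) -> str:
    return "".join("1" if is_closed(w[:n]) else "0"
                   for n in range(1, len(w) + 1))


def runs(s: str) -> list[tuple[str, int]]:
    """Maximal runs of equal symbols, as (symbol, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def zero_runs_dominate(s: str) -> bool:
    """Check h <= k for every factor 1^h 0^k 1 made of maximal runs.
    The oc-sequence of a nonempty word starts with 1, so runs at even
    indices are runs of 1s; the last run may be cut off and is skipped."""
    r = runs(s)
    return all(r[i][1] <= r[i + 1][1] for i in range(0, len(r) - 2, 2))


print(runs(oc_sequence("ababbaba")))                # [('1', 1), ('0', 1), ('1', 2), ('0', 3), ('1', 1)]
print(zero_runs_dominate(oc_sequence("ababbaba")))  # True
```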
3.1 Sturmian words
We let be a fixed binary alphabet from now on, unless otherwise specified. An element of is a Sturmian word if it contains exactly $n+1$ distinct factors of length $n$, for every $n \geq 0$. A famous example of Sturmian word is the Fibonacci word
that is the limit, as , of the sequence of words , called the sequence of finite Fibonacci words, defined by , and, for every , .
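The defining property of Sturmian words can be observed on a long prefix of the Fibonacci word. In the sketch below the recurrence starts from the words b and a and concatenates each word with its predecessor; the starting index of the sequence is an assumption made for illustration, and the function names are hypothetical.

```python
def fibonacci_prefix(min_length: int) -> str:
    """A finite Fibonacci word of length at least min_length,
    which is a prefix of the infinite Fibonacci word."""
    prev, cur = "b", "a"
    while len(cur) < min_length:
        prev, cur = cur, cur + prev
    return cur


def factor_complexity(w: str, n: int) -> int:
    """Number of distinct factors of length n occurring in w."""
    return len({w[i:i + n] for i in range(len(w) - n + 1)})


f = fibonacci_prefix(1000)
print(f[:13])                                        # abaababaabaab
print([factor_complexity(f, n) for n in range(6)])   # [1, 2, 3, 4, 5, 6]
```

The prefix must be long compared with n for all factors of length n to appear; the values shown are for small n only.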
It is well known that if $w$ is a Sturmian word then at least one of $aw$ and $bw$ is also a Sturmian word. A Sturmian word $w$ is called standard (or characteristic) if $aw$ and $bw$ are both Sturmian words. The Fibonacci word is an example of standard Sturmian word. In the next section, we will deal specifically with standard Sturmian words. Here, we focus on finite factors of Sturmian words, called finite Sturmian words. Actually, finite Sturmian words are precisely the words $w$ over the binary alphabet verifying the following balance property: for any factors $u, v$ of $w$ such that $|u| = |v|$ one has $||u|_a - |v|_a| \leq 1$ (or, equivalently, $||u|_b - |v|_b| \leq 1$).
We let denote the set of finite Sturmian words. The language is factorial (i.e., if , then ) and extendible (i.e., for every there exist letters such that ).
We recall the following definitions given in DelMi94 ().
A word is a left special (resp. right special) Sturmian word if (resp. if ). A bispecial Sturmian word is a Sturmian word that is both left special and right special. Moreover, a bispecial Sturmian word is strictly bispecial if and are all Sturmian words; otherwise it is non-strictly bispecial.
For example, the word is a bispecial Sturmian word, since , , and are all Sturmian. This example also shows that a bispecial Sturmian word is not necessarily a bispecial factor of some Sturmian word (which must be a palindrome); in fact, bispecial factors of Sturmian words coincide with strictly bispecial Sturmian words (see Fi14 () for more details on bispecial Sturmian words).
It is known that if is a left special Sturmian word, then is a prefix of some standard Sturmian word, and the left special factors of are prefixes of . Symmetrically, if is a right special Sturmian word, then the right special factors of are suffixes of .
Regarding open and closed prefixes of Sturmian words, we prove the following result.
Every (finite or infinite) Sturmian word is uniquely determined, up to isomorphisms of the alphabet , by its oc-sequence .
We need some intermediate lemmas.
Let be a right special Sturmian word and let be its longest repeated prefix. Then is a suffix of .
If is closed, the claim follows from the definition of closed word. If is open, then is right special in , and by Remark 13 is a suffix of . ∎
Let be a right special Sturmian word. Then or is closed.
Let be the longest repeated prefix of and be the letter following the occurrence of as a prefix of . By Lemma 15, is a suffix of . Clearly, the longest repeated prefix of is , which is also a suffix of and cannot have internal occurrences in , otherwise the longest repeated prefix of would not be . Therefore, is closed. ∎
So, by Lemmas 5 and 16, if is a right special Sturmian word, then one of and is closed and the other is open. This implies that the oc-sequence of a (finite or infinite) Sturmian word characterizes it up to exchange of letters. The proof of Theorem 14 is therefore complete.
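The argument above also yields a simple reconstruction procedure: at each step, among the letters that keep the word balanced, exactly one produces a prefix whose open or closed status agrees with the next symbol of the oc-sequence. The sketch below implements this idea by brute force, fixing the first letter to a; it is a naive illustration and not the linear-time algorithm announced in the introduction. All names are hypothetical.

```python
def is_closed(w: str) -> bool:
    n = len(w)
    if n <= 1:
        return True
    k = max(j for j in range(n) if w.endswith(w[:j]))  # longest border length
    return k > 0 and w.find(w[:k], 1) == n - k


def is_balanced(w: str) -> bool:
    """Balance property over {a, b}: factors of equal length differ
    by at most 1 in their number of a's."""
    n = len(w)
    for length in range(1, n + 1):
        counts = [w[i:i + length].count("a") for i in range(n - length + 1)]
        if max(counts) - min(counts) > 1:
            return False
    return True


def reconstruct_sturmian(oc: str) -> str:
    """Rebuild the finite Sturmian word starting with 'a' whose
    oc-sequence is oc; raise ValueError if no unique extension exists."""
    if not oc:
        return ""
    if oc[0] != "1":
        raise ValueError("an oc-sequence always starts with 1")
    w = "a"
    for bit in oc[1:]:
        candidates = [w + x for x in "ab"
                      if is_balanced(w + x) and is_closed(w + x) == (bit == "1")]
        if len(candidates) != 1:
            raise ValueError("not the oc-sequence of a finite Sturmian word")
        w = candidates[0]
    return w


print(reconstruct_sturmian("1010110011100"))  # abaababaabaab
```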
We now prove that is maximal in the class of factorial languages over verifying the condition of Theorem 14, i.e., such that their members are determined by their sequences. Let us write when two words are isomorphic, and let
We note that is nonempty (e.g., ), partially ordered with respect to inclusion, and such that every increasing chain
with all has an upper bound in given by Thus, by Zorn’s lemma, admits at least one maximal element.
is a maximal element of
Again we need to recall two lemmas. The first is a well-known result about balanced words (cf. LothaireAlg, Proposition 2.1.3):
A word is not balanced if and only if there exists a palindrome such that .
Next is an immediate consequence of known properties of Christoffel words (cf. Fi14 ()).
A word is a non-strictly bispecial Sturmian word if and only if there exists a strictly bispecial Sturmian word and an integer such that
Proof of Theorem 17.
It follows from Theorem 14 that To see that is a maximal element of we show that no element of properly contains . Suppose to the contrary that there exists an element such that . Let be an element of minimal length of not belonging to . By Lemma 18, there exists a word such that . Since all proper factors of are balanced, without loss of generality we can assume that is a prefix of and is a suffix. Hence we can write for some .
Let be a border of . Since is balanced, we have . Writing and , it follows that and , whence by our minimality assumption on . Therefore is open, so that terminates in . We will show that and terminates in . It follows then that and that , a contradiction since .
By definition of it follows that . By minimality of the length of we have . Thus and , so that ; in other words, is a bispecial Sturmian word. On the other hand, as , we have that is non-strictly bispecial. Thus, by Lemma 19, there exists a word such that for some . Hence . Clearly, occurs only once in , as all other factors of the same length have one less occurrence of the letter . Thus, if is a border of , then . It follows that is a proper suffix of and so it has an internal occurrence in (as a proper suffix of ). Therefore is open, so that terminates in , as required. ∎
3.2 Standard Sturmian words
In this section, we deal with the oc-sequence of standard Sturmian words. In BuDelFi13 () a characterization of the oc-sequence of the Fibonacci word was given.
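Let $\alpha$ be an irrational number such that $0 < \alpha < 1$, and let $[0; d_0+1, d_1, \ldots]$ be the continued fraction expansion of $\alpha$. The sequence of words $(s_n)_{n \geq -1}$ defined by $s_{-1} = b$, $s_0 = a$ and $s_{n+1} = s_n^{d_n} s_{n-1}$ for $n \geq 0$, converges to the infinite word $w_\alpha$, called the standard Sturmian word of slope $\alpha$. The sequence of words $(s_n)_{n \geq -1}$ is called the standard sequence of $w_\alpha$.
Note that $w_\alpha$ starts with letter $b$ if and only if $\alpha > 1/2$, i.e., if and only if $d_0 = 0$. In this case, $[0; d_1+1, d_2, \ldots]$ is the continued fraction expansion of $1 - \alpha$, and $w_{1-\alpha}$ is the word obtained from $w_\alpha$ by exchanging $a$'s and $b$'s. Hence, without loss of generality, we will suppose in the rest of the paper that $w_\alpha$ starts with letter $a$, i.e., that $d_0 > 0$.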
For every , one has
for letters such that if is odd or if is even. Indeed, the sequence can be defined by: , , and, for every ,
where are as in (1).
The Fibonacci word is the standard Sturmian word of slope $(3 - \sqrt{5})/2$, whose continued fraction expansion is $[0; 2, 1, 1, 1, \ldots]$, so that $d_n = 1$ for every $n \geq 0$. Therefore, the standard sequence of the Fibonacci word is the sequence defined by: $s_{-1} = b$, $s_0 = a$, $s_{n+1} = s_n s_{n-1}$ for $n \geq 0$. This sequence is the sequence of finite Fibonacci words.
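A standard sequence is easy to generate from its directive sequence. The sketch below assumes the usual recurrence, in which the sequence starts with the words b and a and each new word is the previous one raised to the power given by the next directive term, followed by the word before it; the function name and the list-based indexing (index 0 holds the word b) are choices made here for illustration.

```python
def standard_sequence(d: list[int]) -> list[str]:
    """Standard words for the directive sequence d.
    The returned list starts with 'b' and 'a', followed by one word
    per element of d."""
    prev, cur = "b", "a"
    seq = [prev, cur]
    for exponent in d:
        prev, cur = cur, cur * exponent + prev
        seq.append(cur)
    return seq


# All directive terms equal to 1: the finite Fibonacci words.
print(standard_sequence([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']

print(standard_sequence([2, 1, 1]))
# ['b', 'a', 'aab', 'aaba', 'aabaaab']
```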
A standard word is a finite word belonging to some standard sequence. A central word is a word such that is a standard word, for letters .
It is known that every central word is a palindrome. Actually, central words play a central role in the combinatorics of Sturmian words and have several combinatorial characterizations (see Be07 () for a survey). We summarize some of these properties in the following proposition.
Let be a word over . The following are equivalent:
is a central word;
is a palindromic bispecial Sturmian word;
is a power of a single letter or it can be written as
for some words and and distinct letters .
Moreover, in this latter case, and are central words themselves, and is a complete return to the longest between and . In particular, central words are closed.
Let be a standard sequence. It follows by the definition that for every and , the word is a standard word. In particular, for every , the word is a standard word. Therefore, for every , we have that
is a central word.
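The relation between consecutive standard words and central words can be checked on examples: removing the last two letters from the product of a standard word with its predecessor in the sequence leaves a palindrome, and swapping the two factors only exchanges those last two letters. The sketch below assumes the usual recurrence for standard sequences; function names are hypothetical.

```python
def standard_sequence(d: list[int]) -> list[str]:
    prev, cur = "b", "a"
    seq = [prev, cur]
    for exponent in d:
        prev, cur = cur, cur * exponent + prev
        seq.append(cur)
    return seq


def central_words(d: list[int]) -> list[str]:
    """For each pair of consecutive standard words (older, newer),
    return newer + older without its last two letters."""
    seq = standard_sequence(d)
    return [(seq[i + 1] + seq[i])[:-2] for i in range(len(seq) - 1)]


def is_palindrome(u: str) -> bool:
    return u == u[::-1]


print(central_words([1, 1, 1, 1]))
# ['', 'a', 'aba', 'abaaba', 'abaababaaba']
print(all(is_palindrome(u) for u in central_words([2, 1, 3, 1])))  # True
```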
The following lemma is a well-known result (cf. fisch ()).
Let be a standard Sturmian word and let be its standard sequence. Then:
A standard word is a prefix of if and only if , for some and .
A central word is a prefix of if and only if , for some , , and distinct letters such that if is odd or if is even.
Note that is a central prefix of , but this does not contradict the previous lemma since, by (2), .
Recall that a semicentral word (see BuDelFi13 ()) is a word in which the longest repeated prefix, the longest repeated suffix, the longest left special factor and the longest right special factor all coincide. The following proposition summarizes some properties of semicentral words proved in BuDelFi13 ().
A word is semicentral if and only if for a central word and distinct letters . Moreover, has exactly one internal occurrence in , and this occurrence is preceded by and followed by . In particular, semicentral words are open (whereas central words are closed).
The semicentral prefixes of are precisely the words of the form , , where and are as in (1).
Since is a central word, the word is a semicentral word by definition, and it is a prefix of , which in turn is a prefix of by Lemma 24.
for some , , and distinct letters such that if is odd or if is even. In particular, this implies that .
If , then , and we are done. So, suppose by contradiction that . Now, on the one hand we have that is a prefix of by Lemma 24, and so is followed by as a prefix of ; on the other hand we have
so that is followed by as a prefix of , a contradiction. ∎
The next theorem shows the behavior of the runs in by determining the structure of the last elements of the runs.
Let , , be a prefix of . Then:
is open and is closed if and only if there exists such that ;
is closed and is open if and only if there exists such that .
1. If , then is semicentral and therefore open. The word is closed since its longest repeated prefix occurs only as a prefix and as a suffix in it.
Conversely, let be a closed prefix of such that is open, and let be the longest repeated suffix of . Since is closed, does not have internal occurrences in . Since is the longest repeated prefix of (if the longest repeated prefix of were longer than , then , which is a prefix of , would be repeated in and hence in , a contradiction) and is open, must have an internal occurrence in followed by a letter . Symmetrically, if is the letter preceding the occurrence of as a suffix of , since is the longest repeated suffix of one has that has an internal occurrence in preceded by a letter . Thus is left and right special in . Moreover, is the longest special factor in . Indeed, if is a left special factor of , then must be a prefix of . But cannot appear in since is closed, and if were a left special factor of , it would be a prefix of . Symmetrically, is the longest right special factor in . Thus is semicentral, and the claim follows from Proposition 26.
2. If , then is a central word and therefore it is closed. Its longest repeated prefix is . The longest repeated prefix of is either (if ) or (if ); in both cases, it has an internal occurrence as a prefix of the suffix . Therefore, is open.
Conversely, suppose that is any open prefix of such that is closed. If , then and we are done. Otherwise, by 1), there exists such that , where . We know that is closed and is open; it follows , as otherwise there should be in a semicentral prefix strictly between and . ∎
Note that, for every , one has:
Therefore, starting from an (open) semicentral prefix , one has a run of closed prefixes, up to the prefix , followed by a run of the same length of open prefixes, up to the prefix . See Table 1 for an illustration.
In Table 2, we show the first few elements of the sequence for the standard Sturmian word of slope , i.e., with and for every . One can notice that the runs of closed prefixes are followed by runs of the same length of open prefixes.
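The doubling of the runs can be observed numerically. The sketch below builds a finite standard word from a directive sequence whose first term is positive (so that the word starts with a), computes the run lengths of its oc-sequence by brute force, and compares each run of closed prefixes with the following run of open prefixes. The directive sequences used are arbitrary test data, not those of the tables, and all names are hypothetical.

```python
from itertools import groupby


def standard_word(d: list[int]) -> str:
    """Last word of the standard sequence with directive sequence d."""
    prev, cur = "b", "a"
    for exponent in d:
        prev, cur = cur, cur * exponent + prev
    return cur


def is_closed(w: str) -> bool:
    n = len(w)
    if n <= 1:
        return True
    k = max(j for j in range(n) if w.endswith(w[:j]))  # longest border length
    return k > 0 and w.find(w[:k], 1) == n - k


def oc_run_lengths(w: str) -> list[int]:
    """Lengths of the maximal runs of the oc-sequence of w
    (the first run is a run of closed prefixes)."""
    statuses = (is_closed(w[:n]) for n in range(1, len(w) + 1))
    return [len(list(group)) for _, group in groupby(statuses)]


def runs_are_doubled(run_lengths: list[int]) -> bool:
    """Compare each closed run with the open run that follows it,
    ignoring the last run, which may be cut off by the end of the word."""
    complete = run_lengths[:-1]
    return all(complete[i] == complete[i + 1]
               for i in range(0, len(complete) - 1, 2))


r = oc_run_lengths(standard_word([1] * 9))
print(r[:12])               # [1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 8, 8]
print(runs_are_doubled(r))  # True
```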
Multiplying (4) on the left by and on the right by , one obtains
Since , one has that , and therefore is the reversal of a standard word. By (5), is the reversal of a standard word.
Now, note that for , one has and . Thus, we have the following:
Let be the standard Sturmian word of slope , with , and let , with , be the continued fraction expansion of . The word , obtained from by replacing the first letter with the letter , can be written as an infinite product of squares of reversed standard words in the following way:
where is the sequence defined in (1).
In other words, one can write
Take the Fibonacci word. Then, , , , , , etc. So, , , , , etc. Indeed, is the reversal of the Fibonacci finite word . By Theorem 28, we have:
i.e., can be obtained by concatenating and the squares of the reversals of the finite Fibonacci words starting from .
Note that can also be obtained by concatenating the reversals of the finite Fibonacci words starting from :
and also by concatenating and the finite Fibonacci words starting from :
For a survey on various factorizations of the Fibonacci infinite word that make use of finite Fibonacci words the reader can see Fi15 ().
One can also characterize the oc-sequence of a standard Sturmian word in terms of the directive sequence of .
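Recall that the continuants of an integer sequence $(a_n)_{n \geq 0}$ are defined as $K[\,] = 1$, $K[a_0] = a_0$, and, for every $n \geq 1$,
$$K[a_0, \ldots, a_n] = a_n K[a_0, \ldots, a_{n-1}] + K[a_0, \ldots, a_{n-2}].$$
Continuants are related to continued fractions, as the $n$-th convergent of $[a_0; a_1, a_2, \ldots]$ is equal to $K[a_0, \ldots, a_n]/K[a_1, \ldots, a_n]$.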
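Let $w$ be a standard Sturmian word and $(s_n)_{n \geq -1}$ its standard sequence. Since $|s_{-1}| = |s_0| = 1$ and, for every $n \geq 0$, $|s_{n+1}| = d_n |s_n| + |s_{n-1}|$, one has, by definition, that for every $n \geq 0$
$$|s_n| = K[1, d_0, \ldots, d_{n-1}].$$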
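Continuants, convergents and the lengths of standard words can be computed with a few lines of code. The sketch below uses the standard three-term recurrence for continuants and assumes that the lengths of the standard words follow the same recurrence driven by the directive sequence, starting from two words of length 1; the function names are hypothetical.

```python
from fractions import Fraction


def continuant(seq: list[int]) -> int:
    """K[a_0, ..., a_n], with K[] = 1 and K[a_0] = a_0."""
    prev, cur = 0, 1
    for a in seq:
        prev, cur = cur, a * cur + prev
    return cur


def convergent(cf: list[int], n: int) -> Fraction:
    """n-th convergent of [a_0; a_1, a_2, ...] as a ratio of continuants."""
    return Fraction(continuant(cf[:n + 1]), continuant(cf[1:n + 1]))


def standard_lengths(d: list[int]) -> list[int]:
    """Lengths of the standard words of index 0, 1, ..., len(d)."""
    prev, cur = 1, 1
    lengths = [cur]
    for exponent in d:
        prev, cur = cur, exponent * cur + prev
        lengths.append(cur)
    return lengths


d = [2, 1, 3, 1]
print(standard_lengths(d))                                  # [1, 3, 4, 15, 19]
print([continuant([1] + d[:n]) for n in range(len(d) + 1)])  # [1, 3, 4, 15, 19]
print([convergent([0, 2, 1, 1, 1], n) for n in range(5)])
```

The last line lists the first convergents of the slope of the Fibonacci word, namely 0, 1/2, 1/3, 2/5 and 3/8.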
For more details on the relationships between continuants and Sturmian words the reader can see dL13 ().