From Regular to Strictly Locally Testable Languages (Extended Abstract)
A classical result (often credited to Y. Medvedev) states that every language recognized by a finite automaton is the homomorphic image of a local language over a much larger so-called local alphabet, namely the alphabet of the edges of the transition graph. Local languages correspond to the value $k = 2$ of the sliding window width in McNaughton and Papert’s infinite hierarchy of strictly locally testable languages ($k$-slt). We generalize Medvedev’s result in a new direction, studying the relationship between the width and the alphabetic ratio, which tells how much larger the local alphabet is. We prove that every regular language is the image of a $k$-slt language over an alphabet of doubled size, where the width $k$ depends logarithmically on the automaton size, and we exhibit regular languages for which any smaller alphabetic ratio is insufficient. More generally, we express the trade-off between alphabetic ratio and width as a mathematical relation derived from a careful encoding of the states. Finally, we mention some directions for theoretical development and application.
A classical result, often credited to Y. Medvedev, states that every regular language is the homomorphic image of a local language over a larger alphabet, called local. In a local language the sentences are characterized by three sets: the initial letters, the final letters and the set of factors of length 2. The value 2 is the width of the simplest sliding window device introduced by McNaughton and Papert. The result simply derives from the fact that the set of paths in an edge-labelled graph is a local language over the alphabet of the edges. Considering a finite automaton for the regular language, the local language of accepting paths can be naturally projected onto the original language.
Our work originates from two observations. First, in the classic result the alphabet of the local language is larger than the source alphabet by a multiplicative factor, to be called the alphabetic ratio, in the order of the square of the number of states. The simplicity of sliding window machines and languages is very attractive, but the huge size of the local alphabet in Medvedev’s theorem makes their application impractical.
Then a natural question concerns the local alphabet in the classical result: how small can the alphabetic ratio be? A small alphabet may, for instance, make it possible to encode messages from a regular language into an slt language, to be transmitted over a communication channel, so that a more economical sliding window receiver can be used instead of a general finite state machine.
Second, the local languages are one level of McNaughton and Papert’s infinite hierarchy of $k$-strictly locally testable ($k$-slt) languages. By considering $k$-slt languages with $k \geq 2$, instead of just 2-slt, i.e., local, languages, we raise a more general question: what is the minimum alphabetic ratio such that, for some finite width $k$, every regular language is the image under an alphabetic homomorphism of a $k$-slt language? In that case, how big does the width need to be? More precisely, our main result, which generalizes Medvedev’s theorem, expresses the trade-off between two parameters: the alphabetic ratio and the width.
We spend a few lines, without entering into details, on the early but enduring interest in subfamilies of regular languages characterized by some form of local testability.
At the basis of formal language theory, the classical theorem of N. Chomsky and M.P. Schutzenberger characterizes context-free languages by a homomorphism applied to the intersection of a Dyck language and a 2-slt one. Several similar characterizations for other language families were proved later. In mathematics, the slt languages have been applied in the theory of semigroups by A. De Luca and A. Restivo. In linguistics, a persistent idea is that natural languages can be modeled, at various levels, by locally testable properties. For instance, the psychologist W. Wickelgren observed that the set of English words is essentially a 3-slt (finite) language, and several brain scientists (in particular V. Braitenberg) have suggested that sequences of finite length, such as the factors occurring in a locally testable language, can be easily stored and recognized by certain neural circuits (in particular the synfire chains of M. Abeles) that have been observed in the cortex. In computational linguistics, locally testable definitions have proved useful at various levels of finite-state models. Many researchers working on language learning models have been attracted by the efficiency of learning algorithms for various types of locally testable languages. Contemporary comparative work on the aural pattern recognition capabilities of humans and animals has called attention to the subregular hierarchies induced by local testability. In mathematical biology, in his seminal article on language theory and DNA, T. Head shows that certain splicing languages are precisely the slt languages.
The paper is organized as follows. After the basic definitions in Section 2, Section 3 introduces a new classification of regular languages based on their homomorphic characterization via a $k$-slt language over a local alphabet of given size, and proves a lower bound on the alphabetic ratio. In Section 4 we state and prove a generalization of Medvedev’s theorem, including a mathematical analysis of the relationship between language complexity, alphabetic ratio, and width. The Conclusion presents an open problem and mentions conceivable developments and applications of the main result.
The empty word is denoted by $\varepsilon$. The terminal alphabet of the source language is denoted by $\Sigma$. For simplicity we deal only with languages in $\Sigma^+$, i.e., languages that do not contain the empty word. The cardinality of an alphabet will be called its arity; the arity of a language is the arity of its alphabet.
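A nondeterministic finite automaton (NFA) is a quintuple $A = (Q, \Sigma, \delta, q_0, F)$ where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, the transition relation (or graph) is $\delta \subseteq Q \times \Sigma \times Q$, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states, which does not contain $q_0$ (since only $\varepsilon$-free languages are considered).
Two transitions $(p_1, a, p_2)$ and $(p_3, b, p_4)$ are consecutive if $p_2 = p_3$. A path $\eta$ is a finite sequence of consecutive transitions $(p_0, a_1, p_1)$, $(p_1, a_2, p_2)$, $\ldots$, $(p_{n-1}, a_n, p_n)$. The origin of $\eta$ is $p_0$, its end is $p_n$, and its label is $a_1 a_2 \cdots a_n$. A successful path is a path with origin $q_0$ and end in $F$. The language recognized by $A$, denoted $L(A)$, is the set of labels of all successful paths of $A$.
We assume, without loss of generality, that the transition relation is total, i.e., for every $q \in Q$ and $a \in \Sigma$, the set $\{q' \mid (q, a, q') \in \delta\}$ is nonempty (if $\delta$ is not total, just add a new sink state to $Q$).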
Given another finite alphabet $\Lambda$, an (alphabetic) homomorphism is a mapping $h: \Lambda \to \Sigma$. For a language $L' \subseteq \Lambda^+$, its (homomorphic) image under $h$ is the language $h(L') = \{h(x) \mid x \in L'\}$.
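For every word $x \in \Sigma^+$ and every $k \geq 1$, let $i_k(x)$ and $t_k(x)$ denote the prefix and, respectively, the suffix of $x$ of length $k$ if $|x| \geq k$, or $x$ itself if $|x| < k$. Let $f_k(x)$ denote the set of factors of $x$ of length $k$. Extend $i_k$, $t_k$, $f_k$ to languages as usual, i.e., $i_k(L) = \{i_k(x) \mid x \in L\}$, $t_k(L) = \{t_k(x) \mid x \in L\}$, and $f_k(L) = \bigcup_{x \in L} f_k(x)$. A factor of a word $x = a_1 a_2 \cdots a_n$ starting at position $i$ and ending at position $j$, with $1 \leq i \leq j \leq n$, is defined as follows:
$$x(i, j) = a_i a_{i+1} \cdots a_j.$$
Hence, for $|x| \geq k$, $f_k(x) = \{x(i, i+k-1) \mid 1 \leq i \leq |x| - k + 1\}$.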
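A language $L \subseteq \Sigma^+$ is $k$-strictly locally testable, shortly $k$-slt, if there exist finite sets $I_{k-1}, T_{k-1} \subseteq \Sigma^{k-1}$ and $F_k \subseteq \Sigma^k$ such that, for every $x \in \Sigma^+$ with $|x| \geq k$, the following condition holds:
$$x \in L \iff i_{k-1}(x) \in I_{k-1} \;\wedge\; t_{k-1}(x) \in T_{k-1} \;\wedge\; f_k(x) \subseteq F_k.$$
(The original name is “$k$-testable in the strict sense”. This concept should not be confused with other language families based on local tests.)
A language is strictly locally testable (slt) if it is $k$-slt for some $k$, to be called the width.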
This definition ignores words shorter than $k$, which however can be checked directly against a finite set, if needed. The case $k = 2$ corresponds to the very well known family of local languages. The following example will be referred to later.
The language is -slt, i.e., local, since it can be defined by the sets , , .
It is known and straightforward to prove that the family of slt languages is strictly included in the family of regular languages, and it is an infinite strict hierarchy ordered by the width value. For instance, the language on , with a constant, is -slt, but it is not -slt. In fact, is defined by the sets: , , . However, is not -slt: consider the words and : , , . Hence, the two words above cannot be distinguished by using width .
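As an illustration of the definition, the following minimal Python sketch tests membership in a $k$-slt language with a sliding window. The function names and the two example languages, $(ab)^+$ for width 2 and $aa^+b$ for width 3, are assumptions chosen here for illustration and are not taken from the text above. As a simplification, the sketch rejects words shorter than the width, whereas the definition leaves them to a separate finite set.

```python
def factors(word, k):
    """Return the set of factors of length k of word."""
    return {word[i:i + k] for i in range(len(word) - k + 1)}


def in_k_slt(word, k, prefixes, suffixes, allowed):
    """Sliding-window membership test for a k-slt language (k >= 2).

    prefixes and suffixes are sets of words of length k-1,
    allowed is the set of permitted factors of length k.
    """
    assert k >= 2
    if len(word) < k:
        return False  # simplification: short words are not handled here
    return (word[:k - 1] in prefixes
            and word[-(k - 1):] in suffixes
            and factors(word, k) <= allowed)


# Width 2 (local language): (ab)^+, an assumed example.
AB_PLUS = (2, {"a"}, {"b"}, {"ab", "ba"})
# Width 3: words a a^+ b, another assumed example.
A_RUN_B = (3, {"aa"}, {"ab"}, {"aaa", "aab"})

print(in_k_slt("abab", *AB_PLUS), in_k_slt("abba", *AB_PLUS))
print(in_k_slt("aaab", *A_RUN_B), in_k_slt("aabab", *A_RUN_B))
```

The test only inspects a prefix, a suffix and the set of windows of length $k$, which is why such a recognizer needs no state beyond the current window.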
3 Lower Bounds
As said, every regular language, to be referred to as source, is the image of a 2-slt language whose arity may be much larger than the arity of the source. To talk precisely about the width of the slt language and of the ratio of the arities of the slt and source languages, we introduce a definition.
For , a language is -homomorphic if there exist an alphabet (called local) of arity , a -slt language , and a homomorphism such that .
Clearly, if is -slt then is trivially -homomorphic. Otherwise, a local alphabet larger than is needed. For instance, the language is not slt but the language of Ex. 1 is -slt. By defining as , , then and hence is -homomorphic. The alphabetic ratio of and is .
The traditional construction (e.g. in ) of a 2-slt language considers an NFA of size for , and uses set as local alphabet, i.e., up to elements. Hence we can restate Medvedev’s property saying that every regular language on is -homomorphic (the alphabetic ratio is ). However, it is straightforward to show that the arity of the local alphabet can be reduced to .
Every regular language, accepted by an NFA with states, is -homomorphic.
Let be an NFA. Define two mappings and such that , for every , and for every . The following sets define a 2-slt language :
We show first that
. Let .
Hence, there exists such that .
We claim that there exists a successful path of such that .
Let . Since , there exist , such that
, and .
Since , there exists such that .
Let be : has label , origin in and end in a final state;
By definition of , every factor
of , for , must be such that , hence all transitions of are consecutive, i.e., is a successful path of label .
We show that . Let be accepted by a successful path of of the form
with and . We claim that . In fact, , and . Since each (being a transition of ), . ∎
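The construction behind this kind of result can be checked mechanically on small automata. The sketch below assumes that a local symbol is a pair (letter, target state), so that the local alphabet has at most $|\Sigma| \cdot |Q|$ symbols; this reading, the function names and the example NFA (words over {a, b} ending in ab) are assumptions made for illustration and may differ in detail from the mappings used in the proof. It compares, by brute force up to a length bound, the homomorphic image of the local language with the language of the NFA.

```python
from itertools import product


def local_sets(delta, q0, finals):
    """Build the three sets of a local (2-slt) language from an NFA.

    delta is a set of transitions (p, a, q). A local symbol is a pair
    (a, q): letter a read while entering state q (an assumed encoding).
    """
    initial = {(a, q) for (p, a, q) in delta if p == q0}
    final = {(a, q) for (_, a, q) in delta if q in finals}
    pairs = {((a, p), (b, q))
             for (_, a, p) in delta
             for (p2, b, q) in delta if p2 == p}
    return initial, final, pairs


def in_local(word, initial, final, pairs):
    """Sliding-window test of width 2 on a tuple of local symbols."""
    return (len(word) > 0 and word[0] in initial and word[-1] in final
            and all(pair in pairs for pair in zip(word, word[1:])))


def nfa_accepts(word, delta, q0, finals):
    """Subset simulation of the NFA on a nonempty word."""
    current = {q0}
    for letter in word:
        current = {q for (p, a, q) in delta if p in current and a == letter}
    return len(word) > 0 and bool(current & finals)


def same_language(delta, q0, finals, max_len):
    """Compare the image of the local language with L(A) up to max_len."""
    sigma = sorted({a for (_, a, _) in delta})
    local_alphabet = sorted({(a, q) for (_, a, q) in delta})
    initial, final, pairs = local_sets(delta, q0, finals)
    image, accepted = set(), set()
    for n in range(1, max_len + 1):
        for w in product(local_alphabet, repeat=n):
            if in_local(w, initial, final, pairs):
                image.add("".join(a for a, _ in w))  # h erases the state
        for w in product(sigma, repeat=n):
            if nfa_accepts(w, delta, q0, finals):
                accepted.add("".join(w))
    return image == accepted


# Assumed example: NFA for words over {a, b} that end with "ab".
DELTA = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)}
print(same_language(DELTA, 0, {2}, 5))
```

A bounded comparison of this kind is only a sanity check, not a proof, but it catches mistakes in the three defining sets quickly.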
A natural question, addressed later, is whether allowing the width to be larger than 2 makes it possible to reduce the arity of the local alphabet further. Next we prove the simple, but perhaps unexpected, result that the local alphabet cannot be smaller than twice the size of the source one.
For every alphabet , there exists a regular language that is not -homomorphic, for every .
Let be defined by the regular expression . By contradiction, assume that there exist and a local alphabet of arity , a mapping and a -slt language such that . Since , there exists at least one symbol of , say, , such that there is only one symbol such that . Since the word , there exists such that . By definition of and of , . Consider the word . Clearly, , which is not in , since all words in have even length. Hence, . But , , and, by Definition 1, is in , a contradiction. ∎
The same result holds (with a very similar proof) if in the statement the class of strictly locally testable languages is replaced by the class of locally testable languages (the boolean closure of slt languages). The question whether an alphabetic ratio of two is sufficient is addressed in the next section.
4 Main Result
The intuitive idea that, by increasing the width, one can use a smaller alphabet for the slt language is now studied in detail. Our approach consists of defining an slt language using a larger alphabet that encodes the states traversed by the original automaton into words of fixed length. Our main theorem states the relationship between the language complexity in terms of number of states, the alphabetic ratio, and the width of the slt language.
If a language is accepted by a NFA with states, then for every , is -homomorphic.
The rest of the section is devoted to the proof. Special care is devoted to finding a very succinct encoding of the original states into strings over the local alphabet, in order to reach the minimal alphabetic ratio. Since it may be important for applications, our encoding also produces a small, although not optimal, width of the slt language. The proofs are organized so that the main lemmas hold independently of the chosen encoding, which only affects the numerical results.
The next definitions set the base for stating the properties a good encoding should have. Only fixed-length encodings are considered. Let be a finite alphabet. Let be a NFA, where is total, and let .
Given an integer , a code of into of length is a mapping such that for every , if then . Consider a word that is a factor of . We want to decode to one state. This will be useful when defining a slt language whose homomorphic image is . If , since may include the concatenation of and , , it is not decodable to just one state symbol; moreover, if then may not contain any factor of the form . However, if is exactly , then the word is bound to include at least one factor of the form , for some , which can be decoded to . In addition, we want this decoding to be unique.
The traditional notion of decodability (for every , if then ) is not adequate, since it assumes that the word to be decoded is a string in , while we need to consider a factor of . A word is said to be factor-decodable if there exists one, and only one, position , , such that there exists : . A code is factor-decodable if every word in is factor-decodable.
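Factor-decodability of a small fixed-length code can be tested exhaustively. The following sketch assumes that the windows to be decoded have length $2k-1$, where $k$ is the code length, and that a window is factor-decodable when exactly one of its $k$ possible positions starts a code word; the function name and the two toy codes are hypothetical. Since a window of length $2k-1$ overlaps at most three consecutive code words, it is enough to scan all concatenations of three code words.

```python
from itertools import product


def is_factor_decodable(code):
    """Check a fixed-length code given as a dict state -> code word.

    Every window of length 2k-1 taken from a concatenation of code
    words must contain a code word at exactly one position.
    """
    words = set(code.values())
    lengths = {len(w) for w in words}
    # The code must be injective and of fixed length.
    assert len(lengths) == 1 and len(words) == len(code)
    k = lengths.pop()
    for blocks in product(sorted(words), repeat=3):
        text = "".join(blocks)
        for start in range(len(text) - (2 * k - 1) + 1):
            window = text[start:start + 2 * k - 1]
            hits = [i for i in range(k) if window[i:i + k] in words]
            if len(hits) != 1:
                return False
    return True


# Hypothetical codes for two states p and q.
GOOD = {"p": "0100", "q": "1100"}   # each word ends in 00, no other 00
BAD = {"p": "01", "q": "10"}        # window 010 decodes at two positions

print(is_factor_decodable(GOOD), is_factor_decodable(BAD))
```

The second code is injective, hence decodable in the traditional sense on aligned input, yet it fails on factors, which is the distinction drawn in the paragraph above.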
For all finite alphabets of cardinalities and , with , , there exists a factor-decodable code of into of length , with:
Sketch of the proof. Let be a symbol. The idea is to let code be such that for every , ends with the word 00, i.e., and there is no other occurrence of 00 in . Formally, for every , , if then . This is enough for factor-decodability. To find how large must be as a function of and , first consider, for every , the set of words in such that has suffix 00 and in there is no other occurrence of 00. If , then it is possible to assign a distinct word in to every state of . The definition of is by induction on . , i.e., the only word in is 00. . Given sets , let be:
Hence, , and
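This recurrence relation is strictly connected to the so-called Lucas sequence $U_n(P, Q)$, where $P$, $Q$ are integers (beware that $Q$ here is not the set of states; see, e.g., p. 395 of Dickson’s History of the Theory of Numbers): $U_0 = 0$, $U_1 = 1$, and for $n \geq 2$, $U_n = P\,U_{n-1} - Q\,U_{n-2}$. For $P = 1$, $Q = -1$ this is just the Fibonacci sequence. If $P^2 - 4Q \neq 0$, a closed-form solution for every $n \geq 0$ is $U_n = \frac{a^n - b^n}{a - b}$, where $a, b = \frac{P \pm \sqrt{P^2 - 4Q}}{2}$. With standard algebraic manipulations and by defining as in the statement of the Lemma, one can derive that: is satisfied if .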
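The counting argument can be reproduced on small cases. The sketch below enumerates, for a local alphabet of assumed size $m$ (written with the digits 0, 1, ...), the words of length $k$ that end in 00 and contain no other occurrence of 00, and compares their number with the recurrence $c_k = (m-1)(c_{k-1} + c_{k-2})$, $c_2 = 1$, $c_3 = m-1$. This recurrence is derived here from the verbal description of the code and is an assumption; it has the Lucas form with $P = m-1$, $Q = -(m-1)$, and the resulting code length need not coincide with the closed-form value of the Lemma.

```python
from itertools import product


def candidate_codewords(m, k):
    """Words of length k over m digits ending in 00 with no other 00."""
    digits = "0123456789"[:m]
    words = []
    for letters in product(digits, repeat=k):
        w = "".join(letters)
        # The first occurrence of 00 must be the final one.
        if w.endswith("00") and w.find("00") == k - 2:
            words.append(w)
    return words


def count_by_recurrence(m, k):
    """Number of candidate code words of length k >= 2 (assumed recurrence)."""
    assert m >= 2 and k >= 2
    prev, cur = 1, m - 1          # counts for lengths 2 and 3
    if k == 2:
        return prev
    for _ in range(k - 3):
        prev, cur = cur, (m - 1) * (cur + prev)
    return cur


def shortest_length(m, n):
    """Smallest k such that n states get distinct candidate code words."""
    k = 2
    while count_by_recurrence(m, k) < n:
        k += 1
    return k


print(candidate_codewords(2, 4))
print([count_by_recurrence(2, k) for k in range(2, 8)])
print(shortest_length(2, 5))
```

For a binary local alphabet the counts follow the Fibonacci numbers, so the code length grows logarithmically with the number of states, in line with the discussion above.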
Both and are monotonically decreasing with , although very slowly for large , with , . with, moreover, , . The expression for is . By definition of a code, cannot be smaller than , i.e., is , hence the code of Lemma 1 is asymptotically optimal. In particular, the ratio , where is computed by the above formula, is dominated by the term , which is very close to for . Hence, no encoding can significantly improve (or ), decreasing . A few examples of approximate values for , , and are:
To prove Th. 2, a few more definitions are required.
Define the following alphabetic homomorphisms:
, are such that ,
for every .
A path of of length is called a -path. Paths of , , are called consecutive if is also a path of (i.e, , for all ). With an abuse of notation, let be defined on paths as follows. Let be a -path. If then ; if , let be the unique word in such that , (i.e., if is a -path).
If , then there exist a unique and a unique such that ; hence, there exist consecutive paths of , denoted by such that , each , , is a -path and is a -path. This decomposition in consecutive paths is called the canonical decomposition of . Then, is defined as .
Let be the -slt language defined by the following sets:
The proof of the following lemma follows from uniqueness of factor-decodability:
Let be a factor-decodable code. For all , there exist a position , , and two consecutive paths of such that:
is a -path, and ;
for any two consecutive paths of , , if is a -path and is a suffix of then and ;
if for some , then , is a -path and ;
if , then .
There exists a finite language such that .
Sketch of the proof. Let be the set of words in of length less than .
Part (I): . Assume that . To show that there exists a successful path of such that , we first claim the following result for every path, whether successful or not:
The proof of (*) is by induction on the canonical decomposition of . Part (I) can now be completed. For all , let be a successful path of with ; moreover, let be the canonical decomposition of . By (*), . But is successful: , hence ; , hence . Therefore, .
Part (II): . The proof needs the assumption that code is factor-decodable. The following property can be proved by induction on , by applying Lemma 2:
(+) for all words , , if and then there exists a path of such that and .
The proof of Part (II) follows from (+). In fact, if , with , then there exists such that . Since in this case , , , by (+) there exists a path of with origin in and such that . Let be the canonical decomposition of , with , and (hence ). Let and consider . Apply Lemma 2, Part (1), to . Hence, there exist a position and consecutive , with a -path, such that . Since (of length ) is also a suffix of , by Part (2) of Lemma 2, . Since , also paths are consecutive. Hence, . Therefore, path has label , origin , and end in , i.e., it is successful: .
Hence, by enlarging the local alphabet, a smaller width suffices to construct the slt language. However, it is useless to take an alphabetic ratio , since in this case one can use the simpler construction of Prop. 1. To finish, we note that for many regular languages one can obtain a homomorphic definition that uses lower values of alphabetic ratio and/or width than those obtained by the main theorem.
We have generalized Medvedev’s homomorphic characterization of regular languages: instead of using as generator a local language over a large alphabet, which depends on the complexity of the regular language, we can use a strictly locally testable language over a smaller alphabet that does not depend on complexity, but just on the source alphabet. We have proved that the smallest alphabet one can use in the generator is the double of the alphabet of the regular language; thus, for instance, four symbols suffice to homomorphically generate any regular binary language.
In the main proof we have offered a specific and fairly optimized construction of the strictly locally testable language, for which we have derived the relationship between the width, the alphabetic ratio, and the complexity of the regular language. In our opinion, the construction is of independent interest, as a new technique for simulating an NFA by means of a larger, yet strictly locally testable, machine. Our encoding is asymptotically optimal with respect to language complexity, and remains very close to the theoretical optimum for finite values of complexity. But it is an open technical question whether a different construction would yield better values for the alphabetic ratio and the width parameter.
Applications and developments of our result are conceivable in areas where a language characterization à la Medvedev has been found valuable, such as the following.
Picture languages. A main family of 2-dimensional languages, the tiling systems, is defined by a 2-dimensional Medvedev characterization. Does our result extend to 2D languages?
Context-free languages. Combining our result with the Chomsky-Schutzenberger theorem it should be possible to obtain non-erasing homomorphic characterizations using a small alphabet.
Consensual languages. This generalization of finite-state machines, motivated by modelling tightly connected concurrent computations, uses homomorphism between words as its core mechanism.
Information transmission for reducing the receiver cost was already mentioned in the introduction.
Acknowledgments: Thanks to Aldo De Luca for suggesting relevant references.
-  A. de Luca & A. Restivo (1980): A Characterization of Strictly Locally Testable Languages and Its Applications to Subsemigroups of a Free Semigroup. Information and Control 44(3), pp. 300–319.
-  J. Berstel & J.E. Pin (1996): Local Languages and the Berry-Sethi Algorithm. Theor. Comput. Sci. 155(2), pp. 439–446, doi:10.1016/0304-3975(95)00104-2.
-  V. Braitenberg (2004): Das Bild der Welt im Kopf: Eine Naturgeschichte des Geistes. LIT Verlag, Muenster.
-  P. Caron (2000): Families of locally testable languages. Theor. Comput. Sci. 242(1-2), pp. 361–376, doi:10.1016/S0304-3975(98)00332-6.
-  S. Crespi Reghizzi & P. San Pietro (2011): Consensual languages and matching finite-state computations. RAIRO Theor. Informatics and Appl. 45, pp. 77–97.
-  L.E. Dickson (1919): History of the Theory of Numbers. Carnegie Institution of Washington, Online version at: http://www.archive.org/details/historyoftheoryo01dick.
-  P. Garcia & E. Vidal (1990): Inference of K-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 12(9), pp. 920–925, doi:10.1109/34.57687.
-  D. Giammarresi & A. Restivo (1997): Two-dimensional languages. In G. Rozenberg & A. Salomaa, editors: Handbook of formal languages, vol. 3: beyond words, Springer-Verlag New York, Inc., New York, NY, USA, pp. 215–267.
-  T. Head (1987): Formal language theory and DNA: An analysis of the generative capacity of specific recombinant behaviors. Bulletin of Mathematical Biology 49, pp. 737–759, doi:10.1007/BF02481771.
-  J. Rogers & G. Pullum (2011): Aural pattern recognition experiments and the subregular hierarchy. Journal of Logic, Language and Information, to appear.
-  R. McNaughton & S. Papert (1971): Counter-free Automata. MIT Press, Cambridge, USA.
-  Y. T. Medvedev (1964): On the class of events representable in a finite automaton. In E. F. Moore, editor: Sequential machines – Selected papers (translated from Russian), Addison-Wesley, New York, NY, USA, pp. 215–227.
-  S. Eilenberg (1974): Automata, Languages, and Machines. Academic Press.
-  W. A. Wickelgren (1969): Context-sensitive coding, associative memory and serial order in (speech) behavior. Psychological Review 76(1), doi:10.1037/h0026823.