Towards a Theory of Complexity of Regular Languages
This work was supported by the Natural Sciences and Engineering Research Council of Canada grant No. OGP0000871.
We survey recent results concerning the complexity of regular languages represented by their minimal deterministic finite automata. In addition to the quotient complexity of the language – which is the number of its (left) quotients, and is the same as its state complexity – we also consider the size of its syntactic semigroup and the quotient complexity of its atoms – basic components of every regular language. We then turn to the study of the quotient/state complexity of common operations on regular languages: reversal, (Kleene) star, product (concatenation) and boolean operations. We examine relations among these complexity measures. We discuss several subclasses of regular languages defined by convexity. In many, but not all, cases there exist “most complex” languages, that is, languages that meet the upper bounds for all these complexity measures.
Keywords: atom, boolean operation, complexity measure, concatenation, convex language, most complex language, quotient complexity, regular language, reversal, star, state complexity, syntactic semigroup, unrestricted complexity
We study the complexity of regular languages represented by their minimal deterministic finite automata (DFAs). The number of states in the minimal DFA of a language is its state complexity [51, 63]; this number is used as a first measure of complexity. But languages having the same state complexity can be quite simple or very complex. How do we decide whether one language is more complex than another? In this respect, the size of the syntactic semigroup of the language – which is isomorphic to the transition semigroup of its minimal DFA – appears to be a good measure.
Another way to distinguish two regular languages of the same state complexity is by comparing how difficult it is to perform operations on these languages. The state complexity of a regularity-preserving unary operation on a language is defined as the maximal complexity of the result of the operation expressed as a function of the state complexity of the language. For example, we know that there are regular languages of state complexity $n$ whose reverses have state complexity $2^n$, but many languages do not meet this bound. For binary operations we have two languages of state complexities $m$ and $n$, respectively. The state complexity of a binary operation is the maximal state complexity of the result, expressed as a function of $m$ and $n$.
In general, to establish the state complexity of a unary operation, we need to find an upper bound on this complexity and a language for each $n$ that meets this bound. This sequence of languages is called a stream. The languages in the stream often have the same structure and differ only in the parameter $n$. For binary operations we need two streams. For some operations the same stream can be used for both operands. However, if the second operand cannot be the same as the first, it can usually be a dialect of the first operand – a language that differs only slightly from the first.
It has been proved that the stream of regular languages shown in Fig. 1 is most complex because it meets the following complexity bounds: the size of the syntactic semigroup, and the state complexities of reversal, (Kleene) star, product/concatenation, and all binary boolean operations. It also has the largest number of atoms (discussed later), and all the atoms have maximal state complexity.
The alphabet of a regular language is (or is a language over ) if and every letter of appears in a word of . In addition to the usual state complexity of binary operations on languages over the same alphabet, unrestricted state complexity on languages over different alphabets has also been studied . By adding an input that induces the identity transformation in the DFA of Fig. 1, we obtain a most complex language that also meets the bounds for unrestricted operations.
A natural question then arises whether most complex language streams also exist in proper subclasses of regular languages. The answer is positive for many, but not all, classes. A rich source of subclasses is provided by the concept of convexity. In this paper we summarize the results for many classes of convex languages.
Many of these results were presented as an invited talk at the 20th International Conference on Developments in Language Theory, Montréal, Québec on July 25, 2016. A short abstract appeared in .
2 Quotient/State Complexity of Regular Languages
Let be a nonempty set, called an alphabet, consisting of letters , . A word over is a sequence , where , ; if , the word is empty and is denoted by . A language over is any subset of , where is the free monoid generated by with as the identity, that is, is the set of all words over . Recall that if is a language over , every letter of appears in at least one word of .
The languages (the empty language) and , (the letter languages) are called basic. A language is regular if it can be constructed from the basic languages using only the operations union (denoted by ), product (concatenation) (denoted by juxtaposition: ), and star (denoted by , where , and ).
If $w \in \Sigma^*$ and $L \subseteq \Sigma^*$, the (left) quotient of $L$ by $w$ is the language $w^{-1}L = \{x \in \Sigma^* \mid wx \in L\}$; it is the set of “all words that can follow $w$ in $L$”. It is well known that a language is regular if and only if it has a finite number of distinct quotients [6, 54]. So it is natural to consider the number of quotients of a regular language $L$ as a complexity measure, which we call the quotient complexity of $L$ and denote by $\kappa(L)$.
Quotients can be computed as follows: For , and we have
When we compute quotients this way, they are represented by expressions involving the basic languages, union, product and star, and it may not be obvious that two different expressions denote the same quotient. However, it is easy to recognize similarity, where two expressions are similar  if one can be obtained from the other using the following rules:
The number of dissimilar expressions of a regular language is always finite .
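The computation of quotients from expressions can be sketched in code. The following is a minimal illustration, not the paper's own procedure: expressions are nested tuples, and the simplifications built into the constructors (union treated as associative, commutative and idempotent, with the empty language and the empty word absorbed) are an assumption standing in for the similarity rules. All function names are hypothetical.

```python
EMPTY = ("empty",)   # the empty language
EPS = ("eps",)       # the language containing only the empty word

def sym(a):
    return ("sym", a)

def union(r, s):
    # Flatten nested unions, drop duplicates and EMPTY (assumed similarity rules).
    parts = set()
    for t in (r, s):
        if t[0] == "union":
            parts.update(t[1])
        elif t != EMPTY:
            parts.add(t)
    if not parts:
        return EMPTY
    if len(parts) == 1:
        return next(iter(parts))
    return ("union", frozenset(parts))

def cat(r, s):
    # EMPTY annihilates a product; EPS is its identity.
    if r == EMPTY or s == EMPTY:
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)

def star(r):
    return ("star", r)

def nullable(r):
    """True if the language of r contains the empty word."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "union":
        return any(nullable(t) for t in r[1])
    return nullable(r[1]) and nullable(r[2])

def deriv(r, a):
    """Expression for the quotient of r by the letter a."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "union":
        out = EMPTY
        for t in r[1]:
            out = union(out, deriv(t, a))
        return out
    if tag == "cat":
        left = cat(deriv(r[1], a), r[2])
        return union(left, deriv(r[2], a)) if nullable(r[1]) else left
    return cat(deriv(r[1], a), r)  # star

def dissimilar_derivatives(r, alphabet):
    """All expressions reachable from r by repeated letter quotients."""
    seen, todo = {r}, [r]
    while todo:
        cur = todo.pop()
        for a in alphabet:
            d = deriv(cur, a)
            if d not in seen:
                seen.add(d)
                todo.append(d)
    return seen
```

The size of the returned set is an upper bound on the quotient complexity, since two dissimilar expressions may still denote the same quotient.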
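A concept closely related to a regular language is that of a deterministic finite automaton (DFA), which is a quintuple $\mathcal{D} = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite non-empty set of states, $\Sigma$ is a finite non-empty alphabet, $\delta \colon Q \times \Sigma \to Q$ is the transition function, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states. We extend $\delta$ to functions $\delta \colon Q \times \Sigma^* \to Q$ and $\delta \colon 2^Q \times \Sigma^* \to 2^Q$ as usual. A DFA $\mathcal{D}$ accepts a word $w \in \Sigma^*$ if $\delta(q_0, w) \in F$. The set of all words accepted by $\mathcal{D}$ is the language accepted by $\mathcal{D}$, denoted by $L(\mathcal{D})$. If $q$ is a state of $\mathcal{D}$, then the language $L_q(\mathcal{D})$ of $q$ is the language accepted by the DFA $(Q, \Sigma, \delta, q, F)$. A state is empty if its language is empty. Two states $p$ and $q$ of $\mathcal{D}$ are equivalent if $L_p(\mathcal{D}) = L_q(\mathcal{D})$. A state $q$ is reachable if there exists $w \in \Sigma^*$ such that $\delta(q_0, w) = q$. A DFA is minimal if all of its states are reachable and no two states are equivalent.
The famous theorem of Kleene states that a language is regular if and only if it is accepted by a DFA. We can derive a DFA accepting a regular language directly from its quotients. Denote the set of quotients of $L$ by $K = \{K_0, \dots, K_{n-1}\}$, where $K_0 = L = \varepsilon^{-1}L$ by convention. Each quotient $K_i$ can be represented also as $w_i^{-1}L$, where $w_i \in \Sigma^*$ is such that $w_i^{-1}L = K_i$. Now define the quotient DFA of $L$ as follows: $\mathcal{D} = (K, \Sigma, \delta, K_0, F)$, where $\delta(K_i, a) = K_j$ if $a^{-1}K_i = K_j$, and $F = \{K_i \mid \varepsilon \in K_i\}$. This DFA accepts $L$ and is minimal. (Footnote: If a DFA is constructed using dissimilar expressions and is not minimal, it can be minimized by one of several methods [5, 45, 52], by merging states corresponding to the same expression.)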
In any DFA , if , then , known also as the right language of , is precisely the quotient . Evidently, the state complexity of a language is equal to its quotient complexity. From now on we refer to the quotient/state complexity of simply as the complexity of .
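Since the complexity of a language equals the number of states of its minimal DFA, it can be computed from any DFA by discarding unreachable states and merging equivalent ones. The sketch below uses a simple Moore-style partition refinement; the function name and the dictionary encoding `delta[q][a]` are assumptions made for illustration.

```python
def quotient_complexity(alphabet, delta, start, finals):
    """Number of states of the minimal DFA equivalent to the given DFA."""
    # Step 1: keep only the states reachable from the initial state.
    reachable, todo = {start}, [start]
    while todo:
        q = todo.pop()
        for a in alphabet:
            p = delta[q][a]
            if p not in reachable:
                reachable.add(p)
                todo.append(p)
    # Step 2: start from the final / non-final split and refine until stable.
    block = {q: int(q in finals) for q in reachable}
    while True:
        ids = {}
        new_block = {}
        for q in reachable:
            # A state's signature: its own block and the blocks of its successors.
            sig = (block[q],) + tuple(block[delta[q][a]] for a in alphabet)
            new_block[q] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)  # no block was split: the partition is stable
        block = new_block


# Hypothetical example: words over {a, b} ending in a, with a redundant
# state 2 (equivalent to state 0) and an unreachable state 3.
EXAMPLE_DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 2},
    3: {"a": 3, "b": 0},
}
```

For the example DFA, `quotient_complexity("ab", EXAMPLE_DELTA, 0, {1})` counts the two quotients of the language.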
3 Syntactic/Transition Semigroups
According to our complexity measure any two languages with quotients have the same complexity. But consider the language accepted by the minimal DFA of Fig. 1 and the language . Intuitively is much simpler than .
It was proposed in  that the size of the syntactic semigroup of a language should be used as an additional complexity measure. We proceed to define it now.
The Myhill congruence , also known as the syntactic congruence, of a language is defined on as follows: For ,
The quotient set of equivalence classes of is a semigroup, the syntactic semigroup of . The syntactic complexity of a language is the cardinality of the syntactic semigroup.
Returning to our example, the syntactic complexity of is known to be , whereas that of is ; hence syntactic complexity clearly distinguishes the two languages.
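Let $Q_n$ be a set of $n$ elements. Without loss of generality, we assume $Q_n = \{0, \dots, n-1\}$. A transformation of $Q_n$ is a mapping $t \colon Q_n \to Q_n$. The image of $q \in Q_n$ under $t$ is denoted by $qt$. If $s, t$ are transformations of $Q_n$, their composition is defined by $q(st) = (qs)t$. Let $\mathcal{T}_{Q_n}$ be the set of all transformations of $Q_n$; then $\mathcal{T}_{Q_n}$ is a monoid under composition.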
For , a transformation of a set is a -cycle if . This -cycle is denoted by , and it acts as the identity on the states not in the cycle. A 2-cycle is a transposition. A transformation that sends all the states of to and acts as the identity on the remaining states is denoted by . If we write for . The identity transformation is denoted by . The notation denotes a transformation that sends to for and is the identity for the remaining states, and is defined similarly.
Let be a DFA, where we use as the set of states, without loss of generality. Each word induces a transformation of the set defined by ; we denote this by . Sometimes we use the word to denote the transformation it induces; thus we write instead of . We extend the notation to sets of states: if , then . We also write to mean that the image of under is .
The set of all transformations induced by non-empty words forms a semigroup of transformations called the transition semigroup of . This semigroup is generated by . We use the transition semigroup rather than the transition monoid, because the latter always has the identity transformation induced by the empty word, whereas, in the semigroup, if the identity exists it must be induced by a non-empty word. For a more detailed discussion of the necessity of distinguishing between semigroups and monoids see [39, Chapter V], for example.
If is a minimal DFA of , then is isomorphic to the syntactic semigroup of , and we represent elements of by transformations in . We return to syntactic complexity later.
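The transition semigroup can be enumerated by closing the letter transformations under composition. The sketch below is illustrative only: transformations are tuples `t` with `t[q]` the image of state `q`, and the generators in `witness` (an n-cycle, a transposition, and a transformation sending state n-1 to 0) are an assumption of this example rather than something read off a figure.

```python
def transition_semigroup(letters):
    """All transformations induced by non-empty words over the given letters."""
    gens = list(letters.values())
    seen = set(gens)
    todo = list(seen)
    while todo:
        t = todo.pop()
        for g in gens:
            # Transformation of the word w followed by a letter: apply t, then g.
            tg = tuple(g[t[q]] for q in range(len(t)))
            if tg not in seen:
                seen.add(tg)
                todo.append(tg)
    return seen


def witness(n):
    """Assumed generators: a = (0,...,n-1), b = (0,1), c = (n-1 -> 0)."""
    a = tuple((q + 1) % n for q in range(n))
    b = tuple({0: 1, 1: 0}.get(q, q) for q in range(n))
    c = tuple(0 if q == n - 1 else q for q in range(n))
    return {"a": a, "b": b, "c": c}
```

With these generators the semigroup contains every transformation of the state set, whereas a single cyclic letter on three states generates only three transformations.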
Since quotients play a key role in defining a regular language we should also consider their complexity. In our example of Fig. 1 all quotients have complexity . In the case of the language , the quotients , ,, , , where have complexities , respectively. In general, however, the complexity of quotients is not a very good measure because it is always if the DFA is strongly connected. But to ensure that most complex languages also have most complex quotients, we add the complexities of quotients as one of our measures.
For a regular language and words consider the left congruence:
An atom is a congruence class of ; thus two words and are in the same class if If and is a regular language with quotients , then each subset of defines an atomic intersection , where and for any ; an atom of is a non-empty atomic intersection. It follows that each quotient is a union of atoms, namely of all the atoms in which appears uncomplemented. It is also known that quotients of atoms are unions of atoms . Thus atoms are fundamental components of a language, and it was proposed in  that the quotient complexity of atoms should be considered as a complexity measure of regular languages.
A nondeterministic finite automaton (NFA) is a quintuple , where , and are as in a DFA, , and is the set of initial states. Each triple with , is a transition if . A sequence of transitions, where for is a path in . The word is the word spelled by the path. A word is accepted by if there exists a path with and that spells .
Recall that we have defined the quotient DFA of a regular language using its quotients as states. In an analogous way, we define an NFA called the átomaton of using atoms as states. (Footnote: The accent is added to indicate that the word should be pronounced with the stress on the first syllable, and also to avoid confusion between automaton and atomaton.) The átomaton of is an NFA , where is the set of atoms of ; is the transition function defined by if ; is the set of initial atoms, those atoms in which appears uncomplemented; and is the final atom: the only atom containing . In the átomaton, the right language of state is the atom .
We denote by the reverse of the language . Let be the NFA operation that interchanges the sets of initial and final states and reverses all transitions. Let be the NFA operation that determinizes a given NFA using the subset construction and taking into account only the subsets reachable from the set of initial states. Finally, let be the minimization operation of DFAs. These operations are applied from left to right; thus in the NFA is first reversed, then determinized, then minimized and then reversed again.
The átomaton has the following remarkable properties:
Theorem 5.1 (Átomaton )
Let be a regular language, let be its minimal DFA, and let be its átomaton. Then
is isomorphic to .
is isomorphic to the quotient DFA of .
is isomorphic to .
For any NFA accepting , is isomorphic to .
is isomorphic to if and only if is bideterministic.
A minimal DFA is bideterministic if its reverse is also a DFA. A language is bideterministic if its quotient DFA is bideterministic.
The quotient complexity of atoms of was computed in  using the átomaton. To find the complexity of atom , the átomaton started in state was converted to an equivalent DFA by the subset construction. A more direct and simpler method was used in  where the DFA accepting an atom of a given language is constructed directly from the DFA of the language.
It was shown in  that the language of Fig. 1 has atoms , and each such atom meets the upper bound for the quotient complexity. On the other hand, the language has atoms: . Therefore has only atoms, and its most complex atom has complexity . Hence atom complexity does distinguish well between and . More will be said about atom complexity later.
The following property of the quotient complexity of atoms was proved by Diekert and Walter . Let be a language of quotient complexity , and let be the maximal quotient complexity of its atoms. Then approaches 3 as approaches infinity.
6 Quotient Complexity of Operations
Many software systems have the capability of performing operations on regular languages represented by DFAs. For such systems it is necessary to know the maximal size of the result of the operation, to have some idea how long the computation will take and how much memory will be required. A lower bound on these time and space complexities is provided by the quotient/state complexity of the result of the operation. For example, suppose we need to reverse a language $L$ whose minimal DFA has $n$ states. We apply the reversal operation to this DFA and then use the subset construction to determinize the resulting NFA. Since there are at most $2^n$ reachable subsets, we know that $2^n$ is an upper bound on the state complexity of reversal. Because we know that this bound can be reached, $2^n$ is a lower bound on the time and space complexities of reversal.
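The reverse-then-determinize procedure can be sketched as follows. The function counts the subsets reached when the reversed DFA is determinized; this is an upper bound on the complexity of the reverse in general, and when every state of the input DFA is reachable these subsets are known to be pairwise inequivalent, so the count is exact. The encoding (letters as tuples over states 0..n-1) and the three-state example are assumptions made for illustration.

```python
def reverse_complexity(n, letters, finals):
    """Count the subsets reachable in the determinized reverse of a DFA."""
    start = frozenset(finals)  # final states of the DFA become initial
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for g in letters.values():
            # Reversed transition: every state that the letter sends into s.
            pre = frozenset(q for q in range(n) if g[q] in s)
            if pre not in seen:
                seen.add(pre)
                todo.append(pre)
    return len(seen)


# Hypothetical 3-state example: a is a 3-cycle, b swaps 0 and 1,
# c sends state 2 to 0; state 2 is final.
EXAMPLE_LETTERS = {"a": (1, 2, 0), "b": (1, 0, 2), "c": (0, 1, 0)}
```

For the example, all eight subsets of the three states are reached, while a DFA with only the cyclic letter reaches three.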
From now on we denote a language of complexity by , and a DFA with states, by . In general, the complexity of a regularity-preserving unary operation on regular languages is the maximal value of as a function of , where varies over all regular languages with complexity . To show that the bound is tight we need to exhibit a sequence , called a stream, of languages that meet this bound. The stream does not necessarily start from 1, because the bound may not be reachable for small values of . In the case of reversal, the stream of Fig. 1 happens to meet the bound for .
In the case of star, Maslov  stated without proof that the tight upper bound for its complexity is . A proof was provided by Yu, Zhuang and Salomaa . This bound is met by the DFA of Fig. 1 for .
Next consider the product of two languages and . Maslov stated without proof that the tight upper bound for product is , and that this bound can be met. Yu, Zhuang and Salomaa  showed that there always exists a DFA with at most states that accepts , and proved that the bound can be met. This bound is also met by and of Fig. 1 for .
In general, the complexity of a regularity-preserving binary operation on regular languages of complexities and , respectively, is the maximal value of the result of the operation as a function of and , where the operands vary over all regular languages of complexities and , respectively. Thus we need two families and of languages meeting this bound; the notation and implies that and depend on both and . Two such examples are known : the union and intersection of finite languages require such witnesses. However, in all other cases studied in the literature, it is enough to use witness streams and , where is independent of and is independent of .
So far we have seen that the stream of Fig. 1 meets the upper bounds for syntactic complexity, quotients, atoms, reversal, star, and product. The situation is a little different for union (and other binary boolean operations). Since can have at most quotients, we have an upper bound. Moreover, for , we know  that the complexity of , where these languages are defined in Fig. 1, does meet the bound . But because , the complexity of union for the languages of Fig. 1 is instead of . So the same stream cannot be used for both arguments. However, it is possible to use a stream that “differs only slightly” from of Fig. 1.
The notion “differs only slightly” is defined as follows [8, 14, 26]. Let be an alphabet ordered as shown; if , we denote it by to stress its dependence on . A dialect of is a language related to and obtained by replacing or deleting letters of in the words of . More precisely, for an alphabet and a partial map , we obtain a dialect of by replacing each letter by in every word of , or deleting the word entirely if is undefined. We write to denote the dialect of given by , and we denote undefined values of by “”. For example, if then its dialect is the language . Undefined values for letters at the end of the alphabet are omitted; thus, for example, if , , , and , we write for .
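For a finite language the dialect construction can be sketched directly. The function name and the example language are hypothetical; a letter absent from the dictionary `pi` is treated as undefined.

```python
def dialect(words, pi):
    """Apply a partial letter map pi to every word of a finite language.

    Words containing a letter on which pi is undefined are deleted.
    """
    result = set()
    for w in words:
        if all(ch in pi for ch in w):
            result.add("".join(pi[ch] for ch in w))
    return result
```

For instance, with the hypothetical language {a, ab, ac} and the map sending a to b and c to a (b undefined), `dialect({"a", "ab", "ac"}, {"a": "b", "c": "a"})` keeps the images of a and ac and drops ab.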
In general, for any binary boolean operation on languages and with quotient DFAs and , to find we use the direct product of and and assign final states in the direct product according to the operation . This gives an upper bound of for all the operations. If we know that the bound is met by , we also know that the intersection meets that bound, because for all ; similarly, the difference meets that bound. It is also known that there are witnesses and such that the symmetric difference meets the bound . A binary boolean function is proper if it depends on both of its arguments. There are six more proper boolean functions: , , , , , and . Thus witnesses for these six functions can be found using the witnesses for union and symmetric difference and their complements.
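The direct product construction can be sketched as follows: build the reachable part of the product, mark final states according to the boolean operation, and count the inequivalent states. The encoding of a DFA as a triple `(delta, start, finals)` and the two counter automata are assumptions made for illustration, not witnesses from the literature.

```python
def boolean_complexity(alphabet, dfa1, dfa2, op):
    """Complexity of op(L1, L2), computed on the direct product of two DFAs."""
    (d1, s1, f1), (d2, s2, f2) = dfa1, dfa2
    start = (s1, s2)
    reach, todo = {start}, [start]
    while todo:
        p, q = todo.pop()
        for a in alphabet:
            nxt = (d1[p][a], d2[q][a])
            if nxt not in reach:
                reach.add(nxt)
                todo.append(nxt)
    # A product state is final according to the boolean operation.
    block = {s: int(op(s[0] in f1, s[1] in f2)) for s in reach}
    while True:  # Moore-style refinement until no block splits
        ids, new_block = {}, {}
        for (p, q) in reach:
            sig = (block[(p, q)],) + tuple(
                block[(d1[p][a], d2[q][a])] for a in alphabet)
            new_block[(p, q)] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            return len(ids)
        block = new_block


# Hypothetical operands: even number of a's (2 states),
# number of b's divisible by 3 (3 states).
EVEN_A = ({0: {"a": 1, "b": 0}, 1: {"a": 0, "b": 1}}, 0, {0})
B_MOD_3 = ({j: {"a": j, "b": (j + 1) % 3} for j in range(3)}, 0, {0})
```

For these operands both union and intersection use all six states of the product.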
Our discussion so far, as well as all the literature prior to 2016, used witnesses restricted to the same alphabet. However it is also useful to perform binary operations on languages over different alphabets, for example: or . The unrestricted complexity of binary operations was first studied in . In the case of union and symmetric difference of and , the result is a language over the alphabet . To compute the complexity of , if does not have an empty quotient, we add an empty state to and send all transitions under letters from to that state. Similarly, we add an empty state if needed to and send all transitions under letters from to that state. Thus we have now two languages over the alphabet , and we proceed as in the restricted case over the larger alphabet. It turns out that the complexity of union and symmetric difference is .
For difference and intersection, is still an upper bound on their complexity. However, the alphabet of is and the complexity turns out to be for the difference operation. Similarly, the alphabet of is , and the complexity of intersection is , as in the restricted case. The complexity of any other binary boolean operation can be determined from the complexities of union, intersection, difference and symmetric difference; however, the complexity of may differ by 1 from the complexity of . For more details see .
7 Complexity Measures
The size of the syntactic semigroup of .
The complexity of the quotients of .
The number of atoms of .
The complexity of the atoms of .
The complexity of the reverse of .
The complexity of , the star of .
The restricted and unrestricted complexities of the product .
The restricted and unrestricted complexities of boolean operations .
These measures are not all independent: the relations described below are known.
Theorem 7.1 (Semigroup and Reversal )
Let be a minimal DFA with states accepting a language . If the transition semigroup of has elements, then the complexity of is .
Theorem 7.2 (Number of Atoms and Reversal )
The number of atoms of a regular language is equal to the complexity of .
Before discussing the next relationships we need to introduce certain concepts from group theory. If is a permutation group, is transitive on a set if for all , there exists such that . Also, is -set-transitive if it is transitive on the set of -subsets of , that is, if for all such that , there exists such that . If has degree and is -set-transitive for , then is set-transitive.
Set transitive groups have been characterized as follows:
Theorem 7.3 (Set Transitive Groups )
A set-transitive permutation group of degree is or or a conjugate of one of the following permutation groups:
For , the affine general linear group .
For , the projective general linear group .
For , the projective special linear group .
For , the projective semilinear group .
We say L is maximally atomic if it has the maximal number of atoms, and each of those atoms has the maximal possible complexity. The rank of a transformation is the cardinality of . The next result characterizes maximally atomic languages.
Theorem 7.4 (Maximally Atomic Languages )
Let be a regular language over with complexity , and let be the transition semigroup of the minimal DFA of . Then is maximally atomic if and only if the subgroup of permutations in is set-transitive, and contains a transformation of rank .
Define the following classes of languages:
FTS - languages whose minimal DFAs have the full transformation semigroup of elements.
STS - languages whose minimal DFAs have transition semigroups with a set-transitive subgroup of permutations and a transformation of rank .
MAL - maximally atomic languages.
MNA - languages with the maximal number of atoms.
MCR - languages with a maximally complex reverse.
The known relations among the various complexity measures are thus as follows:
$\mathrm{FTS} \subset \mathrm{STS} = \mathrm{MAL} \subset \mathrm{MNA} = \mathrm{MCR}$
8 Most Complex Regular Language Streams
We now exhibit a regular language stream that, together with some dialects, meets the upper bounds for all complexity measures we have discussed so far [8, 24]. In this sense this is a most complex regular language stream or a universal witness stream. This stream differs from the stream of Fig. 1 only by the identity input .
For , let , where , and is defined by the transformations , , , and . Let be the language accepted by .
Theorem 8.1 (Most Complex Regular Languages)
For each , the DFA of Definition 1 is minimal and its language has complexity . The stream with some dialect streams is most complex in the class of regular languages. In particular, it meets all the complexity bounds below, which are maximal for regular languages. In several cases the bounds can be met with a reduced alphabet.
The syntactic semigroup of has cardinality , and at least three letters are required to meet this bound.
Each quotient of has complexity .
The reverse of has complexity , and has atoms.
For each atom of , the complexity satisfies: ; , if .
The star of has complexity .
Restricted: For any proper binary boolean operation , .
Unrestricted: if , , and .
At least four letters are necessary for unrestricted operations .
In the stream above we have used a “master language” of Definition 1 with four letters, and dialects that use the same alphabet as the master language. The stream below uses only three letters in the master language of Definition 2, but then adds an extra letter in a dialect.
For , let , where , and is defined by the transformations , , and . Let be the language accepted by . The structure of is shown in Fig. 2.
The properties of are the same as those in Theorem 8.1 except for the following:
The bound for the restricted product is met by .
The bound for the unrestricted product is met by .
The bound for the unrestricted union and symmetric difference is met by .
The bound for the unrestricted difference is met by .
The most complex streams introduced in this section will be used in several subclasses of regular languages.
9 Most Complex Languages in Subclasses
Many interesting proper subclasses of the class of regular languages can be defined using the notion of convexity. Convex languages were introduced in 1973 by Thierrin  and revisited in 2009 by Ang and Brzozowski .
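Convexity can be defined with respect to any binary relation on $\Sigma^*$. Let $\trianglelefteq$ be such a binary relation; if $u, v \in \Sigma^*$ and the pair $(u, v)$ belongs to the relation, we write $u \trianglelefteq v$. Let $\trianglerighteq$ be the converse binary relation, that is, let $u \trianglerighteq v$ if and only if $v \trianglelefteq u$. A language $L$ is $\trianglelefteq$-convex if $u \trianglelefteq v$, $v \trianglelefteq w$, and $u, w \in L$ imply $v \in L$. It is $\trianglelefteq$-free if $u \trianglelefteq v$, $u \neq v$, and $v \in L$ imply $u \notin L$. It is $\trianglelefteq$-closed if $u \trianglelefteq v$ and $v \in L$ imply $u \in L$. It is $\trianglerighteq$-closed if $u \trianglerighteq v$ and $v \in L$ imply $u \in L$. Languages that are $\trianglerighteq$-closed are also called $\trianglelefteq$-converse-closed. One verifies that a language is $\trianglelefteq$-closed if and only if its complement is $\trianglerighteq$-closed.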
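These definitions can be checked directly on finite languages. The sketch below specializes them to the prefix relation; the function names are hypothetical, and the checks apply only to finite sets of words (so they cannot detect right ideals, which are infinite when non-empty).

```python
def prefixes(w):
    """All prefixes of w, from the empty word up to w itself."""
    return [w[:i] for i in range(len(w) + 1)]

def is_prefix_closed(lang):
    # Every prefix of every word of the language is in the language.
    return all(u in lang for w in lang for u in prefixes(w))

def is_prefix_free(lang):
    # No proper prefix of a word of the language is in the language.
    return not any(u in lang for w in lang for u in prefixes(w)[:-1])

def is_prefix_convex(lang):
    # If u is a prefix of v, v is a prefix of w, and u, w are in the
    # language, then v must be in the language as well.
    for w in lang:
        ps = prefixes(w)
        for i, u in enumerate(ps):
            if u in lang and any(v not in lang for v in ps[i:]):
                return False
    return True
```

Prefix-closed and prefix-free languages are both prefix-convex, while a language such as {ab, abc} is prefix-convex without being closed or free.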
If , where , then is a prefix of , is a factor of , and is a suffix of . Note that a prefix or a suffix is also a factor. If , where , and , then is a subword of ; note that every factor of is a subword333The word “subword” is often used to mean “factor”; here by a “subword” we mean a subsequence. of .
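The shuffle of words $u, v \in \Sigma^*$ is defined as follows:
$$u \shuffle v = \{u_1 v_1 \cdots u_k v_k \mid u = u_1 \cdots u_k,\ v = v_1 \cdots v_k,\ u_1, \dots, u_k, v_1, \dots, v_k \in \Sigma^*\}.$$
The shuffle of two languages $K$ and $L$ over $\Sigma$ is defined by
$$K \shuffle L = \bigcup_{u \in K,\ v \in L} u \shuffle v.$$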
Note that the shuffle operation is commutative on both words and languages.
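The shuffle of two words can be computed recursively: the first letter of any interleaving comes from the front of one of the two words. The following sketch, with hypothetical function names, handles words and finite languages.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def shuffle(u, v):
    """Set of all interleavings of the words u and v."""
    if not u:
        return frozenset({v})
    if not v:
        return frozenset({u})
    # The first letter is taken either from u or from v.
    return frozenset(
        {u[0] + w for w in shuffle(u[1:], v)}
        | {v[0] + w for w in shuffle(u, v[1:])}
    )

def shuffle_languages(K, L):
    """Union of the shuffles of all pairs of words from finite K and L."""
    return set().union(*(shuffle(u, v) for u in K for v in L))
```

For example, `shuffle("ab", "c")` contains the three words abc, acb and cab, and swapping the arguments gives the same set.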
Here we consider only four binary relations for defining convexity: “is a prefix of”, “is a suffix of”, “is a factor of”, and “is a subword of”. Each of these four relations is a partial order on and leads to four classes of languages; we illustrate this using the prefix relation:
A language that is prefix-converse-closed is a right ideal, that is, it satisfies the equation .
A language that is prefix-closed is the complement of a right ideal.
A language that is prefix-free and not is a prefix-code .
A language is proper prefix-convex if it is not a right ideal and is neither closed nor free.
Similarly, we define suffix-converse-closed languages which are left ideals (satisfy ), suffix-closed, suffix-free (suffix codes ), and proper suffix-convex languages, two-sided ideals (that satisfy ), factor-closed, factor-free (infix codes ), and proper factor-convex languages, and also subword-converse-closed languages which are all-sided ideals (that satisfy ), subword-closed, subword-free (hypercodes ), and proper subword-convex languages.
Decision problems for convex languages were studied in . We can decide in time if a given regular language over a fixed alphabet accepted by a DFA with states is prefix-, suffix-, factor-, and subword-convex. We can decide in time if is prefix-free, left ideal, suffix-closed, suffix-free, two-sided ideal, factor-closed, factor-free, all-sided ideal, subword-closed, subword-free. We can decide in time if is a right ideal or a prefix-closed language.
We now consider the complexity properties of some convex languages.
9.1 Prefix-Convex Languages
RIGHT IDEALS The complexity of right ideals was studied as follows: complexities of common operations using various witnesses , semigroup size , complexities of atoms , most complex right ideals with restricted operations , most complex right ideals with restricted and unrestricted operations and four-letter witnesses , most complex right ideals with restricted and unrestricted operations and five-letter witnesses . Here we use the witnesses from .
For , let , where and is defined by , , , and . This DFA uses the structure of Fig. 2 for the states in and letters in . Let be the language of .
Theorem 9.1 (Most Complex Right Ideals)
For each , the DFA of Definition 3 is minimal and is a right ideal of complexity . The stream with some dialect streams is most complex in the class of right ideals. It meets the following bounds: 1. Semigroup size: . 2. Quotient complexities: , except . 3. Reversal: . 4. Atom complexities: ; , if . 5. Star: . 6. (a) Restricted product: ; (b) Unrestricted product: . 7. (a) Restricted boolean operations: if , if , and if . (b) Unrestricted boolean operations: same as regular languages. At least four letters are required to meet all these bounds .
PREFIX-CLOSED LANGUAGES The complexities of common operations on prefix-closed languages using various witnesses were studied in [17, 43]. Most complex prefix-closed languages were examined in . As every prefix-closed language has an empty quotient, the restricted and unrestricted complexities are the same.
For , let , where , and is defined by , , , and . Let be the language of .
Theorem 9.2 (Most Complex Prefix-Closed Languages)
For , the DFA of Definition 4 is minimal and is a prefix-closed language of complexity . The stream with some dialect streams is most complex in the class of prefix-closed languages, and meets the following bounds: 1. Semigroup size: . 2. Quotient complexities: , except . 3. Reversal: . 4. Atom complexities: , if ; , if . 5. Star . 6. Product: . 7. Boolean operations: if , if , and if . At least four letters are required to meet all these bounds .
PREFIX-FREE LANGUAGES The complexities of operations on prefix-free languages with various witnesses were studied in [43, 47, 50]. The syntactic complexity bound of was established in . Most complex prefix-free languages were considered in . As every prefix-free language has an empty quotient, the restricted and unrestricted complexities are the same for binary operations.
For , let and let DFA be where is defined by , , , , for . The transformations induced by and coincide when . This DFA uses the structure of the DFA of Fig. 2 for the states in and letters in . Let be the language of .
Theorem 9.3 (Most Complex Prefix-Free Languages)
For , the DFA of Definition 5 is minimal and is a prefix-free language of complexity . The stream with some dialect streams is a most complex prefix-free language. At least inputs are required to meet all the bounds below : 1. Semigroup size: . 2. Quotient complexities: , except , . 3. Reversal: . 4. Atom complexities: , if ; , if ; , if ; , if . 5. Star: . 6. Product: . 7. Boolean operations: if , if , and if .
PROPER PREFIX-CONVEX LANGUAGES Proper prefix-convex languages were studied in . In contrast to the three special cases, they represent the full nature of prefix-convexity.
For , , let where