# Unrestricted State Complexity of Binary Operations on Regular and Ideal Languages

J. Brzozowski (brzozo@uwaterloo.ca), C. Sinnamon (sinncore@gmail.com)

David R. Cheriton School of Computer Science, University of Waterloo

###### Abstract

We study the state complexity of binary operations on regular languages over different alphabets. It is known that if $L'_m$ and $L_n$ are languages of state complexities $m$ and $n$, respectively, and restricted to the same alphabet, the state complexity of any binary boolean operation on $L'_m$ and $L_n$ is $mn$, and that of product (concatenation) is $(m-1)2^n+2^{n-1}$. In contrast to this, we show that if $L'_m$ and $L_n$ are over different alphabets, the state complexity of union and symmetric difference is $(m+1)(n+1)$, that of difference is $mn+m$, that of intersection is $mn$, and that of product is $m2^n+2^{n-1}$. We also study unrestricted complexity of binary operations in the classes of regular right, left, and two-sided ideals, and derive tight upper bounds. The bounds for product of the unrestricted cases (with the bounds for the restricted cases in parentheses) are as follows: right ideals $m+2^{n-2}+2^{n-1}+1$ ($m+2^{n-2}$); left ideals $mn+m+n$ ($m+n-1$); two-sided ideals $m+2n$ ($m+n-1$). The state complexities of boolean operations on all three types of ideals are the same as those of arbitrary regular languages, whereas that is not the case if the alphabets of the arguments are the same. Finally, we update the known results about most complex regular, right-ideal, left-ideal, and two-sided-ideal languages to include the unrestricted cases.

Keywords: boolean operation, concatenation, different alphabets, left ideal, most complex language, product, quotient complexity, regular language, right ideal, state complexity, stream, two-sided ideal, unrestricted complexity

Authors: Janusz Brzozowski, Corwin Sinnamon

This work was supported by the Natural Sciences and Engineering Research Council of Canada grant No. OGP0000871.

## 1 Motivation

Formal definitions are postponed until Section 2.

The first comprehensive paper on state complexity was published in 1970 by A. N. Maslov [20], but this work was unknown in the West for many years. Maslov wrote:

An important measure of the complexity of [sets of words representable in finite automata] is the number of states in the minimal representing automaton. … if are representable in automata and with and states respectively …, then:

1. is representable in an automaton with states;

2. is representable in an automaton with states.

In this formulation these statements are false: we will show that union may require $(m+1)(n+1)$ states and product (concatenation), $m2^n+2^{n-1}$ states. However, Maslov must have had in mind languages over the same alphabet, in which case the statements are correct.

The second comprehensive paper on state complexity was published by S. Yu, Q. Zhuang and K. Salomaa [24] in 1994. Here the authors wrote:

1. … for any pair of complete -state DFA and -state DFA defined on the same alphabet , there exists a DFA with at most states which accepts .

2. states are … sufficient … for a DFA to accept the intersection (union) of an -state DFA language and an -state DFA language.

The first statement includes the same-alphabet restriction, but the second omits it (presumably it is implied by the context). Here DFA stands for deterministic finite automaton, and complete means that there is a transition from every state under every input letter.

After these two papers appeared, many authors studied the state complexity of various operations in various classes of regular languages, always using witnesses restricted to the same alphabet. However, we point out that the same-alphabet restriction is unnecessary: there is no reason why we should not compute the union or product of two languages over different alphabets. In fact, the software package Grail (http://www.csit.upei.ca/theory/), for instance, allows the user to calculate the result of these operations.

As an example, let us consider the union of languages $L'_2$ and $L_2$ accepted by the minimal complete two-state automata $\mathcal{D}'_2$ and $\mathcal{D}_2$ of Figure 1, where an incoming arrow denotes the initial state and a double circle represents a final state.

The union $L$ of $L'_2$ and $L_2$ is a language over three letters. To find the DFA for $L$, we view $\mathcal{D}'_2$ and $\mathcal{D}_2$ as incomplete DFAs, the first missing all transitions under the letter that occurs only in $L_2$, and the second, under the letter that occurs only in $L'_2$. After adding the missing transitions we obtain the two DFAs shown in Figure 2. Now we can proceed as is usually done in the same-alphabet approach, and use the direct product of these two DFAs to find $L$. Here it turns out that six states are necessary to represent $L$, but the state complexity of union is actually $(m+1)(n+1)$, which is nine for $m=n=2$.

In general, when calculating the result of a binary operation on regular languages with different alphabets, we deal with special incomplete DFAs that are only missing some letters and all the transitions caused by these letters. The complexity of incomplete DFAs has been studied previously by Gao, K. Salomaa, and Yu [15] and by Maia, Moreira and Reis [19]. However, the objects studied there are arbitrary incomplete DFAs, whereas we are interested only in complete DFAs with some missing letters. Secondly, we study state complexity, whereas the above-mentioned papers deal mainly with transition complexity. Nevertheless, there is some overlap. It was shown in [15, Corollary 3.2] that the incomplete state complexity of union is less than or equal to $mn+m+n$, and that this bound is tight in some special cases. In [19, Theorem 2], witnesses that work in all cases were found. These complexities correspond to our result for union in Theorem 3.2. Also in [19, Theorem 5], the incomplete state complexity of product is shown to be $m2^n+2^{n-1}-1$, and this corresponds to our result for product in Theorem 3.1.

In this paper we remove the restriction of equal alphabets of the two operands. We prove that, if each language's own alphabet is used, the complexity of union and symmetric difference is $(m+1)(n+1)$, that of difference is $mn+m$, that of intersection is $mn$, and that of product is $m2^n+2^{n-1}$. We exhibit a new most complex regular language that meets the complexity bounds for restricted and unrestricted boolean operations, restricted and unrestricted products, star, and reversal, and that has a maximal syntactic semigroup and most complex atoms. All the witnesses used here are derived from that one most complex language.

A much shorter version of this paper appeared in [5]. That paper dealt only with unrestricted product and binary boolean operations on regular languages. Here we include a shorter proof of the theorem about unrestricted product of regular languages, and establish the unrestricted complexities of product and binary boolean operations on right, left and two-sided ideals.

## 2 Terminology and Notation

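If $\Sigma$ is a finite alphabet and $L \subseteq \Sigma^*$, the alphabet of $L$ is the set $\{a \in \Sigma \mid uav \in L \text{ for some } u, v \in \Sigma^*\}$. A basic complexity measure of $L$ with alphabet $\Sigma$ is the number $n$ of distinct (left) quotients of $L$ by words in $\Sigma^*$, where a (left) quotient of $L$ by a word $w \in \Sigma^*$ is $w^{-1}L = \{x \mid wx \in L\}$. The number of quotients of $L$ is its quotient complexity [3], $\kappa(L)$.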

Unless otherwise specified, for a regular language with alphabet we define the complement of by . With this definition it is not always true that as and may have different alphabets. For example, if , then while , for the alphabet of is instead of . There is only one way for this to occur: In order for their alphabets to be different, there must be a letter in the alphabet of such that every word containing the letter is in the language, so that the letter is not present in . Hence we have , and it is usually easy to determine the complexity of when presented with a specific language .

Let $L$ be a regular language with quotient complexity $n$, let $\circ$ be a unary operation on languages, and let $L^\circ$ be the result of the operation. The quotient complexity of the operation $\circ$ is the maximal value of $\kappa(L^\circ)$ as a function of $n$, as $L$ ranges over all regular languages with quotient complexity $n$.

Let $L'_m$ and $L_n$ be regular languages of quotient complexities $m$ and $n$ that have alphabets $\Sigma'$ and $\Sigma$, respectively, let $\circ$ be a binary operation on languages, and let $L'_m \circ L_n$ be the result of the operation. The quotient complexity of $\circ$ is the maximal value of $\kappa(L'_m \circ L_n)$ as a function of $m$ and $n$, as $L'_m$ and $L_n$ range over all regular languages of quotient complexities $m$ and $n$, respectively.

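A deterministic finite automaton (DFA) is a quintuple $\mathcal{D} = (Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite non-empty set of states, $\Sigma$ is a finite non-empty alphabet, $\delta\colon Q \times \Sigma \to Q$ is the transition function, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states. We extend $\delta$ to a function $\delta\colon Q \times \Sigma^* \to Q$ as usual. A DFA $\mathcal{D}$ accepts a word $w \in \Sigma^*$ if $\delta(q_0, w) \in F$. The language accepted by $\mathcal{D}$ is denoted by $L(\mathcal{D})$. If $q$ is a state of $\mathcal{D}$, then the language $L^q$ of $q$ is the language accepted by the DFA $(Q, \Sigma, \delta, q, F)$. A state is empty (or dead or a sink state) if its language is empty. Two states $p$ and $q$ of $\mathcal{D}$ are equivalent if $L^p = L^q$. A state $q$ is reachable if there exists $w \in \Sigma^*$ such that $\delta(q_0, w) = q$. A DFA is minimal if all of its states are reachable and no two states are equivalent. Usually DFAs are used to establish upper bounds on the complexity of operations, and also as witnesses that meet these bounds.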

The state complexity [24] of a regular language is the number of states in a complete minimal DFA with alphabet which recognizes the language. This concept is equivalent to quotient complexity of . For example, the state complexity of the language is one. There is a two-state minimal DFA with alphabet accepting , but its alphabet is not .

Since we do not use any other measures of complexity in this paper (with the exception of one mention of time and space complexity in this paragraph), we refer to quotient/state complexity simply as complexity. The quotient/state complexity of an operation gives a worst-case lower bound on the time and space complexities of the operation. For this reason it has been studied extensively; see [3, 4, 23, 24] for additional references.

If $\delta(q, a) = p$ for a state $q \in Q$ and a letter $a \in \Sigma$, we say there is a transition under $a$ from $q$ to $p$ in $\mathcal{D}$. The DFAs defined above are complete in the sense that there is exactly one transition for each state $q \in Q$ and each letter $a \in \Sigma$. If there is at most one transition for each $q$ and $a$, the automaton is an incomplete DFA.

A nondeterministic finite automaton (NFA) is a 5-tuple $\mathcal{N} = (Q, \Sigma, \delta, I, F)$, where $Q$, $\Sigma$ and $F$ are defined as in a DFA, $\delta\colon Q \times \Sigma \to 2^Q$ is the transition function, and $I \subseteq Q$ is the set of initial states. An $\varepsilon$-NFA is an NFA in which transitions under the empty word $\varepsilon$ are also permitted.

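To simplify the notation, without loss of generality we use $Q_n = \{0, \dots, n-1\}$ as our basic set of $n$ elements. A transformation of $Q_n$ is a mapping $t\colon Q_n \to Q_n$. The image of $q \in Q_n$ under $t$ is denoted by $qt$. For $k \ge 2$, a transformation (permutation) $t$ of a set $P = \{q_0, q_1, \dots, q_{k-1}\} \subseteq Q_n$ is a $k$-cycle if $q_0t = q_1, q_1t = q_2, \dots, q_{k-2}t = q_{k-1}, q_{k-1}t = q_0$. This $k$-cycle is denoted by $(q_0, q_1, \dots, q_{k-1})$, and acts as the identity on the states in $Q_n \setminus P$. A 2-cycle $(q_0, q_1)$ is called a transposition. A transformation that changes only one state $p$ to a state $q \neq p$ and acts as the identity for the other states is denoted by $(p \to q)$. The identity transformation is denoted by $\mathbb{1}$. If $s, t$ are transformations of $Q_n$, their composition when applied to $q \in Q_n$ is defined by $q(st) = (qs)t$. The set of all transformations of $Q_n$ is a monoid under composition.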
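The transformation notation can be made concrete with a small Python sketch. It assumes that a transformation of the state set $\{0, \dots, n-1\}$ is stored as a tuple whose entry at index $q$ is the image of $q$, and that composition is applied left to right (first $s$, then $t$), matching the image-on-the-right convention used here. The helper names `identity`, `cycle`, `unitary` and `compose` are hypothetical and chosen only for this illustration.

```python
def identity(n):
    """Identity transformation on {0, ..., n-1}."""
    return tuple(range(n))


def cycle(n, *states):
    """k-cycle (q0, q1, ..., q_{k-1}); acts as the identity on all other states."""
    t = list(range(n))
    for i, q in enumerate(states):
        t[q] = states[(i + 1) % len(states)]
    return tuple(t)


def unitary(n, p, q):
    """Transformation (p -> q): sends p to q and fixes every other state."""
    t = list(range(n))
    t[p] = q
    return tuple(t)


def compose(s, t):
    """Composition st, applied left to right: q(st) = (qs)t."""
    return tuple(t[s[q]] for q in range(len(s)))


# A transposition swaps two states and fixes the rest.
swap01 = cycle(4, 0, 1)          # (1, 0, 2, 3)
# First apply the 3-cycle (0,1,2), then (2 -> 0).
st = compose(cycle(3, 0, 1, 2), unitary(3, 2, 0))   # (1, 0, 0)
```

Composing with `identity(n)` on either side leaves a transformation unchanged, which is the monoid identity mentioned above.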

We use $Q_n$ as the set of states of every DFA with $n$ states, and 0 as the initial state. In any DFA $\mathcal{D}_n = (Q_n, \Sigma, \delta, 0, F)$ each $a \in \Sigma$ induces a transformation $\delta_a$ of $Q_n$ defined by $q\delta_a = \delta(q, a)$; we denote this by $a\colon \delta_a$. For example, when defining the transition function of a DFA, we write $a\colon (0,1)$ to mean that $\delta(q, a) = q(0,1)$, where the transformation $(0,1)$ acts on state $q$ as follows: if $q$ is 0 it maps it to 1, if $q$ is 1 it maps it to 0, and it acts as the identity on the remaining states.

By a slight abuse of notation we use the letter to denote the transformation it induces; thus we write instead of . We extend the notation to sets of states: if , then . We also find it convenient to write to indicate that the image of under is .

We extend these notions to arbitrary words. For each word , the transition function induces a transformation of by : for all , The set of all such transformations by non-empty words is the transition semigroup of under composition [22].

The Myhill congruence  [21] (also known as the syntactic congruence) of a language is defined on as follows: For if and only if for all The quotient set of equivalence classes of is a semigroup, the syntactic semigroup of .

If is a minimal DFA of , then is isomorphic to the syntactic semigroup of  [22], and we represent elements of by transformations in . The size of this semigroup has been used as a measure of complexity [4, 13, 16, 18].

The atom congruence is a left congruence defined as follows: two words and are equivalent if if and only if for all . Thus and are equivalent if if and only if . An equivalence class of this relation is called an atom of  [12, 17]. It follows that an atom is a non-empty intersection of complemented and uncomplemented quotients of . The number of atoms and their quotient complexities are possible measures of complexity of regular languages [4]. For more information about atoms and their complexity, see [11, 12, 17].

A sequence , of regular languages is called a stream; here is usually some small integer, and the languages in the stream usually have the same form and differ only in the parameter . For example, is a stream. To find the complexity of a binary operation we need to find an upper bound on this complexity and two streams and of languages meeting this bound. In general, the two streams are different, but there are many examples where “differs only slightly” from ; such a language is called a dialect [4] of , and is defined below.

Let be an alphabet; we assume that its elements are ordered as shown. Let be a partial permutation of , that is, a partial function where , for which there exists such that is bijective when restricted to and undefined on . We denote undefined values of by “”, that is, we write , if is undefined at .

If , we denote it by to stress its dependence on . If is a partial permutation, let be the language obtained from by the substitution defined as follows: for , if is defined, and otherwise. The permutational dialect, or simply dialect, of defined by is the language .

Similarly, let be a DFA; we denote it by to stress its dependence on . If is a partial permutation, then the permutational dialect, or simply dialect, of is obtained by changing the alphabet of from to , and modifying so that in the modified DFA induces the transformation induced by in the original DFA. One verifies that if the language is accepted by DFA , then is accepted by .

If the letters for which is undefined are at the end of the alphabet , then they are omitted. For example, if and , , and , then we write for , etc.

A most complex stream of regular languages is one that, together with some dialect streams, meets the complexity bounds for all boolean operations, product, star, and reversal, and has the largest syntactic semigroup and most complex atoms. In looking for a most complex stream we try to use the smallest possible alphabet sufficient to meet all the bounds. Most complex streams are useful in systems dealing with regular languages and finite automata. One would like to know the maximal sizes of automata that can be handled by the system. In view of the existence of most complex streams, one stream can be used to test all the operations.

## 3 Regular Languages

The DFA of Definition 3 will be used for both product and boolean operations on regular languages; this DFA is the 4-input DFA called in [4], where it was shown that is a “universal witness”, that is, is a most complex regular stream for all common restricted operations. We now prove that (renamed below), together with some of its permutational dialects, is most complex for both restricted and unrestricted operations.

{definition}

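For $n \ge 3$, let $\mathcal{D}_n = \mathcal{D}_n(a,b,c,d) = (Q_n, \Sigma, \delta_n, 0, \{n-1\})$, where $\Sigma = \{a,b,c,d\}$, and $\delta_n$ is defined by the transformations $a\colon (0, \dots, n-1)$, $b\colon (0,1)$, $c\colon (n-1 \to 0)$, and $d\colon \mathbb{1}$. Let $L_n = L_n(a,b,c,d)$ be the language accepted by $\mathcal{D}_n$. The structure of $\mathcal{D}_n(a,b,c,d)$ is shown in Figure 3.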

### 3.1 Product of Regular Languages

{theorem}

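[(Product of Regular Languages)] For $m, n \ge 3$, let $L'_m$ (respectively, $L_n$) be a regular language with $m$ (respectively, $n$) quotients over an alphabet $\Sigma'$ (respectively, $\Sigma$). Then $\kappa(L'_m L_n) \le m2^n + 2^{n-1}$, and this bound is met by $L'_m(a,b,-,c)$ and $L_n(b,a,-,d)$ of Definition 3.

{proof}

First we derive the upper bound. Let $\mathcal{D}'_m = (Q'_m, \Sigma', \delta', 0', F')$ and $\mathcal{D}_n = (Q_n, \Sigma, \delta, 0, F)$ be minimal DFAs of arbitrary regular languages $L'_m$ and $L_n$, respectively. We use the normal construction of an $\varepsilon$-NFA $\mathcal{N}$ to recognize $L'_m L_n$, by introducing an $\varepsilon$-transition from each final state of $\mathcal{D}'_m$ to the initial state of $\mathcal{D}_n$, and changing all final states of $\mathcal{D}'_m$ to non-final. This is illustrated in Figure 4, where $(m-1)'$ is the only final state of $\mathcal{D}'_m$. We then determinize $\mathcal{N}$ using the subset construction to get the DFA for $L'_m L_n$.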
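The construction in the proof can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: each DFA is a plain dictionary with hypothetical keys `alphabet`, `delta`, `init` and `finals`; a letter outside a DFA's own alphabet simply has no transition; and a subset state is stored as a pair `(p, S)`, where `p` is the single active state of the first DFA (or `None` once it has died) and `S` is the set of active states of the second. The function only enumerates reachable subsets, so it gives an upper bound on the complexity of the product; it does not test distinguishability.

```python
def product_reachable_subsets(A, B):
    """Reachable states of the subset construction for L(A)L(B).

    A and B may have different alphabets; a missing transition kills
    the corresponding component.
    """
    sigma = sorted(set(A["alphabet"]) | set(B["alphabet"]))

    def close(p, S):
        # epsilon-transition: a final state of A also activates B's initial state
        if p is not None and p in A["finals"]:
            S = S | {B["init"]}
        return (p, frozenset(S))

    start = close(A["init"], frozenset())
    seen = {start}
    stack = [start]
    while stack:
        p, S = stack.pop()
        for a in sigma:
            # A's component: follow the transition if it exists, otherwise drop it
            p2 = A["delta"].get((p, a)) if p is not None else None
            # B's component: keep only states that have a transition under a
            S2 = frozenset(B["delta"][(q, a)] for q in S if (q, a) in B["delta"])
            nxt = close(p2, S2)
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# One-state DFAs for a* over {a} and b* over {b}.
A = {"alphabet": "a", "delta": {(0, "a"): 0}, "init": 0, "finals": {0}}
B = {"alphabet": "b", "delta": {(0, "b"): 0}, "init": 0, "finals": {0}}
subsets = product_reachable_subsets(A, B)
print(len(subsets))  # 3
```

For this toy pair the three reachable subsets are the initial state containing both components, the state containing only the second DFA's state (reached by a letter outside the first alphabet), and the empty subset.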

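Suppose $\mathcal{D}'_m$ has $k$ final states, where $1 \le k \le m-1$. We will show that the subset automaton can have only the following types of states: (a) at most $(m-k)2^n$ states $\{p'\} \cup S$, where $p' \in Q'_m \setminus F'$ and $S \subseteq Q_n$, (b) at most $k2^{n-1}$ states $\{p', 0\} \cup S$, where $p' \in F'$ and $S \subseteq Q_n \setminus \{0\}$, and (c) at most $2^n$ states $S \subseteq Q_n$. Because $\mathcal{D}'_m$ is deterministic, there can be at most one state $p'$ of $Q'_m$ in any reachable subset. If $p' \notin F'$, it may be possible to reach any subset of states of $Q_n$ along with $p'$, and this accounts for (a). If $p' \in F'$, then the set must contain $0$ and possibly any subset of $Q_n \setminus \{0\}$, giving (b). It may also be possible to have any subset $S$ of $Q_n$ by applying an input that is not in $\Sigma'$ to $\{p'\} \cup S$ to get $S$, and so we have (c). Altogether, there are at most $(m-k)2^n + k2^{n-1} + 2^n$ reachable subsets. This expression reaches its maximum when $k = 1$, and so we have at most $m2^n + 2^{n-1}$ states in the subset automaton.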
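The counting step can be checked numerically. The sketch below assumes the three families of subsets have sizes at most $(m-k)2^n$, $k2^{n-1}$ and $2^n$, where $k$ is the number of final states of the first DFA; these are the quantities that sum to the bound $m2^n+2^{n-1}$ at $k=1$. The function names are hypothetical.

```python
def subset_count(m, n, k):
    """Upper bound on reachable subsets when the first DFA has k final states."""
    type_a = (m - k) * 2**n        # {p'} ∪ S, p' non-final, S any subset of Q_n
    type_b = k * 2**(n - 1)        # {p', 0} ∪ S, p' final, S a subset of Q_n \ {0}
    type_c = 2**n                  # S alone, reached by a letter outside the first alphabet
    return type_a + type_b + type_c


def product_bound(m, n):
    """Unrestricted bound for product: m * 2^n + 2^(n-1)."""
    return m * 2**n + 2**(n - 1)


# Each additional final state trades 2^n subsets for 2^(n-1), so the count
# decreases in k and is largest at k = 1.
for m in range(3, 7):
    for n in range(3, 7):
        counts = [subset_count(m, n, k) for k in range(1, m)]
        assert max(counts) == counts[0] == product_bound(m, n)
```

Compared with the same-alphabet bound $(m-1)2^n+2^{n-1}$, the only difference is the extra $2^n$ term of type (c).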

We prove that the bound is met by the witnesses of Figure 4. We use the following result to show that all the states in the subset construction are reachable.

Suppose is a minimal DFA of , , and is a minimal DFA of . Moreover, assume that the transition semigroups of and are groups.

{lemma}

[(Sylvie Davies, personal communication)] If all the sets of the form are reachable, then so are all sets of the form

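$$\{p'\} \cup S,\ p' \in Q'_m \setminus \{f'\},\ S \subseteq Q_n \quad\text{and}\quad \{f', 0\} \cup S,\ S \subseteq Q_n \setminus \{0\}. \qquad (*)$$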

We now prove that the conditions of the lemma apply to our case. The initial state in the subset automaton is , state is reached by if , and is reached by . Also, is reached by .

• If is odd, from we reach , , by words in .

• If is even, from we reach with odd by words in .

• From we reach by .

• From we reach with even by .

Since and are reachable, so are all the sets of form by the Lemma.

For distinguishability, note that only state accepts in . Hence, if two states of the product have different sets and and , then they can be distinguished by . State is distinguished from by . If , states and are distinguished as follows: Use to reach from and from . The reached states are distinguishable since they differ in their subsets of .

### 3.2 Boolean Operations on Regular Languages

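Suppose $L', L \subseteq U$, where $U$ is some universal set. A binary operation $\circ$ is boolean if, for any $w \in U$, whether $w$ is included in $L' \circ L$ depends only on the membership of $w$ in $L'$ and $L$. Thus there are sixteen binary boolean operations, corresponding to the number of ways of filling out the truth table below.

| $w \in L'$ | $w \in L$ | $w \in L' \circ L$ |
|---|---|---|
| T | T | - |
| T | F | - |
| F | T | - |
| F | F | - |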

A boolean operation is proper if it is not constant and does not depend on only one variable. There are ten proper boolean operations, given below.

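$$\begin{array}{lllll}
L'_m \cup L_n & \overline{L'_m} \cap \overline{L_n} & \overline{L'_m} \cup L_n & L'_m \cap \overline{L_n} = L'_m \setminus L_n & L'_m \oplus L_n \\
L'_m \cup \overline{L_n} & \overline{L'_m} \cap L_n = L_n \setminus L'_m & L'_m \oplus \overline{L_n} & \overline{L'_m} \cup \overline{L_n} & L'_m \cap L_n
\end{array}$$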

Although the complement of a regular language $L$ is usually taken with respect to $\Sigma^*$, where $\Sigma$ is the alphabet of $L$, the list above requires that $\overline{L}$ denotes the complement of $L$ in a specific universal set $U$ which contains both $L'_m$ and $L_n$. We wish for $U$ to be the set of all strings over some alphabet, and it is most natural to have $U = (\Sigma' \cup \Sigma)^*$, where $\Sigma' \cup \Sigma$ is the alphabet of $L'_m \cup L_n$. Thus, contrary to its usual meaning, every use of complement in the list of operations above is taken with respect to $(\Sigma' \cup \Sigma)^*$.

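We study the complexities of four proper boolean operations only: union ($\cup$), symmetric difference ($\oplus$), difference ($\setminus$), and intersection ($\cap$). From these four it is generally a straightforward exercise to deduce the complexity of any other operation: The complexity of $L_n \setminus L'_m$ is determined by symmetry with $L'_m \setminus L_n$, and from De Morgan's laws we have $\overline{L'_m} \cap \overline{L_n} = \overline{L'_m \cup L_n}$, $\overline{L'_m} \cup L_n = \overline{L'_m \setminus L_n}$, $L'_m \cup \overline{L_n} = \overline{L_n \setminus L'_m}$, $L'_m \oplus \overline{L_n} = \overline{L'_m \oplus L_n}$, and $\overline{L'_m} \cup \overline{L_n} = \overline{L'_m \cap L_n}$. As discussed in Terminology and Notation, $\kappa(L)$ and $\kappa(\overline{L})$ differ by at most 1 for any regular language $L$, and for any specific witness one can easily determine the discrepancy; for this reason we leave it as an exercise to verify that our witnesses meet the upper bounds for all ten proper operations based on the four operations that we address explicitly.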

It turns out that the witnesses that we used for unrestricted product also work for unrestricted boolean operations.

{theorem}

[(Boolean Operations on Regular Languages)] For $m, n \ge 3$, let $L'_m$ (respectively, $L_n$) be a regular language with $m$ (respectively, $n$) quotients over an alphabet $\Sigma'$ (respectively, $\Sigma$). Then the complexity of union and symmetric difference is $(m+1)(n+1)$ and this bound is met by $L'_m(a,b,-,c)$ and $L_n(b,a,-,d)$; the complexity of difference is $mn+m$, and this bound is met by $L'_m(a,b,-,c)$ and $L_n(b,a)$; the complexity of intersection is $mn$ and this bound is met by $L'_m(a,b)$ and $L_n(b,a)$.

{proof}

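Let $\mathcal{D}'_m = (Q'_m, \Sigma', \delta', 0', F')$ and $\mathcal{D}_n = (Q_n, \Sigma, \delta, 0, F)$ be minimal DFAs for arbitrary regular languages $L'_m$ and $L_n$ with $m$ and $n$ quotients, respectively. To calculate an upper bound for the boolean operations assume that $\Sigma' \setminus \Sigma$ and $\Sigma \setminus \Sigma'$ are non-empty; this assumption results in the largest upper bound. We add an empty state $\emptyset'$ to $\mathcal{D}'_m$ and send all transitions under the letters from $\Sigma \setminus \Sigma'$ to that state; thus we get an $(m+1)$-state DFA $\mathcal{D}'_{m,\emptyset'}$. Similarly, we add an empty state $\emptyset$ to $\mathcal{D}_n$ to get $\mathcal{D}_{n,\emptyset}$. Now we have two DFAs over the same alphabet, and an ordinary problem of finding an upper bound for the boolean operations on two languages over the same alphabet, except that these languages both have empty quotients. It is clear that $(m+1)(n+1)$ is an upper bound for all four operations, but it can be improved for difference and intersection. Consider the direct product of $\mathcal{D}'_{m,\emptyset'}$ and $\mathcal{D}_{n,\emptyset}$.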

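For difference, all $n+1$ states of the direct product that have the form $(\emptyset', q)$, where $q \in Q_n \cup \{\emptyset\}$, are empty. Hence the bound can be reduced by $n$ states to $mn+m+1$. However, the empty states can only be reached by words containing a letter of $\Sigma \setminus \Sigma'$, and the alphabet of $L'_m \setminus L_n$ is a subset of $\Sigma'$; hence the bound is reduced further to $mn+m$.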

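For intersection, all $n$ states $(\emptyset', q)$, $q \in Q_n$, and all $m$ states $(p', \emptyset)$, $p' \in Q'_m$, are equivalent to the empty state $(\emptyset', \emptyset)$, thus reducing the upper bound to $mn+1$. Since the alphabet of $L'_m \cap L_n$ is a subset of $\Sigma' \cap \Sigma$, these empty states cannot be reached and the bound is reduced to $mn$.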
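The upper-bound argument can be illustrated by a short Python sketch: complete both DFAs over the union of the alphabets with an extra empty state, form the direct product, and count the states of the minimal DFA restricted to the alphabet of the result. This is a sketch under stated assumptions rather than the construction used for the witnesses: DFAs are dictionaries with hypothetical keys `states`, `alphabet`, `delta`, `init` and `finals`, the sink is named by the hypothetical constant `SINK`, and the caller supplies the alphabet `sigma` of the resulting language (for example the first DFA's alphabet for difference).

```python
SINK = "sink"  # hypothetical name for the added empty state


def complete(dfa, sigma):
    """Add an empty state and send every missing transition to it."""
    states = list(dfa["states"]) + [SINK]
    delta = {(q, a): dfa["delta"].get((q, a), SINK) for q in states for a in sigma}
    return {"delta": delta, "init": dfa["init"], "finals": set(dfa["finals"])}


def minimal_size(sigma, delta, init, is_final):
    """Number of states of the minimal complete DFA (reachability + Moore refinement)."""
    reach, stack = {init}, [init]
    while stack:
        q = stack.pop()
        for a in sigma:
            r = delta(q, a)
            if r not in reach:
                reach.add(r)
                stack.append(r)
    block = {q: int(is_final(q)) for q in reach}
    while True:
        # Signature: own block plus the blocks of all successors.
        sig = {q: (block[q],) + tuple(block[delta(q, a)] for a in sigma) for q in reach}
        ids = {}
        new = {q: ids.setdefault(sig[q], len(ids)) for q in reach}
        if len(ids) == len(set(block.values())):
            return len(ids)      # partition is stable
        block = new


def boolean_complexity(A, B, op, sigma):
    """Complexity of op(L(A), L(B)); sigma is the alphabet of the result."""
    full = sorted(set(A["alphabet"]) | set(B["alphabet"]))
    CA, CB = complete(A, full), complete(B, full)
    delta = lambda s, a: (CA["delta"][(s[0], a)], CB["delta"][(s[1], a)])
    is_final = lambda s: op(s[0] in CA["finals"], s[1] in CB["finals"])
    return minimal_size(sorted(sigma), delta, (CA["init"], CB["init"]), is_final)


# One-state DFAs for a* over {a} and b* over {b}.
A = {"states": [0], "alphabet": "a", "delta": {(0, "a"): 0}, "init": 0, "finals": {0}}
B = {"states": [0], "alphabet": "b", "delta": {(0, "b"): 0}, "init": 0, "finals": {0}}
print(boolean_complexity(A, B, lambda x, y: x or y, "ab"))       # 4
print(boolean_complexity(A, B, lambda x, y: x and not y, "a"))   # 2
```

For this toy pair with $m = n = 1$ the union needs $(m+1)(n+1) = 4$ states and the difference $mn + m = 2$ states, in line with the general bounds, although the theorem itself is stated only for larger $m$ and $n$.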

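To prove that the bounds are tight, we start with $\mathcal{D}_n(a,b,c,d)$ of Definition 3. For $m, n \ge 3$, let $\mathcal{D}'_m(a,b,-,c)$ be the dialect of $\mathcal{D}'_m(a,b,c,d)$ where $c$ plays the role of $d$ and the alphabet is restricted to $\{a,b,c\}$, and let $\mathcal{D}_n(b,a,-,d)$ be the dialect of $\mathcal{D}_n(a,b,c,d)$ in which $a$ and $b$ are permuted, and the alphabet is restricted to $\{a,b,d\}$; see Figure 5.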

We complete the two DFAs by adding empty states, and then construct the direct product of the new DFAs as illustrated in Figure 6.

If we restrict both DFAs to the alphabet , we have the usual problem of determining the complexity of two DFAs over the same alphabet. By [2, Theorem 1], all states of the form , , , are reachable and pairwise distinguishable by words in for all proper boolean operations if . For our application, the three exceptional cases were verified by computation.

To prove that the remaining states are reachable, observe that and , for . Symmetrically, and , for . Finally, , and all states of the direct product are reachable.

It remains to verify that the appropriate states are pairwise distinguishable. From [2, Theorem 1], we know that all states in are distinguishable. Let , and . For the operations consider four cases:

Union

The final states of are , and . Every state in accepts a word with a , whereas no state in accepts such words. Similarly, every state in accepts a word with a , whereas no state in accepts such words. Every state in accepts a word with a and a word with a . State accepts no words at all. Hence any two states chosen from different sets (the sets being , , , and ) are distinguishable. States in are distinguishable by words in and those in , by words in . Therefore all states are pairwise distinguishable.

Symmetric Difference

The final states here are all the final states for union except . The rest of the argument is the same as for union.

Difference

Here the final states are . The states of the form , , are now equivalent to the empty state . The remaining states are non-empty as each accepts a word in . The states of are pairwise distinguishable by words in . A state is distinguished from by , unless . If , they are distinguished by a word in that maps to , for this word must send to . Hence we have distinguishable states. However, the alphabet of is , and the empty state can only be reached by . Since this empty state is not needed, neither is , and the final bound is ; it is reached by and .

Intersection

Here only is final and all states , , and , are equivalent to , leaving distinguishable states. However, the alphabet of is , and so the empty state cannot be reached. This gives the final bound of states, and this bound is met by and as was already known in [4].

{remark}

In the restricted case the complexity of every one of the ten binary boolean functions is $mn$. In the unrestricted case one verifies that we have

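$$\kappa(L'_m \cup L_n) = \kappa(\overline{L'_m} \cap \overline{L_n}) = \kappa(L'_m \oplus L_n) = \kappa(L'_m \oplus \overline{L_n}) = (m+1)(n+1),$$

$$\kappa(\overline{L'_m} \cup L_n) = mn+m+1, \qquad \kappa(L'_m \cup \overline{L_n}) = mn+n+1,$$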