Decision Problems For Convex Languages

Janusz Brzozowski, Jeffrey Shallit, and Zhi Xu
School of Computer Science
University of Waterloo
Waterloo, ON N2L 3G1
{brzozo,shallit,z5xu}@uwaterloo.ca
Abstract

In this paper we examine decision problems associated with various classes of convex languages, studied by Ang and Brzozowski (under the name “continuous languages”). We show that we can decide whether a given language $L$ is prefix-, suffix-, factor-, or subword-convex in polynomial time if $L$ is represented by a DFA, but that the problem is PSPACE-hard if $L$ is represented by an NFA. In the case that a regular language is not convex, we prove tight upper bounds on the length of the shortest words demonstrating this fact, in terms of the number of states of an accepting DFA. Similar results are proved for some subclasses of convex languages: the prefix-, suffix-, factor-, and subword-closed languages, and the prefix-, suffix-, factor-, and subword-free languages.

1 Introduction

Thierrin [11] introduced convex languages with respect to the subword relation. Ang and Brzozowski [2] generalized this concept to arbitrary relations. For example, a language $L$ is said to be prefix-convex if, whenever $u, w \in L$ with $u$ a prefix of $w$, then any word $v$ must also be in $L$ if $u$ is a prefix of $v$ and $v$ is a prefix of $w$. Similar definitions hold for suffix-, factor-, and subword-convex languages. (In this paper, a “factor” is a contiguous block inside another word, while a “subword” need not be contiguous. In the literature, these concepts are sometimes called “subword” and “subsequence”, respectively.)

A language $L$ is said to be prefix-free if whenever $w \in L$, then no proper prefix of $w$ is in $L$. (By proper we mean a prefix of $w$ other than $w$ itself.) Prefix-free languages (prefix codes) were studied by Berstel and Perrin [4]. Han has recently considered $X$-free languages for various values of $X$, such as prefix, suffix, factor and subword [7].

A language $L$ is said to be prefix-closed if whenever $w \in L$, then every prefix of $w$ is also in $L$. Analogous definitions hold for suffix-, factor-, and subword-closed languages. A factor-closed language is often called factorial.

In this paper we consider the computational complexity of testing whether a given language has the property of being prefix-convex, suffix-convex, etc., prefix-closed, suffix-closed, etc., or prefix-free, suffix-free, etc., for a total of 12 different problems. As we will see, the computational complexity of these decision problems depends on how the language is represented. If it is represented as the language accepted by a DFA, then the decision problem is solvable in polynomial time. On the other hand, if it is represented as a regular expression or an NFA, then the decision problem is PSPACE-complete. We also consider the following question: given that a language is not prefix-convex, suffix-convex, etc., what is a good upper bound on the length of the shortest words (shortest witnesses) demonstrating this fact?

The remainder of the paper is structured as follows. In Section 2 we study the complexity of testing for convexity for languages represented by DFA’s, and we include testing for closure and freeness as special cases. In Section 3 we exhibit shortest witnesses to the failure of the convexity property. Convex languages specified by NFA’s are studied in Section 4. We also briefly consider convex languages specified by context-free grammars in Section 5. Section 6 concludes the paper.

2 Deciding convexity for DFA’s

We will show that, if a regular language $L$ is represented by a DFA $M$ with $n$ states, it is possible to test the property of prefix-, suffix-, factor-, and subword-convexity efficiently. More precisely, we can test these properties in $O(n^3)$ time.

Let $\trianglelefteq$ be one of the four relations prefix, suffix, factor, or subword. The basic idea is as follows: $L$ is not $\trianglelefteq$-convex if and only if there exist words $u, w \in L$, $v \notin L$, such that $u \trianglelefteq v \trianglelefteq w$. Given $M$, we create an NFA-$\epsilon$ $M'$ with $O(n^3)$ states and transitions that accepts the language

\[ \{ w \in L(M) : \text{there exist } u \in L(M),\ v \notin L(M) \text{ such that } u \trianglelefteq v \trianglelefteq w \}. \]

Then $L(M') = \emptyset$ if and only if $L$ is $\trianglelefteq$-convex. We can test the emptiness of $L(M')$ using depth-first search in time linear in the size of $M'$. This gives an $O(n^3)$ algorithm for testing the $\trianglelefteq$-convex property.

Since the constructions for all four properties are similar, in the next subsection we handle the hardest case (factor-convexity) in detail. In the following subsections we content ourselves with a brief sketch of the necessary constructions.

2.1 Factor-convexity

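Suppose $M = (Q, \Sigma, \delta, q_0, F)$ is a DFA accepting the language $L = L(M)$, and suppose $M$ has $n$ states. We now construct an NFA-$\epsilon$ $M'$ such that

\[ L(M') = \{ w \in \Sigma^* : \text{there exist } u, v \in \Sigma^* \text{ such that } u \text{ is a factor of } v,\ v \text{ is a factor of } w, \text{ and } u, w \in L,\ v \notin L \}. \]

Clearly $L(M') = \emptyset$ if and only if $L$ is factor-convex.

Here is the construction of $M'$. States of $M'$ are quadruples $[p,q,r,i]$, where components $p$, $q$, and $r$ keep track of where $M$ is upon processing $w$, $v$, and $u$ (respectively). The last component is a flag indicating the present mode of the simulation process.

Formally, $M' = (Q', \Sigma, \delta', q_0', F')$, where

\[ Q' = Q \times Q \times Q \times \{1,2,3,4,5\}; \quad q_0' = [q_0,q_0,q_0,1]; \quad F' = F \times (Q-F) \times F \times \{5\}; \]

and $\delta'$ is given by the following rules:

1. $\delta'([p,q_0,q_0,1],a) = \{[\delta(p,a),q_0,q_0,1]\}$, for all $p \in Q$, $a \in \Sigma$;

2. $\delta'([p,q_0,q_0,1],\epsilon) = \{[p,q_0,q_0,2]\}$, for all $p \in Q$;

3. $\delta'([p,q,q_0,2],a) = \{[\delta(p,a),\delta(q,a),q_0,2]\}$, for all $p,q \in Q$, $a \in \Sigma$;

4. $\delta'([p,q,q_0,2],\epsilon) = \{[p,q,q_0,3]\}$, for all $p,q \in Q$;

5. $\delta'([p,q,r,3],a) = \{[\delta(p,a),\delta(q,a),\delta(r,a),3]\}$, for all $p,q,r \in Q$, $a \in \Sigma$;

6. $\delta'([p,q,r,3],\epsilon) = \{[p,q,r,4]\}$, for all $p,q,r \in Q$;

7. $\delta'([p,q,r,4],a) = \{[\delta(p,a),\delta(q,a),r,4]\}$, for all $p,q,r \in Q$, $a \in \Sigma$;

8. $\delta'([p,q,r,4],\epsilon) = \{[p,q,r,5]\}$, for all $p,q,r \in Q$;

9. $\delta'([p,q,r,5],a) = \{[\delta(p,a),q,r,5]\}$, for all $p,q,r \in Q$, $a \in \Sigma$.

One verifies that the rules apply to $3n^3 + n^2 + n$ states of the NFA-$\epsilon$ $M'$ (namely $n$ states with flag 1, $n^2$ with flag 2, and $n^3$ for each of the flags 3, 4, and 5) and define $(3|\Sigma|+2)n^3 + (|\Sigma|+1)(n^2+n)$ transitions, where $|\Sigma|$ is the cardinality of $\Sigma$. For a fixed alphabet, both quantities are $O(n^3)$.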

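To see that the construction is correct, suppose $L$ is not factor-convex. Then there exist words $u, v, w$ such that $u$ is a factor of $v$, $v$ is a factor of $w$, and $u, w \in L$ while $v \notin L$. Then there exist words $u', u'', v', v''$ such that $v = u'uu''$ and $w = v'vv''$. Let $\delta(q_0,v') = p_1$, $\delta(p_1,u') = p_2$, $\delta(p_2,u) = p_3$, $\delta(p_3,u'') = p_4$, and $\delta(p_4,v'') = p_5$. Moreover, let $\delta(q_0,u') = r_1$, $\delta(r_1,u) = r_2$, and $\delta(r_2,u'') = r_3$, and $\delta(q_0,u) = s_1$. Since $u, w \in L$, we know that $s_1$ and $p_5$ are accepting states. Since $v \notin L$, we know that $r_3$ is not accepting.

Automaton $M'$ operates as follows. In the initial state $[q_0,q_0,q_0,1]$ we process the symbols of $v'$ using Rule 1, ending in the state $[p_1,q_0,q_0,1]$. At this point, we use Rule 2 to move to $[p_1,q_0,q_0,2]$ by an $\epsilon$-move. Next, we process the symbols of $u'$ using Rule 3, ending in the state $[p_2,r_1,q_0,2]$. Then we use Rule 4 to move to $[p_2,r_1,q_0,3]$ by an $\epsilon$-move. Next, we process the symbols of $u$ using Rule 5, ending in the state $[p_3,r_2,s_1,3]$. Then we use Rule 6 to move to $[p_3,r_2,s_1,4]$ by an $\epsilon$-move. Next, we process the symbols of $u''$ using Rule 7, ending in the state $[p_4,r_3,s_1,4]$. Then we use Rule 8 to move to $[p_4,r_3,s_1,5]$ by an $\epsilon$-move. Finally, we process the symbols of $v''$ using Rule 9, ending in the state $[p_5,r_3,s_1,5]$, and this state is in $F'$.

On the other hand, suppose $M'$ accepts the input $w$. Then we must have $\delta'(q_0',w) \cap F' \neq \emptyset$. But the only way to reach a state in $F'$ is, by our construction, to apply Rules 1 through 9 in that order, where odd-numbered rules can be used any number of times, and even-numbered rules can be used only once. Letting $v', u', u, u'', v''$ be the words labeling the uses of Rules 1, 3, 5, 7, and 9, respectively, we see that $w = v'u'uu''v''$, where $\delta(q_0,w) \in F$, $\delta(q_0,u'uu'') \in Q - F$, and $\delta(q_0,u) \in F$. Setting $v = u'uu''$, it follows that $u, w \in L$ and $v \notin L$, and so $L$ is not factor-convex.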

We have proved

Theorem 1.

If $M$ is a DFA with $n$ states, there exists an NFA-$\epsilon$ $M'$ with $O(n^3)$ states and transitions such that $M'$ accepts the language

\[ L(M') = \{ w \in \Sigma^* : \text{there exist } u, v \in \Sigma^* \text{ such that } u \text{ is a factor of } v,\ v \text{ is a factor of } w, \text{ and } u, w \in L,\ v \notin L \}. \]

Corollary 2.

We can decide if a given regular language $L$ accepted by a DFA with $n$ states is factor-convex in $O(n^3)$ time.

Proof.

Since $L$ is factor-convex if and only if $L(M') = \emptyset$, it suffices to check if $L(M') = \emptyset$ using depth-first search of a directed graph, in time linear in the number of vertices and edges of $M'$. ∎
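The emptiness test of Corollary 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the DFA encoding (a dictionary `delta` keyed by `(state, symbol)`, a `start` state, and a set `accepting`) and the function name `is_factor_convex` are assumptions made here, and the DFA is assumed to be complete. The states of $M'$ are generated on the fly during a depth-first traversal instead of being built up front.

```python
def is_factor_convex(alphabet, delta, start, accepting):
    """Return True iff the language of a complete DFA is factor-convex.

    A state of M' is (p, q, r, flag): p, q, r follow w, v, u respectively,
    and flag in 1..5 is the simulation mode of Section 2.1.
    """
    init = (start, start, start, 1)
    seen, stack = {init}, [init]
    while stack:
        p, q, r, flag = stack.pop()
        # F' = F x (Q - F) x F x {5}: reaching it means L(M') is nonempty.
        if flag == 5 and p in accepting and q not in accepting and r in accepting:
            return False
        # Epsilon-moves (Rules 2, 4, 6, 8) advance the flag.
        successors = [(p, q, r, flag + 1)] if flag < 5 else []
        for a in alphabet:
            successors.append((
                delta[p, a],                           # w always advances
                delta[q, a] if 2 <= flag <= 4 else q,  # v advances in modes 2-4
                delta[r, a] if flag == 3 else r,       # u advances in mode 3 only
                flag,
            ))
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True
```

For instance, with the two-state unary DFA for $(aa)^*$ the function reports that the language is not factor-convex, since $\epsilon$ and $aa$ are accepted while $a$ is not.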

2.1.1 Factor-closure

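The language $L$ is not factor-closed if and only if there exist words $v, w$ such that $v$ is a factor of $w$, and $w \in L$, while $v \notin L$.

Given a DFA $M$ accepting $L$, we construct from $M$ an NFA-$\epsilon$ $M'$ such that

\[ L(M') = \{ w \in \Sigma^* : \text{there exists } v \in \Sigma^* \text{ such that } v \text{ is a factor of } w, \text{ and } w \in L,\ v \notin L \}. \]

As before, $L(M') = \emptyset$ if and only if $L$ is factor-closed. The size of $M'$ is $O(n^2)$.

States of $M'$ are triples $[p,q,i]$, where components $p$ and $q$ keep track of where $M$ would be upon processing $w$ and $v$ (respectively). The last component is a flag as before.

Formally, $M' = (Q', \Sigma, \delta', q_0', F')$, where

\[ Q' = Q \times Q \times \{1,2,3\}; \quad q_0' = [q_0,q_0,1]; \quad F' = F \times (Q-F) \times \{3\}; \]

and

1. $\delta'([p,q_0,1],a) = \{[\delta(p,a),q_0,1]\}$, for $p \in Q$, $a \in \Sigma$;

2. $\delta'([p,q_0,1],\epsilon) = \{[p,q_0,2]\}$, for all $p \in Q$;

3. $\delta'([p,q,2],a) = \{[\delta(p,a),\delta(q,a),2]\}$, for all $p,q \in Q$, $a \in \Sigma$;

4. $\delta'([p,q,2],\epsilon) = \{[p,q,3]\}$, for all $p,q \in Q$;

5. $\delta'([p,q,3],a) = \{[\delta(p,a),q,3]\}$, for $p,q \in Q$, $a \in \Sigma$.

$M'$ has $O(n^2)$ states and transitions. Thus we have: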

Theorem 3.

We can decide if a given regular language $L$ accepted by a DFA with $n$ states is factor-closed in $O(n^2)$ time.

This result was previously obtained by Béal et al. [3, Prop. 5.1, p. 13] through a slightly different approach.
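The factor-closure test can be sketched in the same way as the convexity test, using pairs of states and three modes. This is a hedged illustration only: the dictionary-based DFA encoding and the name `is_factor_closed` are assumptions of this sketch, and the DFA is assumed to be complete.

```python
def is_factor_closed(alphabet, delta, start, accepting):
    """Return True iff the language of a complete DFA is factor-closed.

    A state is (p, q, flag): p follows w, q follows the factor v,
    and flag in 1..3 says whether we are before, inside, or after v.
    """
    init = (start, start, 1)
    seen, stack = {init}, [init]
    while stack:
        p, q, flag = stack.pop()
        # F' = F x (Q - F) x {3}: some w in L has a factor v outside L.
        if flag == 3 and p in accepting and q not in accepting:
            return False
        successors = [(p, q, flag + 1)] if flag < 3 else []
        for a in alphabet:
            # v is read only while flag == 2; w is read in every mode.
            successors.append((delta[p, a], delta[q, a] if flag == 2 else q, flag))
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True
```

At most $3n^2$ triples are ever placed on the stack, which matches the quadratic size of the automaton described above.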

The converse of the relation “$u$ is a factor of $v$” is “$v$ contains $u$ as a factor”. This converse relation and similar converse relations, derived from the prefix, suffix, and subword relations, lead to “converse-closed languages” [2]. It has been shown by de Luca and Varricchio [5] that a language $L$ is factor-closed (factorial, in their terminology) if and only if it is a complement of an ideal, that is, if and only if $L = \overline{\Sigma^* K \Sigma^*}$ for some $K \subseteq \Sigma^*$. Ang and Brzozowski [2] noted that a language $L$ is an ideal if and only if it is converse-factor-closed, that is, if, for every $u \in L$, each word of the form $xuy$ with $x, y \in \Sigma^*$ is also in $L$. Thus, to test whether $L$ is converse-factor-closed, we must check that there is no pair $(u,v)$ such that $u \in L$, $v \notin L$, and $u$ is a factor of $v$. This is equivalent to testing whether $\overline{L}$ is factor-closed. Then the following is an immediate consequence of Theorem 3:

Corollary 4.

We can decide if a given regular language $L$ accepted by a DFA with $n$ states is an ideal in $O(n^2)$ time.

The results above also apply to other converse-closed languages. Similarly, any result about the size of a witness demonstrating the lack of prefix-, suffix-, or subword-closure applies also to a witness demonstrating the lack of converse-prefix-, converse-suffix-, or converse-subword-closure, respectively. Subword-closed and converse-subword-closed languages were also investigated and characterized by Thierrin [11].

2.1.2 Factor-freeness

Factor-free languages (also known as infix-free) have recently been studied by Han et al. [8]; they gave an efficient algorithm for determining if the language accepted by an NFA is prefix-free, suffix-free, or factor-free.

We can decide whether a DFA language is factor-free in $O(n^2)$ time with the automaton we used for testing factor-closure, except that the set of accepting states is now

 F′=F×F×{3}.

Similar results hold for prefix-free, suffix-free, and subword-free languages.

2.2 Prefix-convexity

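Prefix-convexity can be tested in an analogous fashion. We give the construction of $M'$ without proof: let $M' = (Q', \Sigma, \delta', q_0', F')$, where

\[ Q' = Q \times Q \times Q \times \{1,2,3\}; \quad q_0' = [q_0,q_0,q_0,1]; \quad F' = F \times (Q-F) \times F \times \{3\}; \]

\begin{align*}
\delta'([p,q,r,1],a) &= \{[\delta(p,a),\delta(q,a),\delta(r,a),1]\} && \text{for } p,q,r \in Q,\ a \in \Sigma; \\
\delta'([p,q,r,1],\epsilon) &= \{[p,q,r,2]\} && \text{for } p,q,r \in Q; \\
\delta'([p,q,r,2],a) &= \{[\delta(p,a),\delta(q,a),r,2]\} && \text{for } p,q,r \in Q,\ a \in \Sigma; \\
\delta'([p,q,r,2],\epsilon) &= \{[p,q,r,3]\} && \text{for } p,q,r \in Q; \\
\delta'([p,q,r,3],a) &= \{[\delta(p,a),q,r,3]\} && \text{for } p,q,r \in Q,\ a \in \Sigma.
\end{align*}

The NFA-$\epsilon$ $M'$ has $O(n^3)$ states and transitions.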
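The prefix construction differs from the factor construction only in which of the three components advance in each mode. The following Python sketch makes that explicit by passing the pattern in as a `schedule`; the encoding, the names `has_witness`, `PREFIX`, and `SUFFIX`, and the dictionary-based complete DFA are all assumptions of this illustration rather than notation from the paper. The `SUFFIX` schedule mirrors the construction given for suffix-convexity in Section 2.3.

```python
# schedule[i] = (advance_w, advance_v, advance_u) for mode i.
PREFIX = [(True, True, True), (True, True, False), (True, False, False)]
SUFFIX = [(True, False, False), (True, True, False), (True, True, True)]

def has_witness(alphabet, delta, start, accepting, schedule):
    """Return True iff some (u, v, w) with u, w in L and v not in L is found,
    where the relation among u, v, w is the one encoded by `schedule`."""
    last = len(schedule) - 1
    init = (start, start, start, 0)
    seen, stack = {init}, [init]
    while stack:
        p, q, r, mode = stack.pop()
        if mode == last and p in accepting and q not in accepting and r in accepting:
            return True
        # Epsilon-move to the next mode, if any.
        successors = [(p, q, r, mode + 1)] if mode < last else []
        adv_w, adv_v, adv_u = schedule[mode]
        for a in alphabet:
            successors.append((
                delta[p, a] if adv_w else p,
                delta[q, a] if adv_v else q,
                delta[r, a] if adv_u else r,
                mode,
            ))
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return False
```

A language is prefix-convex exactly when `has_witness(..., PREFIX)` returns `False`. For the finite language $\{b, aab\}$, for example, the prefix schedule finds no witness, while the suffix schedule finds $(b, ab, aab)$.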

2.2.1 Prefix-closure

By varying the construction as before, we have

Theorem 5.

We can decide if a given regular language $L$ accepted by a DFA with $n$ states is prefix-closed, suffix-closed, or subword-closed in $O(n^2)$ time.

2.2.2 Prefix-freeness

See Section 2.1.2.

2.3 Suffix-convexity

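Suffix-convexity can be tested in an analogous fashion. We give the construction of $M'$ without proof. Let $M' = (Q', \Sigma, \delta', q_0', F')$, where

\[ Q' = Q \times Q \times Q \times \{1,2,3\}; \quad q_0' = [q_0,q_0,q_0,1]; \quad F' = F \times (Q-F) \times F \times \{3\}; \]

\begin{align*}
\delta'([p,q_0,q_0,1],a) &= \{[\delta(p,a),q_0,q_0,1]\} && \text{for } p \in Q,\ a \in \Sigma; \\
\delta'([p,q_0,q_0,1],\epsilon) &= \{[p,q_0,q_0,2]\} && \text{for } p \in Q; \\
\delta'([p,q,q_0,2],a) &= \{[\delta(p,a),\delta(q,a),q_0,2]\} && \text{for } p,q \in Q,\ a \in \Sigma; \\
\delta'([p,q,q_0,2],\epsilon) &= \{[p,q,q_0,3]\} && \text{for } p,q \in Q; \\
\delta'([p,q,r,3],a) &= \{[\delta(p,a),\delta(q,a),\delta(r,a),3]\} && \text{for } p,q,r \in Q,\ a \in \Sigma.
\end{align*}

The NFA-$\epsilon$ $M'$ has $O(n^3)$ states and transitions.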

For results on suffix-closure and suffix-freeness, see Theorem 5 and Section 2.1.2, respectively.

2.4 Subword-convexity

Subword-convexity can be tested in an analogous fashion. We give the construction of $M'$ without proof. Let $M' = (Q', \Sigma, \delta', q_0', F')$, where

\[ Q' = Q \times Q \times Q; \quad q_0' = [q_0,q_0,q_0]; \quad F' = F \times (Q-F) \times F; \]

\[ \delta'([p,q,r],a) = \{ [\delta(p,a),q,r],\ [\delta(p,a),\delta(q,a),r],\ [\delta(p,a),\delta(q,a),\delta(r,a)] \}, \quad \text{for all } p,q,r \in Q \text{ and } a \in \Sigma. \]

The NFA $M'$ has $O(n^3)$ states and transitions.

The idea is that as the symbols of $w$ are read, we keep track of the state of $M$ in the first component. We then “guess” which symbols of the input also belong to $v$ and/or $u$, enforcing the condition that, if a symbol belongs to $u$, then it must belong to $v$, and if it belongs to $v$, then it must belong to $w$. We therefore cover all possibilities of words $u, v, w$ such that $u$ is a subword of $v$ and $v$ is a subword of $w$.
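The guessing step has no modes and no $\epsilon$-moves, so the search is over plain triples of states. A minimal Python sketch, under the same assumptions as before (a complete DFA given as a dictionary keyed by `(state, symbol)`; the function name `is_subword_convex` is chosen here for illustration):

```python
def is_subword_convex(alphabet, delta, start, accepting):
    """Return True iff the language of a complete DFA is subword-convex.

    A state (p, q, r) records where the DFA is after w, v, u, where each
    input symbol is given to w only, to w and v, or to all three words.
    """
    init = (start, start, start)
    seen, stack = {init}, [init]
    while stack:
        p, q, r = stack.pop()
        # F' = F x (Q - F) x F
        if p in accepting and q not in accepting and r in accepting:
            return False
        for a in alphabet:
            pa, qa, ra = delta[p, a], delta[q, a], delta[r, a]
            # The three nondeterministic choices of the transition function.
            for s in ((pa, q, r), (pa, qa, r), (pa, qa, ra)):
                if s not in seen:
                    seen.add(s)
                    stack.append(s)
    return True
```

Each triple is expanded at most once, so the running time is proportional to the $O(n^3)$ size of the automaton for a fixed alphabet.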

For results on subword-closure and subword-freeness, see Theorem 5 and Section 2.1.2, respectively.

2.5 Almost convex languages

As we have seen, a language $L$ is prefix-convex if and only if there are no triples $(u,v,w)$ with $u$ a prefix of $v$, $v$ a prefix of $w$, and $u, w \in L$, $v \notin L$. We call such a triple a witness. A language could fail to be prefix-convex because there are infinitely many witnesses (for example, the language ), or it could fail because there is at least one, but only finitely many witnesses (for example, the language ).

We define a language to be almost prefix-convex if there exists at least one, but only finitely many witnesses to the failure of the prefix-convex property. Analogously, we define almost suffix-, almost factor-, and almost subword-convex.

Theorem 6.

Let $L$ be a regular language accepted by a DFA with $n$ states. Then we can determine if $L$ is almost prefix-convex (respectively, almost suffix-convex, almost factor-convex, almost subword-convex) in $O(n^3)$ time.

Proof.

We give the proof for the almost factor-convex property, leaving the other cases to the reader.

Consider the NFA-$\epsilon$ $M'$ defined in Section 2.1. As we have seen, $M'$ accepts the language

\[ L(M') = \{ w \in \Sigma^* : \text{there exist } u, v \in \Sigma^* \text{ such that } u \text{ is a factor of } v,\ v \text{ is a factor of } w, \text{ and } u, w \in L,\ v \notin L \}. \]

Then, provided $L(M') \neq \emptyset$, $M'$ accepts an infinite language if and only if $L$ is not almost factor-convex. For if $M'$ accepts infinitely many distinct words, then there are infinitely many distinct witnesses, while if there are infinitely many distinct witnesses $(u,v,w)$, then there must be infinitely many distinct $w$ among them, since the lengths of $u$ and $v$ are bounded by $|w|$.

Thus it suffices to see if $M'$ accepts an infinite language. If $M'$ were an NFA, this would be trivial: first, we remove all states not reachable from the start state or from which we cannot reach a final state. Next, we look for the existence of a cycle. All three goals can be easily accomplished in time linear in the size of $M'$, using depth-first search.

However, $M'$ is an NFA-$\epsilon$, so there is one additional complication: namely, that the cycle we find might be labeled completely by $\epsilon$-transitions. To solve this, we use an idea suggested to us by Jack Zhao and Timothy Chan (personal communication): we find all the strongly connected components of the transition graph of $M'$ (which can be done in linear time) and then, for each edge labeled with something other than $\epsilon$ (corresponding to the transition $q \in \delta'(p,a)$ for some $a \in \Sigma$), we check to see if $p$ and $q$ are in the same strongly connected component. If they are, we have found a cycle labeled with something other than $\epsilon$. This technique runs in linear time in the size of the NFA-$\epsilon$. ∎
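The infiniteness test in the proof can be sketched in Python as follows. The edge-list encoding (triples `(src, label, dst)` with `label=None` for an $\epsilon$-move) and the function name are assumptions of this sketch; strongly connected components are computed here with Kosaraju's two-pass method, which is one of several linear-time choices and is not prescribed by the text.

```python
def accepts_infinite_language(start, finals, edges):
    """edges: list of (src, label, dst); label None denotes an epsilon-move."""
    fwd, bwd = {}, {}
    for s, _, d in edges:
        fwd.setdefault(s, []).append(d)
        bwd.setdefault(d, []).append(s)

    def reach(sources, graph):
        seen, stack = set(sources), list(sources)
        while stack:
            x = stack.pop()
            for y in graph.get(x, []):
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    # Keep only states reachable from the start that can reach a final state.
    useful = reach([start], fwd) & reach(finals, bwd)

    # Pass 1: record states in order of DFS completion on the forward graph.
    order, visited = [], set()
    for root in useful:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(fwd.get(root, [])))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt in useful and nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(fwd.get(nxt, []))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    comp = {}
    for root in reversed(order):
        if root in comp:
            continue
        comp[root] = root
        stack = [root]
        while stack:
            x = stack.pop()
            for y in bwd.get(x, []):
                if y in useful and y not in comp:
                    comp[y] = root
                    stack.append(y)

    # A non-epsilon edge inside one component lies on a cycle that reads a letter.
    return any(label is not None and s in useful and d in useful
               and comp[s] == comp[d] for s, label, d in edges)
```

A cycle made only of $\epsilon$-moves never triggers the final check, which is exactly the complication the proof addresses.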

2.5.1 Almost closed languages

In analogy with Section 2.5, we can define a language to be almost prefix-closed if there exists at least one, but only finitely many witnesses to the failure of the prefix-closed property. Analogously, we define almost suffix-, almost factor-, and almost subword-closed.

Theorem 7.

Let $L$ be a regular language accepted by a DFA with $n$ states. Then we can determine if $L$ is almost prefix-closed (respectively, almost suffix-closed, almost factor-closed, almost subword-closed) in $O(n^2)$ time.

Proof.

Just like the proof of Theorem 6. ∎

2.5.2 Almost free languages

In a similar way, we can define a language to be almost prefix-free if there exists at least one, but only finitely many witnesses to the failure of the prefix-free property. Analogously, we define almost suffix-, almost factor-, and almost subword-free.

Theorem 8.

Let $L$ be a regular language accepted by a DFA with $n$ states. Then we can determine if $L$ is almost prefix-free (respectively, almost suffix-free, almost factor-free, almost subword-free) in $O(n^2)$ time.

Proof.

Just like the proof of Theorem 6. ∎

3 Minimal witnesses

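Let $\trianglelefteq$ represent one of the four relations: factor, prefix, suffix, or subword. A necessary and sufficient condition that a language $L$ be not $\trianglelefteq$-convex is the existence of a triple $(u,v,w)$ of words, where $u, w \in L$, $v \notin L$, $u \trianglelefteq v$, and $v \trianglelefteq w$. As before, we call such a triple a witness to the lack of $\trianglelefteq$-convexity. A witness $(u,v,w)$ is minimal if every other witness $(u',v',w')$ satisfies $|w| < |w'|$, or $|w| = |w'|$ and $|v| < |v'|$, or $|w| = |w'|$, $|v| = |v'|$, and $|u| \le |u'|$. The size of a witness is $|w|$.

Similarly, if $L$ is not $\trianglelefteq$-closed, then $(v,w)$ is a witness if $w \in L$, $v \notin L$, and $v \trianglelefteq w$. A witness $(v,w)$ is minimal if there exists no witness $(v',w')$ such that $|w'| < |w|$, or $|w'| = |w|$ and $|v'| < |v|$. The size is again $|w|$. For $\trianglelefteq$-freeness, witness, minimal witness, and size are defined as for $\trianglelefteq$-closure, except that both words are in $L$ (with $v \neq w$).

Suppose we are given a regular language $L$ specified by an $n$-state DFA $M$, and we know that $L$ is not $\trianglelefteq$-convex (respectively, $\trianglelefteq$-closed or $\trianglelefteq$-free). A natural question then is, what is a good upper bound on the size of the shortest witness that demonstrates the lack of this property?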
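For small examples, the ordering on convexity witnesses can be explored by brute force. The sketch below is only illustrative and exponential in `max_len`: the membership predicate `in_lang`, the helper names, and the length cap are assumptions of this sketch, and witnesses longer than the cap are not found.

```python
from itertools import product

def is_prefix(x, y):
    return y.startswith(x)

def is_suffix(x, y):
    return y.endswith(x)

def is_factor(x, y):
    return x in y

def is_subword(x, y):
    it = iter(y)
    return all(c in it for c in x)  # consumes y left to right

def witnesses(in_lang, rel, alphabet, max_len):
    """Yield every witness (u, v, w) built from words of length <= max_len."""
    words = ["".join(t) for n in range(max_len + 1)
             for t in product(alphabet, repeat=n)]
    members = [x for x in words if in_lang(x)]
    for w in members:
        for v in words:
            if not in_lang(v) and rel(v, w):
                for u in members:
                    if rel(u, v):
                        yield (u, v, w)

def minimal_witness(in_lang, rel, alphabet, max_len):
    """Smallest witness in the order (|w|, |v|, |u|), or None if none is found."""
    return min(witnesses(in_lang, rel, alphabet, max_len),
               key=lambda t: (len(t[2]), len(t[1]), len(t[0])),
               default=None)
```

For the unary language of words whose length is congruent to 2 modulo 3, `minimal_witness(lambda x: len(x) % 3 == 2, is_prefix, "a", 6)` returns `("aa", "aaa", "aaaaa")`, a witness of size 5.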

3.1 Factor-convexity

From Theorem 1, we get an upper bound for a witness to the lack of factor-convexity.

Corollary 9.

Suppose $L$ is accepted by a DFA with $n$ states and is not factor-convex. Then there exists a witness $(u,v,w)$ such that $|w| = O(n^3)$.

Proof.

In our proof of Theorem 1, we constructed an NFA-$\epsilon$ $M'$ with $O(n^3)$ states accepting the language $L(M')$ described there. Thus, if $L$ is not factor-convex, $M'$ accepts such a word $w$, and the length of a shortest such $w$ is clearly bounded above by the number of states of $M'$ minus $1$. ∎

It turns out that the bound in Corollary 9 is best possible:

Theorem 10.

There exists a class of non-factor-convex regular languages $L_n$, accepted by DFA’s with $O(n)$ states, such that the size of the minimal witness is $\Omega(n^3)$.

The proof is postponed to Section 3.3 below.

Results analogous to Corollary 9 hold for prefix-, suffix-, and subword-convex languages. However, in some cases we can do better, as we show below.

3.1.1 Factor-closure

Theorem 3 gives us an $O(n^2)$ upper bound on the length of a witness to the failure of the factor-closed property:

Corollary 11.

If $L$ is accepted by a DFA with $n$ states and is not factor-closed, then there exists a witness $(v,w)$ such that $|w| = O(n^2)$.

It turns out that this upper bound is best possible. Let be a DFA , where , , . For , , the transition function is

 δ(q0,0) = q0, δ(q0,1) = q1, δ(qi,0) = {qi+1, if i

The DFA has states. For , is illustrated in Figure 1.

Then we have the following theorem:

Theorem 12.

For the DFA above, let . For any witness to the lack of factor-closure we have , and this bound is achievable.

Proof.

Let be a minimal witness. Since the only rejecting state in leads only to itself, all the states along the accepting path of are final. We claim that is a suffix of , that is, for some . Otherwise, if the last letter of is not the last letter of , we can just omit it and get a shorter , which contradicts the minimality of . Similarly, all the states along the rejecting path of except the last one are final; otherwise, we get a shorter .

First, we prove that the set of states along the accepting path of includes both states and states. Let for . Then . If is a state, we are done. Otherwise, let for some . If , then , a contradiction. If , then , which is a state. Otherwise, , a contradiction. Hence, the set of states along the accepting path of includes both states and states.

Now, consider the set of states along the rejecting path of . We prove that the set of states along the rejecting path of includes only states. Suppose it includes both states and states. Since there is only one transition from a state to a state and all transitions from a state to a state are to the rejecting state , we have , where , and

\[ u_2 \in L_1 = 1(0^{n+1})^*(\epsilon + 0 + 00 + \cdots + 0^{n-1})1. \]

Since is a suffix of , the last letter of is also . So, by the construction of , we have that , where , and

\[ v_2 \in L_2 = 1(0^{n+1})^* 0^n 1. \]

It is obvious that , which contradicts the equality . Therefore, the set of states along the rejecting path of includes only states.

Consider the last block of $0$’s in the words $u$ and $v$. By the structure of $M$, we have

\[ u \in \Sigma^* 1 (0^n)^* 0^{n-1} 1, \]

and

\[ v \in \Sigma^* 1 (0^{n+1})^* 0^n 1. \]

Therefore, the length of the last block of ’s is at least . In other words, . Since the shortest word that leads to state (which is the only state having a transition to a state on input ) is , we also have , and the first part of this theorem proved.

To see that equality is achieved, let and

3.1.2 Factor-freeness

From the remarks in Section 2.1.2, we get

Corollary 13.

If $L$ is accepted by a DFA with $n$ states and is not factor-free, then there exists a witness $(v,w)$ such that $|w| = O(n^2)$.

Up to a constant, Corollary 13 is best possible, as the following theorem shows.

Theorem 14.

There exists a class of languages accepted by DFA’s with $O(n)$ states, such that the smallest witness showing the language not factor-free is of size $\Omega(n^2)$.

Proof.

Let . This language can be accepted by a DFA with states. However, the shortest witness to lack of factor-freeness is , which has size . ∎

3.2 Prefix-convexity

For prefix-convexity, we have the following theorem.

Theorem 15.

Let $M$ be a DFA with $n$ states. Then if $L(M)$ is not prefix-convex, there exists a witness $(u,v,w)$ with $|w| \le 2n-1$. Furthermore, this bound is best possible, as for all $n \ge 2$, there exists a unary DFA with $n$ states that achieves this bound.

Proof.

If $L(M)$ is not prefix-convex, then such a witness $(u,v,w)$ exists. Without loss of generality, assume that $(u,v,w)$ is minimal. Now write $w = uyz$, where $v = uy$ and $w = vz$.

Let $p = \delta(q_0,u)$, $q = \delta(q_0,v)$, and $r = \delta(q_0,w)$. Let $P$ be the path from $q_0$ to $r$ traversed by $w$, and let $P_1$ be the states from $q_0$ to $p$ (not including $p$), $P_2$ be the states from $p$ to $q$ (not including $q$), and $P_3$ be the states from $q$ to $r$ (not including $r$). See Figure 2. Since $(u,v,w)$ is minimal, we know that every state of $P_1$ is rejecting, since we could have found a shorter $u$ if there were an accepting state among them. Similarly, every state of $P_2$ must be accepting, for, if there were a rejecting state among them, we could have found a shorter $v$ and hence a shorter $w$. Finally, every state of $P_3$ must be rejecting, since, if there were an accepting state, we could have found a shorter $w$.

Let $r_i = |P_i|$ for $i = 1, 2, 3$. There are no repeated states in $P_1$, for if there were, we could cut out the loop to get a shorter $w$; the same holds for $P_2$ and $P_3$. Thus $r_i \le n$ for $i = 1, 2, 3$.

Now $P_1$ and $P_2$ are disjoint, since all the states of $P_1$ are rejecting, while all the states of $P_2$ are accepting. Similarly, the states of $P_2$ are disjoint from $P_3$. So $r_1 + r_2 \le n$ and $r_2 + r_3 \le n$. It follows that $r_1 + 2r_2 + r_3 \le 2n$. Since $r_2 \ge 1$ (because $P_2$ contains $p$), it follows that $|w| = r_1 + r_2 + r_3 \le 2n - 1$.

To see that $2n-1$ is optimal, consider the DFA of $n$ states accepting the unary language $L_n = a^{n-1}(a^n)^*$. Then $L_n$ is not prefix-convex, and the shortest witness is $(a^{n-1}, a^n, a^{2n-1})$. ∎
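The $2n-1$ bound can be checked experimentally on small unary DFAs. The sketch below does not use the three-component automaton of Section 2.2; it relies on the observation that for the prefix relation $u$, $v$, and $w$ are all prefixes of $w$, so a single run of the DFA with a three-valued mode suffices. Both this simplification and the test language $a^{n-1}(a^n)^*$ (an $n$-state cycle with one accepting state) are assumptions made for illustration.

```python
from collections import deque

def shortest_prefix_witness_length(alphabet, delta, start, accepting):
    """Length of a shortest w in a prefix-convexity witness, or None.

    mode 0: no accepted prefix seen yet (still looking for u);
    mode 1: u found, looking for a rejected longer prefix v;
    mode 2: v found, looking for an accepted longer prefix w.
    """
    def promote(state, mode):
        # Committing to u or v as early as possible never hurts.
        if mode == 0 and state in accepting:
            return 1
        if mode == 1 and state not in accepting:
            return 2
        return mode

    init = (start, promote(start, 0))
    dist = {init: 0}
    queue = deque([init])
    while queue:
        state, mode = queue.popleft()
        if mode == 2 and state in accepting:
            return dist[state, mode]
        for a in alphabet:
            nxt_state = delta[state, a]
            nxt = (nxt_state, promote(nxt_state, mode))
            if nxt not in dist:
                dist[nxt] = dist[state, mode] + 1
                queue.append(nxt)
    return None

def cyclic_unary_dfa(n):
    """n-state cycle over {a}; with accepting state n-1 it accepts a^(n-1)(a^n)*."""
    return {(i, "a"): (i + 1) % n for i in range(n)}
```

Because the search is breadth-first over at most $3n$ pairs, the first accepting pair reached gives the shortest $w$; on the cyclic DFAs above with accepting set `{n - 1}` the result equals $2n-1$.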

3.2.1 Prefix-closure

For prefix-closed languages we can get an even better bound.

Theorem 16.

Let $M$ be an $n$-state DFA, and suppose $L = L(M)$ is not prefix-closed. Then the minimal witness $(v,w)$ showing $L$ is not prefix-closed has $|w| \le n$, and this is best possible.

Proof.

Assume that $(v,w)$ is a minimal witness. Consider the path $P$ from $q_0$ to $\delta(q_0,w)$, passing through $\delta(q_0,v)$. Let $P_1$ denote the part of the path from $q_0$ to $\delta(q_0,v)$ (not including $\delta(q_0,v)$) and $P_2$ denote the part of the path from $\delta(q_0,v)$ to $\delta(q_0,w)$ (not including $\delta(q_0,w)$). Then all the states traversed in $P_2$ must be rejecting, because if any were accepting we would get a shorter $w$. Similarly, all the states traversed in $P_1$ must be accepting, because otherwise we could get a shorter $v$. Neither $P_1$ nor $P_2$ contains a repeated state, because if they did, we could “cut out the loop” to get a shorter $v$ or $w$. Furthermore, the states in $P_1$ are disjoint from $P_2$. So the total number of states in the path to $\delta(q_0,w)$ (not counting $\delta(q_0,w)$) is at most $n$. Thus $|w| \le n$.

The result is best possible, as the example of the unary language $L_n = (a^n)^*$ with $n \ge 2$ shows. This language is not prefix-closed, can be accepted by a DFA with $n$ states, and the smallest witness is $(a, a^n)$. ∎

3.2.2 Prefix-freeness

For the prefix-free property we have:

Theorem 17.

If $L$ is accepted by a DFA with $n$ states and is not prefix-free, then there exists a witness $(v,w)$ with $|w| \le 2n-1$. The bound is best possible.

Proof.

The proof is similar to that of Theorem 15. The bound is achieved by a unary DFA accepting $a^{n-1}(a^n)^*$. ∎

3.3 Suffix-convexity

For the suffix-convex property, the cubic upper bound implied by Corollary 9 is best possible, up to a constant factor.

Theorem 18.

There exists a class of non-suffix-convex regular languages $L_n$, accepted by DFA’s with $O(n)$ states, such that the size of the minimal witness is $\Omega(n^3)$.

Proof.

Let

\[ L = bbb(a^{n-1})^+ \ \cup\ bb(a + aa + \cdots + a^{n-1})(a^n)^* \ \cup\ b(a^{n+1})^+. \]