# Deciding Regularity of the Set of Instances of a Set of Terms with Regular Constraints is EXPTIME-Complete

Omer Giménez and Guillem Godoy: Universitat Politècnica de Catalunya, Barcelona, Spain (ogimenez,ggodoy@lsi.upc.edu). The second author was supported by Spanish Min. of Educ. and Science by the FORMALISM project (TIN2007-66523) and by the LOGICTOOLS-2 project (TIN2007-68093-C02-01).

Sebastian Maneth: NICTA and University of New South Wales, Sydney, Australia (sebastian.maneth@nicta.com.au)

###### Abstract

Finite-state tree automata are a well studied formalism for representing term languages. This paper studies the problem of determining the regularity of the set of instances of a finite set of terms with variables, where each variable is restricted to instantiations of a regular set given by a tree automaton. The problem was recently proved decidable, but with an unknown complexity. Here, the exact complexity of the problem is determined by proving EXPTIME-completeness. The main contribution is a new, exponential time algorithm that performs various exponential transformations on the involved terms and tree automata, and decides regularity by analyzing formulas over inequality and height predicates.

Key words. EXPTIME complexity, regularity, terms with variables, pattern matching, regular constraints

AMS subject classifications. 68Q17, 68Q42, 68Q45

## 1 Introduction

Finite representations of infinite sets of terms are useful in many areas of computer science. The choice of formalism for this purpose depends on its expressiveness, but also on its computational properties. Finite-state tree automata (TA) [6, 2] are a well studied formalism for representing term languages, due to their good computational and expressiveness properties. They characterize the “regular term languages”, a classical concept used, e.g., to describe the parse trees of a context-free grammar or the well-formed terms over a sorted signature [12], to characterize the solutions of formulas in monadic second-order logic [4], and to naturally capture type formalisms for tree-structured XML data [13, 1]. Similar to the case of regular sets of words, regular term languages have numerous convenient properties such as closure under Boolean operations (intersection, union, negation), decidable properties such as finiteness and inclusion, and they are characterized by many different formalisms such as regular grammars, regular term expressions, congruence classes of finite index, deterministic bottom-up TA, nondeterministic top-down TA, or sentences of monadic second-order logic [2]. Deterministic TA, for instance, can be effectively minimized and give rise to efficient parsing.

When the formalism used to represent an infinite set of terms is not a TA, it is often expedient to decide whether the represented set is in fact regular. A simple and natural way of describing an infinite set of terms is through the use of “patterns”. A pattern is a term with variables; it describes all terms obtained by replacing the variables by (variable-free) terms; see, e.g., [11, 10], and the references given there. Term patterns are used for pattern matching in most modern programming languages, and were already present in very early languages such as LISP. They are a central concept in compiling, natural language processing, automated deduction, term rewriting, etc. In some of these applications, variables in patterns are restricted to be replaced by terms in a regular language. For example, in a programming language with regular types (see, for instance, [8, 9]), variable instances might be constrained to regular term languages. Typically, term patterns in a programming language must be linear (i.e., every variable occurs at most once) in order to guarantee that the resulting type is regular. Our result shows that even if non-linear patterns are allowed (which is the case in logic programming languages such as Prolog), one can statically determine regularity, i.e., the existence of an exact regular type, in exponential time.

More precisely, we consider the problem of determining the regularity of the set of instances of a set of terms with regular constraints, which we abbreviate as the “RITRC” problem. A particular case of this problem, in which variables can be replaced by arbitrary terms (without variables), was considered in [11] and shown to be coNP-complete (cf. also [10]). The general RITRC problem was recently proved decidable [7]. The complexity of their decision procedure was left open in [7], but can easily be seen to exceed exponential time. Moreover, their solution is based on a rather general result of [3] about first-order formulas with regular constraints, for which the complexity is not known.

In this paper, we determine the complexity of the RITRC problem by proving that it is EXPTIME-complete. At the beginning of Section 3 we show that the RITRC problem is EXPTIME-hard. This is done via a straightforward reduction from the finite intersection emptiness problem for tree automata. The remaining part of Section 3 describes an EXPTIME algorithm solving the problem, starting with an overview of it in Section 3.1. In summary, the algorithm first changes the regular constraints from several TA to one single tree automaton (of exponential size) with special properties. It then picks a non-linear term $t$ from the given set $S$ of terms, and checks the “infinite instances property of $t$ in $S$”: are there infinitely many instantiations of a non-linear variable in $t$, which are not instances of $S \setminus \{t\}$ (under the regular constraints)? If the infinite instances property holds for some $t$ in $S$, then our algorithm stops and we know that the set of terms represented by $S$ (under the regular constraints) is not regular. Otherwise, we can replace $t$ by a new term that is linear in the variables, i.e., which does not contain duplicated variables. Roughly speaking, our algorithm then starts over again, with the new set of terms. In this way, the algorithm will construct a set of terms in which all terms are linear in the variables, if and only if the represented set is regular. To check the infinite instances property of $t$ in $S$, we instantiate the term $t$ at all non-variable positions of terms in $S \setminus \{t\}$, and then formulate inequality constraints of the resulting terms with terms of $S \setminus \{t\}$. It is a non-trivial task to efficiently solve such inequality constraints. In fact, in order to solve systems of such inequality constraints in EXPTIME, it was a crucial step for us to introduce additional height constraints on the variables of the inequality constraints. The final formula over height and inequality predicates characterizes all instances of $t$ that are not instances of terms in $S \setminus \{t\}$. Our algorithm solves the RITRC problem in exponential time by iteratively constructing and solving such formulas.

## 2 Preliminaries

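The size of a set $S$ is denoted by $|S|$. A signature consists of an alphabet $\Sigma$, i.e., a finite set of symbols, together with a mapping that assigns to each symbol in $\Sigma$ a natural number, its arity. We write $\Sigma^{(m)}$ to denote the subset of symbols in $\Sigma$ that are of arity $m$, and we write $f^{(m)}$ to denote that $f$ is a symbol of arity $m$. The set of all terms over $\Sigma$ is denoted $T(\Sigma)$ and is inductively defined as the smallest set $T$ such that for every $f \in \Sigma^{(m)}$, $m \ge 0$, and $t_1,\ldots,t_m \in T$, the term $f(t_1,\ldots,t_m)$ is in $T$. For a term of the form $a()$ we simply write $a$. For instance, if $\Sigma = \{f^{(2)}, a^{(0)}\}$ then $T(\Sigma)$ is the set of all terms that represent binary trees with internal nodes labeled $f$ and leaves labeled $a$. We fix the set $X$ of variables, i.e., any set $V$ of variables is always assumed to be a subset of $X$. The set of terms over $\Sigma$ with variables in $X$, denoted $T(\Sigma, X)$, is the set of terms over $\Sigma \cup X$ where every symbol in $X$ has arity zero. By $\mathrm{Vars}(t)$ we denote the set of variables that occur in $t$. By $|t|$ we denote the size of $t$, defined recursively as $|f(t_1,\ldots,t_m)| = 1 + |t_1| + \cdots + |t_m|$ for each $f \in \Sigma^{(m)}$, $m \ge 0$, and $t_1,\ldots,t_m \in T(\Sigma, X)$, and $|x| = 1$ for each $x$ in $X$. By $\mathrm{height}(t)$ we denote the height of $t$, defined recursively as $\mathrm{height}(f(t_1,\ldots,t_m)) = 1 + \max\{\mathrm{height}(t_1),\ldots,\mathrm{height}(t_m)\}$ for each $f \in \Sigma^{(m)}$, $m \ge 1$, and $t_1,\ldots,t_m \in T(\Sigma, X)$, with separate base cases for each constant $a \in \Sigma^{(0)}$ and for each variable $x \in X$. Given a term $t = f(t_1,\ldots,t_m)$, its set of positions $\mathrm{Pos}(t)$ equals $\{\lambda\} \cup \{i.p \mid 1 \le i \le m \wedge p \in \mathrm{Pos}(t_i)\}$. Here, $\lambda$ denotes the root node, and $p.i$ denotes the $i$th child of position $p$. The subterm of $t$ at position $p$ is denoted by $t|_p$, and the symbol of $t$ at position $p$ is denoted by $t(p)$; we say that $p$ is labeled by $t(p)$. For instance, for , equals and position is labeled by . For a set $\Gamma \subseteq \Sigma \cup X$, we use $\mathrm{Pos}_\Gamma(t)$ to denote the set of positions of $t$ that are labeled by symbols in $\Gamma$. In particular, we define for $t$ the sets $\mathrm{Pos}_X(t)$ and $\mathrm{Pos}_\Sigma(t)$ of variable positions and non-variable positions as the positions of $t$ labeled by symbols in $X$ and in $\Sigma$, respectively. E.g., for as above, and . When a position $p$ is of the form $p_1.p_2$, we say that $p_1$ is a prefix of $p$. For a set of positions , we denote by the set . For terms $s$ and $t$, we denote by $s[t]_p$ the result of replacing the subterm at position $p$ in $s$ by the term $t$. For instance, .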
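The position-based operations of this paragraph can be made concrete with a small sketch. The encoding is an assumption made only for illustration: a term is a nested tuple such as `("f", t1, t2)`, a constant is a one-element tuple such as `("a",)`, a variable is a plain string, and a position is a tuple of 1-based child indices with `()` standing for the root. The function names are hypothetical.

```python
# Assumed encoding: ("f", t1, t2) is f(t1, t2), ("a",) is the constant a,
# a variable is a plain string, and a position is a tuple of 1-based indices.

def is_var(t):
    return isinstance(t, str)

def size(t):
    """Number of symbol occurrences in t (each variable counts as 1)."""
    if is_var(t):
        return 1
    return 1 + sum(size(c) for c in t[1:])

def positions(t):
    """All positions of t, root first."""
    result = [()]
    if not is_var(t):
        for i, child in enumerate(t[1:], start=1):
            result.extend((i,) + p for p in positions(child))
    return result

def subterm(t, p):
    """The subterm of t at position p."""
    for i in p:
        t = t[i]
    return t

def symbol(t, p):
    """The symbol (or variable) labelling position p of t."""
    s = subterm(t, p)
    return s if is_var(s) else s[0]

def replace(t, p, s):
    """The term obtained from t by replacing the subterm at position p by s."""
    if not p:
        return s
    i = p[0]
    return t[:i] + (replace(t[i], p[1:], s),) + t[i + 1:]

def var_positions(t):
    return [p for p in positions(t) if is_var(subterm(t, p))]

# A sample term f(a, f(x, b)) chosen for this sketch.
SAMPLE = ("f", ("a",), ("f", "x", ("b",)))
```

With this encoding, `positions(SAMPLE)` lists five positions, and `var_positions(SAMPLE)` contains only `(2, 1)`, the position of `x`.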

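A (deterministic) tree automaton (over $\Sigma$), DTA for short, is a tuple $A = \langle Q, \Sigma, F, \delta\rangle$ where $Q$ is a finite set of states, $F \subseteq Q$ is the set of accepting states, $\Sigma$ is a signature, and $\delta$ is a set of transitions of the form $f(q_1,\ldots,q_m) \to q$, where $q_1,\ldots,q_m,q \in Q$, $f \in \Sigma^{(m)}$, and $m \ge 0$. Moreover, for each $f \in \Sigma^{(m)}$ and each $q_1,\ldots,q_m \in Q$ there exists at most one $q$ (and at least one if the automaton is complete) such that $f(q_1,\ldots,q_m) \to q$ is in $\delta$. The language recognized by $A$ is the set $L(A) = \{t \in T(\Sigma) \mid A(t) \in F\}$ where $A(t)$ is recursively defined as $A(t) = q$ if $t = f(t_1,\ldots,t_m)$, $f \in \Sigma^{(m)}$, $m \ge 0$, $f(q_1,\ldots,q_m) \to q$ is a transition in $\delta$, and, for each $i \in \{1,\ldots,m\}$, $A(t_i) = q_i$. Note that, when $A$ is not complete, $A(t)$ might be undefined. We also define, for $q \in Q$, the set $L(A,q) = \{t \in T(\Sigma) \mid A(t) = q\}$ of terms for which $A$ arrives at state $q$. Note that $L(A,q) \cap L(A,q') = \emptyset$ for all $q \neq q'$. We also extend $A$ to terms in $T(\Sigma \cup Q)$ by assuming that the states have arity $0$ and $A(q) = q$ for each $q \in Q$. A set of terms $L$ is regular if there exists a DTA $A$ such that $L(A) = L$. The size of a transition is and the size of is .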
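As an illustration of how a DTA evaluates a term bottom-up, the following minimal sketch stores the transitions in a dictionary keyed by a symbol and the tuple of child states. The encoding (terms as nested tuples, the function names, and the sample parity automaton) is an assumption of this sketch rather than notation from the paper.

```python
def run(delta, t):
    """State reached on the ground term t, or None if the run is undefined."""
    child_states = tuple(run(delta, c) for c in t[1:])
    if None in child_states:
        return None
    return delta.get((t[0], child_states))

def accepts(delta, accepting, t):
    """True if the automaton reaches an accepting state on t."""
    return run(delta, t) in accepting

# Sample complete DTA over {a, b, f}: the state records the parity of the
# number of occurrences of a; terms with an even number are accepted.
PARITY = {
    ("a", ()): "odd",
    ("b", ()): "even",
    ("f", ("even", "even")): "even",
    ("f", ("even", "odd")): "odd",
    ("f", ("odd", "even")): "odd",
    ("f", ("odd", "odd")): "even",
}
ACCEPTING = {"even"}
```

Because the automaton is deterministic, `run` follows at most one transition per node, so the evaluation takes time linear in the size of the term. For an incomplete automaton the result is `None` whenever a required transition is missing.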

Given a DTA, it is decidable whether its recognized language is (i) empty, (ii) finite, or (iii) has cardinality , for a given . The corresponding constructions all run in polynomial time and are straightforward generalizations of the ones for classical finite (word) automata; proofs can be found in Theorems 1.7.4, 1.7.6, and 1.7.10 of [2]. The following computational problems, together with the running times, are a consequence of the same proofs.

###### Lemma 2.1

Let be a DTA and a natural number. Each of the following sets can be computed in polynomial time: in , in , and in .

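Sets of Terms with Regular Constraints. Let $V \subseteq X$ be a finite set of variables and $\Sigma$ a signature. A regular constraint (over $V$ and $\Sigma$) is a mapping $M$ that associates to every $x \in V$ a DTA $M(x)$ over $\Sigma$. A solution of $M$ is a mapping $\sigma : V \to T(\Sigma)$ such that, for each $x \in V$, $\sigma(x) \in L(M(x))$. A set of terms with regular constraints (over $V$ and $\Sigma$) is a pair $(S, M)$ where $S$ is a finite subset of $T(\Sigma, V)$ and $M$ is a regular constraint over $V$ and $\Sigma$. The language of $(S, M)$ is defined as $L(S,M) = \{\sigma(t) \mid t \in S \text{ and } \sigma \text{ is a solution of } M\}$. A term in $L(S,M)$ is also called an instance of $(S, M)$.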
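Membership of a ground term in the language of a set of terms with regular constraints can be tested by matching followed by automaton runs. The sketch below assumes the tuple encoding of terms (variables are strings), represents each constraint as a hypothetical pair `(delta, accepting)`, and assumes that every constraint language is non-empty, so that variables not occurring in the matched pattern can always be given some value.

```python
def run(delta, t):
    """State reached on the ground term t, or None if the run is undefined."""
    kids = tuple(run(delta, c) for c in t[1:])
    return None if None in kids else delta.get((t[0], kids))

def match(pattern, term, sigma):
    """Extend sigma so that sigma(pattern) == term; return None on failure."""
    if isinstance(pattern, str):                      # a variable
        if pattern in sigma:
            # A repeated variable must be bound to the same term again.
            return sigma if sigma[pattern] == term else None
        return {**sigma, pattern: term}
    if pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p_child, t_child in zip(pattern[1:], term[1:]):
        sigma = match(p_child, t_child, sigma)
        if sigma is None:
            return None
    return sigma

def is_instance(S, M, term):
    """True if term matches some pattern in S under the constraints in M."""
    for pattern in S:
        sigma = match(pattern, term, {})
        if sigma is not None and all(
            run(M[x][0], s) in M[x][1] for x, s in sigma.items()
        ):
            return True
    return False

# Sample constraints: ALL accepts every term over {a, f}; ONLY_A accepts just a.
ALL = ({("a", ()): "q", ("f", ("q", "q")): "q"}, {"q"})
ONLY_A = ({("a", ()): "q"}, {"q"})
```

For the non-linear pattern `f(x, x)`, the test succeeds exactly on terms whose two children are equal and satisfy the constraint on `x`.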

The following result is due to [11], cf. also [10].

###### Proposition 2.2

Let $V \subseteq X$ be a finite set of variables, $S$ a finite subset of $T(\Sigma, V)$, and $M$ the regular constraint that maps every $x \in V$ to the trivial DTA that recognizes $T(\Sigma)$. Regularity of $L(S,M)$ is coNP-complete.

When analyzing complexity, with $\|S\|$ we refer to the sum of sizes of all terms in $S$, and with $\|M\|$ we refer to the sum of sizes of all DTA in the image of $M$. With $|S|$ and $|M|$ we refer, as usual, to the number of elements in the sets $S$ and $M$ (i.e., the number of pairs of the set defining the mapping $M$). We also make the following assumption in order to ease the complexity analysis.

Assumption: The maximum arity of a function symbol in $\Sigma$ is $2$. It is well known that any arbitrary tree can be coded as a binary tree of essentially the same size. Usual such codings (such as the one taking first-child to left-child and next-sibling to right-child) preserve regularity of sets of terms (see, e.g., Section 8.3.1 in [2]); moreover, it can be seen easily that the transformation of the regular constraints into this new binary signature produces an at most quadratic size increase.

## 3 Regularity of the instances of a set of terms with regular constraints

Let $(S, M)$ be a set of terms with regular constraints. The “regularity of the instances of a set of terms with regular constraints problem”, RITRC for short, asks whether or not the set $L(S,M)$ is regular. We know, by Proposition 2.2, that RITRC is coNP-complete in the particular case that $M$ maps each variable to a DTA that accepts all terms. In general, i.e., with regular constraints, decidability of RITRC was proved in [7]; however, the complexity remained open. The algorithm of [7] does not run in exponential time, and in fact it has a far worse complexity. In this section we show that RITRC is EXPTIME-complete. We start with the easy part by showing that RITRC is EXPTIME-hard.

###### Theorem 3.1

RITRC is EXPTIME-hard.

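Proof. Let $\Sigma$ be a signature with $f \in \Sigma^{(2)}$ and let $A_1,\ldots,A_n$ be DTAs over $\Sigma$. It is well known that testing whether $L(A_1) \cap \cdots \cap L(A_n) = \emptyset$ is EXPTIME-complete, cf. Theorem 1.7.5 of [2]. It follows that “universality of union”, i.e., testing whether $L(A_1) \cup \cdots \cup L(A_n) = T(\Sigma)$, is EXPTIME-complete. This is because a DTA can easily be complemented in polynomial time (first complete the DTA by adding, for any missing transition, a transition to a new “sink” state; second, change $F$ into $Q \setminus F$). We now reduce universality of union to RITRC. Let $A$ be any fixed DTA that recognizes $T(\Sigma)$ and let $V = \{x, y, x_1,\ldots,x_n, x'_1,\ldots,x'_n\}$. The set of terms with regular constraints $(S, M)$, where

$$S = \{f(f(x,x),y),\ f(x'_1,x_1),\ \ldots,\ f(x'_n,x_n)\}$$

$$M = \{x_1 \mapsto A_1, \ldots, x_n \mapsto A_n,\ x \mapsto A,\ y \mapsto A,\ x'_1 \mapsto A, \ldots, x'_n \mapsto A\},$$

is regular if and only if $L(A_1) \cup \cdots \cup L(A_n) = T(\Sigma)$. To see this, consider first the case where $L(A_1) \cup \cdots \cup L(A_n) = T(\Sigma)$. Then $L(S,M) = \{f(t_1,t_2) \mid t_1,t_2 \in T(\Sigma)\}$, which is regular. In the other case, let $t$ be in $T(\Sigma) \setminus (L(A_1) \cup \cdots \cup L(A_n))$. Intersect $L(S,M)$ with the regular set $\{f(f(t_1,t_2),t) \mid t_1,t_2 \in T(\Sigma)\}$. Since regular term languages are closed under intersection, the resulting set would be regular if $L(S,M)$ was; but the resulting intersection is $\{f(f(s,s),t) \mid s \in T(\Sigma)\}$. By standard pumping arguments (see, e.g., Example 1.2.1 of [2]) this set is not regular. Thus, $L(S,M)$ is not regular in this case.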
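The complementation step used in the proof (complete the DTA with a sink state, then swap accepting and non-accepting states) can be sketched as follows. The dictionary encoding of transitions, the `signature` map from symbols to arities, and the sink name `"sink"` (assumed not to clash with an existing state) are choices made for this illustration only.

```python
from itertools import product

def run(delta, t):
    """State reached on the ground term t, or None if the run is undefined."""
    kids = tuple(run(delta, c) for c in t[1:])
    return None if None in kids else delta.get((t[0], kids))

def complete(states, signature, delta, sink="sink"):
    """Add a sink state and route every missing transition to it."""
    all_states = set(states) | {sink}
    full = dict(delta)
    for f, arity in signature.items():
        for args in product(sorted(all_states), repeat=arity):
            full.setdefault((f, args), sink)
    return all_states, full

def complement(states, signature, accepting, delta):
    """DTA for the complement language: complete, then flip acceptance."""
    all_states, full = complete(states, signature, delta)
    return all_states, all_states - set(accepting), full

# Sample: an incomplete DTA over {a, b, f} accepting only the constant a.
SIGNATURE = {"a": 0, "b": 0, "f": 2}
STATES, CO_ACCEPTING, CO_DELTA = complement(
    {"qa"}, SIGNATURE, {"qa"}, {("a", ()): "qa"}
)
```

With maximum arity 2, completion adds at most a quadratic number of transitions per symbol, which is consistent with the polynomial bound claimed in the proof.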

Proving that RITRC is in EXPTIME is considerably more complicated.

### 3.1 Overview of our algorithm for RITRC

Algorithm in [7]. In [7] decidability of RITRC was proved. We first explain the idea of that proof, and why it does not give rise to an EXPTIME algorithm. Then we give an overview of the algorithm presented in this paper. The following is the basic property used for deciding RITRC in [7] (and here).

###### Definition 3.2

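Let $(S, M)$ be a set of terms with regular constraints. The term $t \in S$ satisfies the infinite-instances property in $(S, M)$ if some variable $x$ has multiple occurrences in $t$, and there exist infinitely many instances $\sigma_1(t), \sigma_2(t), \ldots$ of $(\{t\}, M)$ which are not instances of $(S \setminus \{t\}, M)$ and all of them different on $x$, i.e., $\sigma_i(x) \neq \sigma_j(x)$ for all $i \neq j$.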

In [7] it was shown that the infinite-instances property is decidable and that it implies non-regularity of $L(S,M)$. To decide RITRC, the algorithm of [7] first looks for a term $t$ in $S$ with multiple occurrences of some variable $x$ satisfying $|L(M(x))| = \infty$. If no such term exists, then it stops concluding regularity of $L(S,M)$ (note that in this case $L(\{t\},M)$ is regular for each term $t$ in $S$, and regular sets are closed under union). Otherwise, it checks the infinite-instances property of $t$ in $(S, M)$. In the affirmative case, it stops concluding non-regularity of $L(S,M)$. In the negative case, there are only a finite number of possible instantiations of each duplicated variable in $t$ providing a term in $L(\{t\},M)$ and not in $L(S \setminus \{t\},M)$. Thus, by replacing $t$ by a finite number of instantiations of $t$, the represented language is preserved, and we obtain fewer duplicated variables. The algorithm in [7] decides regularity of $L(S,M)$ by iterating this process.

Estimating the complexity. To determine the complexity of the previous algorithm, we need to know how large the number of instantiations of $t$ is, how large the terms are, and, of course, how expensive it is to decide the infinite-instances property. In [7], the latter is solved through a result of [3] about first-order formulas with regular constraints. The precise complexity of this result of [3] is not known, but it is expected to be higher than that of solving the infinite-instances property, since it solves a more general problem. We therefore devise our own algorithm for checking this property. However, the sum of sizes of the terms also poses a problem, as it can grow iterated exponentially, so the algorithm in [7] is certainly not in EXPTIME. One of the ideas of our new algorithm is hence not to replace $t$ by its instantiations. Instead, we are able to find a “small” bound (which depends on $S$ and $M$) such that all these terms are guaranteed to be of height smaller than this bound. To take advantage of this fact, we add a new kind of constraint to $M$ which allows duplicated variables of $t$ to be replaced only by “small” terms. The algorithm then continues with this new system (called restricted regular constraints, see Definition 3.7), which has regular constraints plus height constraints on the variables.

Infinite-instances algorithm. How do we check the infinite-instances property of $t$ in $(S, M)$? In Section 3.4 and subsequent sections we give an algorithm that solves this problem under several assumptions. To begin with, we require that the term $t$ is determined (see Definition 3.11 for the precise notion) in all the non-variable positions of terms in $S \setminus \{t\}$. We also assume that the regular constraint is given by a single DTA $A$ (instead of the multiple ones in the image of $M$), and a mapping that associates variables with states of $A$. Finally, we require this DTA to satisfy the $1$-or-$k$ property of Definition 3.5, which says that for any state $q$ of $A$, the cardinality of the set of terms on which $A$ reaches $q$ is either $1$, or it is greater than or equal to $k$. The reason for these assumptions is as follows. In order to decide the infinite-instances property, we compute a formula whose solutions are the instances of $(\{t\}, M)$ that are not instances of $(S \setminus \{t\}, M)$. This formula is a disjunction of conjunctions of inequalities, where the number of inequalities in each conjunction is bounded. After some transformations on the formula by means of a system of inference rules, the variables whose associated state of $A$ is reached by exactly one term disappear. Thanks to the $1$-or-$k$ property, the remaining variables in the formula have at least $k$ possible instantiations. This fact is used to show that, for any surviving conjunction in the formula, there is a variable instantiation that makes true all the inequalities it is composed of, and variables with infinite language have infinite choices. Hence, we obtain that $t$ satisfies the infinite-instances property in $(S, M)$ if the transformed formula is not empty.

Overview of the algorithm. We give an outline of the EXPTIME algorithm that solves RITRC for a given instance . First of all, we transform into by preserving the represented language, where is a single regular constraint (Definition LABEL:def:singleregconstr), and is the adaptation of from to . Intuitively, is the same problem stated with a single -or- DTA; the sizes of both and can be exponential with respect to the sizes of and . This transformation is described in Section LABEL:subsec:singleautomaton. The single regular constraint is then converted to a restricted regular constraint , the new type of constraint, which we introduce in Section LABEL:subsec:heightconstraints, that takes account of height restrictions.

The algorithm then proceeds as follows. At each step it picks a term of without height constraints, and with multiple occurrences of some variable satisfying . If no term of this kind exists, then it stops concluding regularity of . Otherwise, it chooses a term satisfying the above conditions, and checks the infinite-instances property of with respect to . To do so, the algorithm loops over all possible partial instantiations of in the non-variable positions of , and for each , it finds a subset , with , such that has the infinite-instances property for if and only if it has the property for . The fact that is small allows to check the infinite-instances property in exponential time. In the affirmative case the algorithm stops concluding non-regularity of . If no determination satisfies the infinite-instances property, the restricted regular constraint is modified so as to impose height constraints on the variables of with multiple occurrences. Since the number of terms with duplicated variables and without height constraints decreases, the iteration of this process decides regularity of . A careful analysis of all the steps involved will show that the time complexity is exponential.

### 3.2 Simplification to a single DTA

Recall from the preliminaries that we assume to be a fixed but arbitrary signature containing no symbol of arity greater than . We start with a set of terms with regular constraints over a finite set of variables . Recall that is a finite set of terms and is a function that maps each to a DTA over . We now adapt this definition to a setting with only one single DTA , and where variables in are now mapped to states in . Moreover, we do not need accepting states anymore and simply drop them from ’s definition (a “DTA without accepting states”).

###### Definition 3.3

A single regular constraint (over and ) is a pair , where is a complete DTA without accepting states and is a mapping . The size of is . A solution of is a mapping such that, for each , it holds that . A set of terms with single regular constraints (over and ) is a pair , where is a finite subset of and is a single regular constraint over and . The language of is defined as . A term in is also called an instance of .

Transforming a set of terms with regular constraints into a set of terms with single regular constraints that represents the same language is rather easy by considering the product automaton of the DTAs in the image of the regular constraint. But the size of the result can be exponential in the size of the input. Moreover, it follows from Proposition 2.2 that deciding regularity of the represented language is at least coNP-hard. Hence, it is not enough to have an EXPSPACE-reduction from one problem to the other if we want to obtain an EXPTIME algorithm for the initial problem.

Thus, in the translation from $(S, M)$ into the new set of terms with single regular constraints we keep in mind some additional properties obtained by the transformation process. For instance, the terms in the new set are very similar to those in $S$ because they are obtained through variable renamings; we call this “structural similarity”. Moreover, as mentioned in the outline of Section 3.1, we want the DTA to have the “$1$-or-$k$” property, with $k = |S|$. We proceed to define both properties.

###### Definition 3.4

Let be sets of variables. A total function is a variable renaming if it is injective, i.e., for . For a term , is the term obtained from by replacing in each variable by . Two terms and are structurally similar, denoted by , if for a variable renaming . For a set of terms , is the maximum number of non-structurally similar terms in , i.e., . Given a single regular constraint we say that two terms and are structurally equal (with respect to ) if they are structurally similar, and for all .
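Reading Definition 3.4 as saying that two terms are structurally similar when one is obtained from the other by an injective variable renaming, the relation can be tested by renaming variables canonically in order of first occurrence. This is a minimal sketch under the tuple encoding of terms (variables are strings); the function names and the canonical names `v0, v1, ...` are hypothetical.

```python
def canonical(t, names=None):
    """Rename the variables of t to v0, v1, ... in order of first occurrence."""
    if names is None:
        names = {}
    if isinstance(t, str):                       # a variable
        return names.setdefault(t, f"v{len(names)}")
    return (t[0],) + tuple(canonical(c, names) for c in t[1:])

def structurally_similar(s, t):
    """True if some injective variable renaming maps t to s."""
    return canonical(s) == canonical(t)
```

Two terms share a canonical form exactly when a bijection between their variables maps one to the other, so `f(x, x)` and `f(u, v)` are told apart while `f(x, y)` and `f(u, v)` are identified.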

Note that if and are structurally equal with respect to , then ; the converse does not necessarily hold.

###### Definition 3.5

Let $A = \langle Q, \Sigma, \delta\rangle$ be a DTA. Let $k$ be a natural number. We say that $A$ is a $1$-or-$k$ DTA if each state $q$ in $Q$ satisfies either $|L(A,q)| = 1$ or $|L(A,q)| \ge k$.

###### Lemma 3.6

Let be a set of terms with regular constraints. Then, can be transformed in exponential time into a set of terms with single regular constraints such that and the following properties hold.

• satisfies that is a -or- DTA.

• is complete and satisfies that and

• Each term in is structurally similar to some term in . In particular, .

• Every two distinct terms are not structurally equal with respect to .

• Each two distinct terms do not share variables.

Proof. Let and for . We first complete each DTA to a new DTA by adding a sink state and all undefined transitions to it. Recall the assumption that the maximum arity of is . Thus, and . We now construct the product automaton (without accepting states) , i.e., we set and if, for each , has the transition , then we add the transition to . Since each state of is a tuple of states of the automata in plus a sink state, .

We then transform $A'$ into a $1$-or-$k$ DTA. To this end, we compute the mapping $M' : Q' \to \{1,\ldots,|S|\}$ with $M'(q) = \min(|L(A',q)|, |S|)$, according to Lemma 2.1. Now, using $M'$ we obtain the desired $A$ as output of the following algorithm.

    Input: A′ = ⟨Q′, Σ, δ′⟩ and M′ : Q′ → {1,…,|S|}.
    Q := {q^1 ∣ q ∈ Q′ ∧ M′(q) = |S|} ∪ {q^i ∣ q ∈ Q′ ∧ 1 ≤ i ≤ M′(q) < |S|}.
    δ := ∅.
    For each q in Q′ do:
        If M′(q) = |S| then:
            For each f(q_1,…,q_m) → q in δ′ do:
                For each i_1,…,i_m with q_1^{i_1},…,q_m^{i_m} ∈ Q do:
                    Add f(q_1^{i_1},…,q_m^{i_m}) → q^1 to δ.
        else:
            Let l_1 → q, …, l_k → q be all transitions of δ′ with q as right-hand side.
            counter := 1.
            For each i in {1,…,k} do:
                Let f(q_1,…,q_m) → q be l_i → q.
                For each i_1,…,i_m with q_1^{i_1},…,q_m^{i_m} ∈ Q do:
                    Add f(q_1^{i_1},…,q_m^{i_m}) → q^{counter} to δ.
                    counter++.
    Complete A = ⟨Q, Σ, δ⟩ and return the result.
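The splitting construction can be prototyped in a few lines. The sketch below is an illustration under several assumptions: transitions are stored in a dictionary keyed by a symbol and a tuple of child states, language sizes are computed by a capped fixpoint iteration (standing in for the polynomial-time procedure the paper cites), "large" states keep their original name instead of receiving index 1, and the final completion step is omitted. All function names are hypothetical.

```python
from itertools import product

def capped_counts(states, delta, k):
    """For each state q, the number of terms reaching q, capped at k."""
    count = {q: 0 for q in states}
    changed = True
    while changed:
        new = {q: 0 for q in states}
        for (f, args), q in delta.items():
            n = 1
            for a in args:
                n *= count[a]
            new[q] = min(k, new[q] + n)
        changed = new != count
        count = new
    return count

def split_small_states(states, delta, k):
    """Split every state reached by fewer than k terms into one copy per term."""
    count = capped_counts(states, delta, k)

    def versions(q):
        # Large states keep their name; small ones become (q, 1), (q, 2), ...
        if count[q] == k:
            return [q]
        return [(q, i) for i in range(1, count[q] + 1)]

    new_delta = {}
    counter = {q: 0 for q in states}
    for (f, args), q in delta.items():
        for choice in product(*(versions(a) for a in args)):
            if count[q] == k:
                target = q
            else:
                counter[q] += 1          # a fresh copy for each distinct term
                target = (q, counter[q])
            new_delta[(f, choice)] = target
    new_states = [v for q in states for v in versions(q)]
    return new_states, new_delta

# Sample DTA: c is reached by a and b only, p by the four terms f(c, c),
# and u by infinitely many terms g(...g(c)...).
DELTA = {
    ("a", ()): "c", ("b", ()): "c",
    ("f", ("c", "c")): "p",
    ("g", ("c",)): "u", ("g", ("u",)): "u",
}
NEW_STATES, NEW_DELTA = split_small_states(["c", "p", "u"], DELTA, 3)
```

With `k = 3`, only `c` is small, so it is split into `("c", 1)` and `("c", 2)`; every state of the resulting automaton is then reached by exactly one term or by at least three.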

It is clear that this algorithm generates a complete -or- DTA with , because at most new states are created for every state in . Moreover, since the maximum arity of is 2, then at most transitions are possible with such number of states. The construction runs in exponential time because is constructed in exponential time, is constructed in time polynomial in by Lemma LABEL:lem:DTA_folklore, and is constructed in time .

Now, the set is obtained in the following way. Recall that the states in are in fact of the form , i.e., are tuples of states plus an index satisfying . For each variable in the domain of , we define the set of variables . We define the domain of the mapping as , and the image of each by as . Finally, let be the set of substitutions over satisfying . We compute as a minimal set satisfying that each one of its terms is structurally equal to some term in , and vice-versa (i.e.  is computed from by removing repetitions modulo structural equality). Moreover, we force the terms in to do not share variables, by renaming them in , and defining them in and whenever it is necessary. Obviously, each term in is structurally similar to some term in , and any two distinct terms in are not structurally equal. Each has at most variables. Thus, has at most substitutions, and hence . Generating consists of considering all of such combinations of a term in and a substitution in . Thus, the time complexity for creating from and is proportional to its size, i.e., is in . In total, is constructed in exponential time w.r.t. .

Let by the set of terms with single regular constraints that was obtained from according to Lemma LABEL:lemma-transformationtosingle. Our algorithm proceeds by considering a term in , and analyzing the kind of instances which are in but not in . Depending on this analysis, it either concludes non-regularity of , or deduces that the height of the substitutions for some variables of can be bounded by , where is the maximum height of the terms in . To manage this height constraint, we extend the notion of single regular constraint as follows.

###### Definition 3.7

A restricted regular constraint (over ) is a tuple , where are sets of variables, is a DTA, is a mapping , and is a natural number. The size of is . A solution of is a mapping such that for all it holds , and moreover, if then . For a finite set , the pair is a set of terms with restricted regular constraints. The language of is . A term in is also called an instance of .

Obviously, the set of terms with single regular constraints can be transformed into the set of terms with restricted regular constraints , and the represented language is preserved, i.e. . For a restricted regular constraint , we can define the infinite-instances property analogously to Definition LABEL:def:infinst, where it is defined for a set of terms with regular constraints. As mentioned before, when a term in satisfies the infinite-instances property, then is not regular [7]. Exactly the same thing, with the same proof, can be said about a set of terms with restricted regular constraints .

###### Lemma 3.8

Let be a set of terms with restricted regular constraints. Let be a term satisfying the infinite-instances property in . Then, is not regular.

In order to make the paper self-contained, we prove this result. The proof is simplified and adapted to the case of restricted regular constraints.

Proof. We prove the lemma by contradiction, i.e. we assume that there exists DTA recognizing in order to reach a contradiction.

By the assumptions, there exists a variable with more than one occurrence in , and infinite instances of which are not instances of , and satisfying for all .

Let be , let be , and let be the maximum height of the terms in . Let be one of the positions in where occurs.

Since the instances are not in and are different on , there is a solution ( for some ) of satisfying that is not an instance of and . Let be a position such that is a position of , and . By a simple pumping argument, there exist positions and satisfying that is a position of , , and .

Let be . Let be the context . We consider the term . Note that is accepted by . Thus, in order to reach a contradiction, it suffices to see that is not an instance of . It is clearly not an instance of , since we have the term as a subterm in at a position of in , and the term as a subterm in at another position of in . Thus, it rests to see that is not an instance of for each in .

For each term in , we know that the term is not an instance of , and this has to be due to one of the following reasons:

(a) There is a position in satisfying that is not in ,

(b) There is a position in satisfying ,

(c) There is a position in satisfying ,

(d) There are positions and in satisfying and .

(e) There is a position in satisfying and .

In cases (a), (b), (c) and (e) it is straightforward that is not an instance of by the same reason. Thus, assume we are in case (d). If both and are disjoint with , then , and hence, is not an instance of . If one of or , say , is a prefix of , then, also holds, because . Therefore, is not an instance of in any case, and this concludes the proof.

For the particular case of a singleton set of terms, Lemma 3.8 implies the following statement.

###### Corollary 3.9

Let be a set of terms with restricted regular constraints. Then, is regular if and only if for each variable occurring at least twice in , either or .

The previous corollary naturally leads to the following definition of regular term.

###### Definition 3.10

Let be a restricted regular constraint. A term is regular with respect to if for each variable occurring at least twice in , either or .

### 3.4 Determining a term

At this point, we want to test whether a term satisfies the infinite-instances property with respect to , that is, we want to analyze the instances of which are not instances of . To make this problem easier, it would be good to have determined at all non-variable positions of the terms in , according to the following definition.

###### Definition 3.11

For a position $p$ and a term $t$, we say that $t$ is determined at $p$ if either $p$ is a non-variable position of $t$ or there is a prefix $p'$ of $p$ such that $t(p')$ is a constant symbol, i.e., it is in $\Sigma^{(0)}$. The term $t$ is determined at a set of positions $P$ if it is determined at each $p \in P$.

One of the nice (and obvious) properties of determined positions $p$ of $t$ is that, for any substitution $\sigma$ mapping variables to terms, the symbol $\sigma(t)(p)$ is either undefined or coincides with $t(p)$.
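The notion of a determined position is easy to check by walking down the term along the position. The sketch assumes the tuple encoding of terms (variables are strings, constants are one-element tuples), positions as tuples of 1-based child indices, and child indices that stay within the arity of the symbols they pass through; the function name is hypothetical.

```python
def is_determined_at(t, p):
    """True if t is determined at position p.

    Walking down t along p, the position is determined if a constant is met
    on the way (no instance of t can then contain p) or if p is reached and
    carries a function symbol or constant rather than a variable.
    """
    node = t
    for i in p:
        if isinstance(node, str):     # a variable lies strictly above p
            return False
        if len(node) == 1:            # a constant is a prefix of p
            return True
        if i >= len(node):            # index outside the arity (assumed absent)
            return False
        node = node[i]
    return not isinstance(node, str)  # p itself must not be a variable

# Sample term f(a, f(x, b)) chosen for this sketch.
SAMPLE = ("f", ("a",), ("f", "x", ("b",)))
```

In the sample term, position `(1, 1)` is determined because the constant `a` sits above it, whereas `(2, 1)` and everything below it is not, since a substitution for `x` can place any symbol there.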

###### Lemma 3.12

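Let $p$ be a position and $t$ a term determined at $p$. Let $\sigma_1, \sigma_2$ be mappings from variables to $T(\Sigma)$. Either $p$ is not a position of both $\sigma_1(t)$ and $\sigma_2(t)$, or $\sigma_1(t)(p) = \sigma_2(t)(p)$.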

Proof. No prefix $p'$ of $p$ is such that $t(p')$ is a variable. Hence, for every substitution $\sigma$, we have that $\sigma(t)$ is undefined at $p$ if so was $t$, or that