
Bottom Up Quotients and Residuals for Tree Languages

Abstract

In this paper, we extend the notion of tree language quotients to bottom-up quotients. Instead of computing the residual of a tree language from top to bottom and producing a list of tree languages, we show how to compute a set of $k$-ary trees, where $k$ is an arbitrary integer. We define the quotient formula for different combinations of tree languages: union, symbol products, compositions, iterated symbol products and iterated composition. These computations lead to the definition of the bottom-up quotient tree automaton, which turns out to be the minimal deterministic tree automaton associated with a regular tree language in the case of $0$-ary trees.

1 Introduction

Tree languages are used in numerous application domains in computer science, e.g. the representation of XML documents. Regular tree languages are recognized by finite tree automata, well-studied objects for which many decision problems can be solved efficiently. Among these problems is the membership test, which determines whether a given tree belongs to a language. Tree languages, which are potentially infinite, can be finitely described by regular tree expressions. Consequently, converting an expression into an equivalent automaton is an important subject of research.

In the case of words (which can be seen as trees built over unary symbols), this has been an active subject of research for more than fifty years. One of the first conversion methods is the computation of the position automaton [Glu61], with a linear size and a quadratic construction time w.r.t. the number of occurrences of symbols in the expression. Three years later, Brzozowski proposed an alternative construction, the derivative automaton [Brzo64], which is deterministic and may therefore be of exponential size. This construction is based on the operation of expression derivation, which implements the computation of language quotients over expressions. Slightly modifying this method by replacing expressions with sets of expressions, Antimirov constructed the derived term automaton [Ant96], which is linear-sized but not necessarily deterministic (notice that Champarnaud and Ziadi have shown in [JMCDZ01] that this automaton is identical to Mirkin’s prebase automaton [Mir66]).

Some of these methods have already been extended to tree expressions: the position tree automaton was introduced in [LOZ13], and the top-down partial derivative automaton in [KM11] (see [MSZ14b, MSZ14] for another version of the position tree automaton and its morphic links with other methods); both produce non-deterministic and linear-sized tree automata. As far as top-down deterministic tree automata are concerned, there exist regular languages that cannot be recognized; therefore, the notion of (top-down) derivative cannot be well-defined, but this is not the case for bottom-up tree automata. A first step toward the computation of tree derivatives has already been achieved in [CGLTT03, Lev81], defining the bottom-up quotient of trees, that is a set of unary trees.

In this paper, we extend the notion of bottom-up quotients to trees of any arity. Moreover, we present computation formulae for several combinations of tree languages. Finally, using our quotient definition, we present an alternative construction of the minimal bottom-up tree automaton of a regular tree language via the bottom-up quotient automaton (isomorphic to the one defined in [CGLTT03]).

2 Preliminaries

See [tata] for a comprehensive presentation of trees, tree languages and tree automata.

A graded alphabet $\Sigma = \bigcup_{k \in \mathbb{N}} \Sigma_k$ is a finite set of symbols, with $\Sigma_k$ a set of symbols of arity $k$. A tree $t$ over $\Sigma$ is inductively defined by $t = f(t_1, \ldots, t_k)$, with $k$ any integer, $f$ any symbol in $\Sigma_k$ and $t_1$, $\ldots$, $t_k$ any trees over $\Sigma$. The set of the trees over $\Sigma$ is denoted by $T(\Sigma)$. In the following, the notion of tree is extended by considering $k$-ary trees, that is, trees in which $k$ leaves are missing. As an example, while is a $0$-ary tree, the tree is unary and is ternary. Given an integer $k$, $T(\Sigma)_k$ denotes the set of the $k$-ary trees over the graded alphabet $\Sigma$.

The composition of trees is the operation from to defined for any trees , , , , with , denoted by , as the action of grafting to the -th missing leaf in . Notice that endows with a structure of operad, with as an identity unary element. As an example, . To improve readability, the identity unary tree can be denoted by (e.g. ); thus, for any -ary tree , .
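The grafting described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: trees are nested tuples, the string `"eps"` marks a missing leaf, missing leaves are counted from left to right, and the names `compose` and `count_holes` are hypothetical.

```python
HOLE = "eps"  # assumed marker for a missing leaf


def count_holes(t):
    """Number of missing leaves of t."""
    return 1 if t == HOLE else sum(count_holes(c) for c in t[1:])


def compose(t, args):
    """Graft args[j-1] onto the j-th missing leaf of t, counted left to right."""
    args = list(args)
    if count_holes(t) != len(args):
        raise ValueError("one argument per missing leaf is required")
    remaining = iter(args)

    def graft(u):
        if u == HOLE:
            return next(remaining)
        # Children are rebuilt left to right, so arguments are consumed in order.
        return (u[0],) + tuple(graft(c) for c in u[1:])

    return graft(t)


context = ("f", HOLE, ("g", HOLE))
print(compose(context, [("a",), ("b",)]))  # ('f', ('a',), ('g', ('b',)))
```

In this sketch the one-node tree `HOLE` behaves as an identity: `compose(HOLE, [t])` returns `t`, and composing a tree with as many copies of `HOLE` as it has missing leaves returns the tree unchanged.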

The symbol in a -ary tree can be replaced by occurrences of distinct symbols , , , where , , are any integers in . For a -ary tree , we denote by the set of -indices. This finite and naturally ordered subset of contains the indices of tree symbols appearing in . Thus, a -ary tree with as -indices set is inductively defined by with an integer and , or with a symbol in and for , a -ary tree of -indices , such that for , and . In this case, the composition substitutes to in , where . Notice that any -ary tree satisfies , that the occurrences are not necessarily indexed w.r.t. their order of appearance in , i.e. and that the empty trees are not identity elements anymore, since they may change the index of an empty tree.

The composition is inductively defined as follows: For any -ary tree with , for any trees , , , it holds:

(1)

A tree language over is a subset of . A tree language is homogeneous if all the trees it contains admit the same arity with the same set of -indices, and -homogeneous if it only contains -ary trees with the same set of -indices. In this case, we denote by this set. The set of the tree languages over is denoted by , and the set of the -homogeneous tree languages by for some integer . Notice that the union of two -homogeneous tree languages of -indices is a -homogeneous tree language of -indices .

The composition is extended to an operation from to : for any language in , for any languages , , in such that for any , . Notice that if is -homogeneous for any integer , then is -homogeneous. Given a -homogeneous tree language of -index and an integer , the iterated composition is recursively defined by , . The composition closure of is the language . Notice that is -homogeneous of -index .

Let be a symbol in , be a tree language in and be a tree in . The tree substitution of by in , denoted by , is the tree language inductively defined by if ; if ; if with and any trees over . The -product of two tree languages and over , with in , is the tree language defined by . Notice that if is -homogeneous of -index , since is -homogeneous, then is -homogeneous of -index . The iterated -product of a -homogeneous tree language over is the tree language recursively defined by: , . The -closure of the tree language is the language defined by . Notice that since is -homogeneous, is -homogeneous.
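For finite languages, the tree substitution and the symbol product can be sketched in Python as follows. The sketch assumes trees without missing leaves, encoded as nested tuples, and assumes that distinct occurrences of the substituted constant may be replaced by distinct trees of the language, as in the usual presentation of [tata]; the function names are hypothetical.

```python
from itertools import product


def substitute(t, c, lang):
    """All trees obtained from t by replacing each occurrence of the constant c
    by some tree of lang, independently for each occurrence."""
    if t == (c,):
        return set(lang)
    symbol, *children = t
    choices = [substitute(child, c, lang) for child in children]
    # For a constant other than c, product() yields one empty tuple,
    # so the result is the singleton {t}.
    return {(symbol, *combo) for combo in product(*choices)}


def symbol_product(lang1, c, lang2):
    """Union of the substitutions of c by lang2 in every tree of lang1."""
    return {u for t in lang1 for u in substitute(t, c, lang2)}


lang = {("a",), ("b",)}
print(len(substitute(("f", ("c",), ("c",)), "c", lang)))  # 4
```

The four trees of the example are f(a, a), f(a, b), f(b, a) and f(b, b), which illustrates that the two occurrences of the constant are replaced independently.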

The set of regular languages over is the smallest set containing any subset of that is closed under union, symbol products and closures. A -homogeneous tree language is said to be regular if it belongs to .

Notice that the composition and the composition closure can be reinterpreted in terms of symbol products and closures. However, this equivalence requires enlarging the alphabet and increasing the number of operations. Consequently, we use these operators as syntactic operations.

A tree automaton is a 4-tuple with a graded alphabet, a set of states, the set of final states, and the transition function from to . The domain of this function can be extended to as follows: for any symbol in , for any subsets , , of , . Finally, we denote by the function from to defined for any tree in by

A tree is accepted by if and only if . The language recognized by is the set of trees accepted by , i.e. . It can be shown [tata] that a tree language is recognized by some automaton if and only if it is regular. A state in is accessible if there exists a tree in such that . Consequently, if a state is not accessible, it can be removed, together with the transitions of which it is a source or a destination, without modifying the recognized language. A tree automaton is accessible if all of its states are accessible. A tree automaton is deterministic if for any symbol in , for any -tuple of states, . Hence, an accessible tree automaton is deterministic if and only if for any tree in , . For any tree automaton , there exists a deterministic tree automaton such that . The automaton can be computed from using a subset construction [tata, RS59]. The domain of the function is extended to as follows: for any tree in , for any state in ,
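The bottom-up evaluation of a tree by a (possibly non-deterministic) tree automaton, the acceptance test and the computation of the accessible states can be sketched in Python as follows. The automaton `DELTA`, its state names and the function names are hypothetical and only serve as an illustration; trees are assumed to have no missing leaf and are encoded as nested tuples.

```python
from itertools import product

# Hypothetical automaton over {f/2, a/0, b/0}: transitions map
# (symbol, tuple of child states) to a set of target states.
DELTA = {
    ("a", ()): {"qa"},
    ("b", ()): {"qb"},
    ("f", ("qa", "qb")): {"qf"},
    ("f", ("qa", "qa")): {"qf", "qa"},
}
FINAL = {"qf"}


def run(t, delta):
    """Set of states reached at the root of t by a bottom-up evaluation."""
    symbol, *children = t
    child_states = [run(child, delta) for child in children]
    reached = set()
    for combo in product(*child_states):
        reached |= delta.get((symbol, combo), set())
    return reached


def accepts(t, delta, final):
    """A tree is accepted when at least one reached state is final."""
    return bool(run(t, delta) & final)


def accessible_states(delta):
    """Least fixpoint: targets of transitions whose sources are all accessible."""
    reached, changed = set(), True
    while changed:
        changed = False
        for (_, sources), targets in delta.items():
            if set(sources) <= reached and not targets <= reached:
                reached |= targets
                changed = True
    return reached


print(accepts(("f", ("a",), ("b",)), DELTA, FINAL))  # True
```

In this sketch, an automaton is deterministic when every set of target states has at most one element, in which case `run` returns at most one state for every tree.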

3 Bottom-Up Quotients

In this section, we define the bottom-up quotient of a tree language w.r.t. a tree, that is an operation that deletes some internal nodes in trees. The remaining part of the tree is usually called a context in the literature [tata]; here, we call these objects $k$-ary trees, since we need to consider the parameter $k$. Consequently, we reinterpret classical results from the quotient point of view. Basically, the quotient is the dual operation of the composition: the quotient of a tree w.r.t. a tree is the operation producing some trees containing an occurrence of and such that substituting by in produces . As a direct consequence, since may appear in , the production of requires the -indices to be reindexed. In the following, we choose to increment these indices.

Example 1

Let be a tree over , with , , and . Let . Then . Indeed, it can be shown that for any in , .

Let us formalize the notion of quotient: Let be a -ary tree in and be a -ary tree in such that . Let , . Let . The quotient of w.r.t. is the -homogeneous tree language that contains all the trees satisfying the two following conditions:

(2)

As a direct consequence,

(3)
(4)

Definition 1

Let be a graded alphabet. Let be a tree language in and be a tree in . The bottom-up quotient of w.r.t. is the tree language .

As a direct consequence of Equation (4), the membership of a tree in a tree language can be restated in terms of quotients:

Proposition 1

Let $L$ be a tree language over a graded alphabet $\Sigma$ and $t$ be a tree over $\Sigma$. Then $t \in L \Leftrightarrow \varepsilon_1 \in t^{-1}(L)$.

3.1 Bottom-Up Quotient Inductive Formulas for Trees

Given a -ary tree and an integer , we denote by the substitution of any symbol by the symbol . Given a tree language and an integer , we denote by the tree language . As a direct property, it holds:

(5)

The inductive computation of the bottom-up quotient of a tree w.r.t. another tree can be performed using two basic computations: first, the bottom-up quotient of a tree w.r.t. an empty tree; second, the bottom-up quotient of a tree w.r.t. a symbol in . The bottom-up quotient of a tree w.r.t. a symbol of an alphabet can be inductively computed as follows: since the quotient is the inverse operation of the composition, computing the quotient of a tree w.r.t. a tree is in fact substituting an occurrence of in by (and increasing the -indices), where if and .

Proposition 2

Let be a graded alphabet, be an integer, and be a symbol in . Then:

with an integer, a symbol in , any trees in and for all , .

According to the definition of the bottom-up quotient (Equation (2)) and from Definition 1, quotienting by an indexed is a reindexing of all the indexed .

Proposition 3

Let be a graded alphabet. Let be a -homogeneous language with . Let be an integer. Then:

Finally, bottom-up quotienting a tree w.r.t. a -ary tree can be inductively performed as follows: first, the quotient of w.r.t. is computed, producing a set of trees in which the substitution of by produces . Then the quotient of w.r.t. is computed, producing a set of trees in which the substitution of by and of by produces . Eventually, the quotient of w.r.t. is computed, producing a set of trees in which the substitution of by , , and of by produces . Finally, the quotient of w.r.t. is computed, producing a set of trees in which the substitution of by produces a tree in which the substitution of by , , and of by produces ; therefore . Notice that dealing with implies that the indices have to be reindexed:

  • If contains an occurrence of an empty tree, then its index is increased times by , by the quotients; consequently, in order to quotient w.r.t. , if an occurrence of appears in , then the set resulting from quotienting by contains some trees with an occurrence of , which has to be reindexed into ;

  • If contains an empty tree, appearing in for example, then the set , containing the empty trees (if contains some occurrences of ) and the empty tree , must not be quotiented w.r.t. : if appears in , then its index has been increased, and therefore has to be considered for quotienting .

More formally, it can be shown that:

Proposition 4

Let be a graded alphabet. Let be a -ary tree in with a symbol in and a -tuple of trees in different from . Let be a tree in with . Let . Then:

The indexation of plays a fundamental role in our construction: it is necessary in order to satisfy the noncommutativity of the tree operad (i.e. ).

Example 2

Let us consider the tree . Then:

Consequently,

3.2 Bottom-Up Quotient Formulas for Languages Operations

Let us show now how to inductively compute the bottom-up quotient of a language w.r.t. a tree. As a direct consequence of Definition 1:

Lemma 1

Let $\Sigma$ be a graded alphabet. Let $t$ be a tree over $\Sigma$, and $L_1$ and $L_2$ be two languages over $\Sigma$. Then: $t^{-1}(L_1 \cup L_2) = t^{-1}(L_1) \cup t^{-1}(L_2)$.

Then, since the sum is distributive over the composition, as a direct consequence of Lemma 1 and of Proposition 4, it holds:

Corollary 1

Let be a graded alphabet. Let be a -ary tree such that is a symbol in and is a -tuple of trees in different from . Let be a -homogeneous tree language over with . Let . Then:

Following Corollary 1, it remains to show how to inductively compute the bottom-up quotient of a language w.r.t. a symbol in . In the following, we use the partial composition defined for any -ary tree (resp. -homogeneous language ) of -indices with , for any tree (resp. tree language ) by:

Let us first show how to quotient a language obtained via a symbol product from a tree. Computing a -product is basically replacing any occurrence of the symbol in a tree by a tree language . Hence, quotienting by a symbol is performed following these two conditions:

  • the occurrences of that have to be removed by quotienting may appear in . However, directly computing may produce a tree language containing trees with several occurrences of . Therefore, we have to remove first an occurrence of in , by computing , then considering the substitution of the other occurrences of by in , and composing the newly created in with the quotient of : ;

  • When , the occurrences of that have to be removed by quotienting may also appear in . In this case, an occurrence of has to be substituted by , and the occurrences of in are still replaced by : .

This is illustrated in the next lemma.

Lemma 2

Let be a graded alphabet. Let be a -ary tree in and be a -homogeneous language. Let be a symbol in and be a symbol in . Then:

Hence, as a direct consequence of Lemma 2, since :

Proposition 5

Let be a graded alphabet. Let be a -homogeneous language and be a -homogeneous language. Let be a symbol in and be a symbol in . Then:

Let us now explain how to quotient a tree obtained via the composition w.r.t. a -ary symbol . Composing a -ary tree , satisfying , with trees is the action of grafting these trees to at the positions where appear. Hence, the resulting tree can be viewed as a tree with an upper part containing and the lower parts containing exactly. Then, if appears in a lower tree , this tree has to be quotiented w.r.t. and the other trees are -incremented. Moreover, if some trees in are equal to , let us say , and if appears in , then has to be substituted by and the other lower trees such that , -incremented, since the inverse operations produce . More formally,

Lemma 3

Let be a graded alphabet. Let be a -ary tree with and be trees. Let be a symbol in . Then:

Therefore, since :

Proposition 6

Let be a graded alphabet. Let be a -homogeneous language with and be tree languages. Let be a symbol in . Then:

Finally, the two iterated operations are quotiented as follows.

The iterated composition can be quotiented by a -ary symbol with . If , since is obtained by applying an arbitrary number of times the composition, then quotienting w.r.t. is quotienting a tree in w.r.t. and grafting it to the language obtained by an arbitrary number of applications of the composition, that is . Equivalently, the occurrence of to remove appears in a lower part of a tree. However, when , the occurrence of to remove can appear anywhere: it can be located under an upper tree in but also above a lower tree in , that is when the tree to quotient belongs to and when the occurrence of to remove appears in . In this case, has to be quotiented w.r.t. , creating an occurrence of , and then the former unique -index of has to be incremented, in line with the definition of the bottom-up quotient. Hence:

Proposition 7

Let be a graded alphabet. Let be a -homogeneous language. Let be a symbol in . Then:

In the case of the iterated -product , two cases are considered when quotienting w.r.t. : when , then one occurrence of in a tree in has to be transformed into , whereas the others may still be substituted by . But when , the situation is more complex: as in the second case of the iterated composition, the occurrence to be removed may appear anywhere: it can be located under an upper tree in when it was substituted from an occurrence of , but also above a lower tree in , if it also contains an occurrence of . This may occur when the tree to quotient belongs to and when the occurrence of to remove appears in . In this case, has to be quotiented first w.r.t. in order to create a new occurrence of , where the quotient is grafted. Then a -product is added, since any occurrence of still may be substituted by . Consequently:

Proposition 8

Let be a graded alphabet. Let be a -homogeneous language. Let be a symbol in and be a symbol in . Then:

In the following section, we show how to make use of these quotients in order to compute the minimal tree DFA associated with a -homogeneous recognizable tree language.

4 The Bottom-Up Quotient Automaton

Let be a (not necessarily deterministic) tree automaton and be a state in . The top language of is . The down language of is the tree language . Hence, a state is accessible if and only if is not empty. The bottom-up quotient is related to the top language of a state as follows:

Proposition 9