Construction of rational expression from tree automata using a generalization of Arden’s Lemma

Abstract

Arden’s Lemma is a classical result in language theory that allows the computation of a rational expression denoting the language recognized by a finite string automaton. In this paper we generalize this important lemma to rational tree languages. We also propose a construction of a rational tree expression that denotes the tree language accepted by a finite tree automaton.

Keywords: Tree automata theory, Arden’s lemma, Rational expression.

1 Introduction

Trees are natural structures used in many fields of computer science, such as XML [XML1], indexing, natural language processing, code generation for compilers, term rewriting [tata2007], and cryptography [DBLP:conf/stacs/2004]. This widespread use motivates a study of the theoretical foundations of such a notion.

In many cases, the blow-up in the size of trees makes this large amount of data difficult to store and represent. Several solutions exist to overcome this problem. Among them is the use of tree automata and rational tree expressions as compact, finite structures that recognize and represent infinite sets of trees.

As a part of formal language theory, trees are considered a generalization of strings. Indeed, in the late 1960s [Brain97, magidor1969finite], many researchers generalized strings to trees, and many notions appeared, such as tree languages, tree automata, rational tree expressions, and tree grammars.

Since tree automata are convenient from an acceptance point of view and rational expressions from a descriptive one, the equivalence between the two representations needs to be established. Fortunately, Kleene’s result [TH68] states this equivalence between the languages accepted by tree automata and the languages denoted by rational expressions.

Kleene’s theorem proves that the set of languages denoted by rational expressions over a ranked alphabet and the set of languages recognized by tree automata over the same alphabet are equal. This can be checked by verifying the two inclusions between these sets. In other words, a tree language is recognized by some automaton if and only if it is denoted by some rational expression. Thus two constructions can be derived, one for each direction.

From a rational expression to a tree automaton, several techniques exist. First, Kuske and Meinecke [DBLP:journals/ita/KuskeM11] generalize the notion of partial derivation of languages [DBLP:journals/tcs/Antimirov96] from strings to trees and propose a tree equation automaton constructed from the derivation of a linearized version of rational expressions. They use the ZPC structure [DBLP:journals/ijac/ChamparnaudZ01] to reach the best complexity. After that, Mignot et al. [DBLP:journals/corr/MignotSZ14] propose an efficient algorithm to compute this generalized tree equation automaton. Next, Laugerotte et al. [DBLP:conf/lata/LaugerotteSZ13] generalize position automata to trees. Finally, the morphic links between these constructions have been defined in [AFL2014].

In this paper, we propose a construction for the second direction of Kleene’s theorem: the passage from a tree automaton to a rational tree expression. To this end, we propose a generalization of Arden’s Lemma from strings to trees. The complexity of such a construction is exponential.

Section 2 recalls some preliminaries and basic properties. We generalize the notion of equation system in Section 3. Next, the generalization of Arden’s lemma to trees and its proof are given in Section 4, leading to the computation of some solutions for particular recursive systems. Finally, we show how to compute a rational expression denoting the language recognized by a tree automaton in Section 5.

2 Preliminaries and Basic Properties

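Let $\Sigma = \bigcup_{n \geq 0} \Sigma_n$ be a graded alphabet. A tree $t$ over $\Sigma$ is inductively defined by $t = f(t_1, \ldots, t_n)$ with $f \in \Sigma_n$ and $t_1, \ldots, t_n$ any $n$ trees over $\Sigma$. We denote by $T(\Sigma)$ the set of trees over $\Sigma$. A tree language is a subset of $T(\Sigma)$. The subtrees set $\mathrm{St}(t)$ of a tree $t = f(t_1, \ldots, t_n)$ is defined by $\mathrm{St}(t) = \{t\} \cup \bigcup_{1 \leq k \leq n} \mathrm{St}(t_k)$. This set is extended to tree languages, and the subtrees set $\mathrm{St}(L)$ of a tree language $L \subset T(\Sigma)$ is $\mathrm{St}(L) = \bigcup_{t \in L} \mathrm{St}(t)$. The height of a tree $t$ in $T(\Sigma)$ is defined inductively by $\mathrm{Height}(f(t_1, \ldots, t_n)) = 1 + \max\{\mathrm{Height}(t_i) \mid 1 \leq i \leq n\}$ where $f$ is a symbol in $\Sigma_n$ and $t_1, \ldots, t_n$ are any $n$ trees over $\Sigma$.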

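A finite tree automaton (FTA) over $\Sigma$ is a $4$-tuple $A = (\Sigma, Q, Q_f, \Delta)$ where $Q$ is a finite set of states, $Q_f \subset Q$ is the set of final states and $\Delta \subset \bigcup_{n \geq 0} (Q \times \Sigma_n \times Q^n)$ is a finite set of transitions. The output of $A$, noted $\delta$, is a function from $T(\Sigma)$ to $2^Q$ inductively defined for any tree $t = f(t_1, \ldots, t_n)$ by $\delta(t) = \{q \in Q \mid \exists (q, f, q_1, \ldots, q_n) \in \Delta, \forall 1 \leq i \leq n, q_i \in \delta(t_i)\}$. The accepted language of $A$ is $L(A) = \{t \in T(\Sigma) \mid \delta(t) \cap Q_f \neq \emptyset\}$. The state language $L(q)$ (also known as down language [conf/stringology/CleophasKSW09]) of a state $q \in Q$ is defined by $L(q) = \{t \in T(\Sigma) \mid q \in \delta(t)\}$. Obviously,

$$L(A) = \bigcup_{q \in Q_f} L(q). \qquad (1)$$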

In the remainder of this paper, we consider accessible FTAs, that is, FTAs in which every state $q$ satisfies $L(q) \neq \emptyset$. Obviously, any FTA admits an equivalent accessible FTA obtained by removing the states whose down language is empty.

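Given a symbol $c$ in $\Sigma_0$, the $c$-product is the operation $\cdot_c$ defined for any tree $t$ in $T(\Sigma)$ and for any tree language $L$ by

$$t \cdot_c L = \begin{cases} L & \text{if } t = c, \\ \{d\} & \text{if } t = d \in \Sigma_0 \setminus \{c\}, \\ f(t_1 \cdot_c L, \ldots, t_n \cdot_c L) & \text{if } t = f(t_1, \ldots, t_n) \text{ with } n > 0, \end{cases} \qquad (2)$$

where $f(L_1, \ldots, L_n) = \{f(u_1, \ldots, u_n) \mid u_i \in L_i, 1 \leq i \leq n\}$ for any tree languages $L_1, \ldots, L_n$. This $c$-product is extended for any two tree languages $L$ and $L'$ by $L \cdot_c L' = \bigcup_{t \in L} t \cdot_c L'$. In the remainder of this paper, we use some equivalences over expressions that rely on properties of the $c$-product. Let us state these properties. As is the case for the catenation product on strings, it distributes over the union: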

Lemma 1

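Let $L_1$, $L_2$ and $L_3$ be three tree languages over $\Sigma$. Let $c$ be a symbol in $\Sigma_0$. Then:

$$(L_1 \cup L_2) \cdot_c L_3 = (L_1 \cdot_c L_3) \cup (L_2 \cdot_c L_3).$$

Proof

Let $t$ be a tree in $T(\Sigma)$. Then:

$$\begin{aligned} t \in (L_1 \cup L_2) \cdot_c L_3 &\Leftrightarrow \exists u \in L_1 \cup L_2,\ t \in u \cdot_c L_3 \\ &\Leftrightarrow (\exists u \in L_1,\ t \in u \cdot_c L_3) \vee (\exists u \in L_2,\ t \in u \cdot_c L_3) \\ &\Leftrightarrow t \in (L_1 \cdot_c L_3) \cup (L_2 \cdot_c L_3). \end{aligned}$$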
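The $c$-product and the distributivity stated in Lemma 1 can be checked on small finite languages. The sketch below assumes the usual reading of the $c$-product, in which distinct occurrences of the substitution symbol may be replaced by distinct trees; the tuple encoding and the names `tree_product` and `lang_product` are hypothetical.

```python
from itertools import product

# Assumed encoding: the tree f(t1, ..., tn) is the tuple ("f", t1, ..., tn).

def tree_product(t, c, lang):
    """t ._c lang: replace each leaf c of t by some tree of lang."""
    if t == (c,):
        return set(lang)
    if len(t) == 1:                      # a leaf different from c
        return {t}
    choices = [tree_product(child, c, lang) for child in t[1:]]
    return {(t[0], *combo) for combo in product(*choices)}

def lang_product(lang1, c, lang2):
    """lang1 ._c lang2: union of t ._c lang2 over the trees t of lang1."""
    result = set()
    for t in lang1:
        result |= tree_product(t, c, lang2)
    return result

a, b = ("a",), ("b",)
L1 = {("f", a, a)}
L2 = {("g", a)}
L3 = {b, ("h", b)}

left = lang_product(L1 | L2, "a", L3)
right = lang_product(L1, "a", L3) | lang_product(L2, "a", L3)
print(left == right)
```

Here `lang_product(L1, "a", L3)` contains four trees, since each of the two occurrences of `a` in `f(a, a)` is replaced independently.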

Another property shared with the catenation product is that any operator $\cdot_c$ is associative:

Lemma 2

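Let $t$ and $t'$ be any two trees in $T(\Sigma)$, let $L$ be a tree language over $\Sigma$ and let $c$ be a symbol in $\Sigma_0$. Then, identifying the tree $t'$ with the language $\{t'\}$:

$$t \cdot_c (t' \cdot_c L) = (t \cdot_c t') \cdot_c L.$$

Proof

By induction over the structure of $t$.

  1. Consider that $t = c$. Then $t \cdot_c (t' \cdot_c L) = t' \cdot_c L = (t \cdot_c t') \cdot_c L$.

  2. Consider that $t \in \Sigma_0 \setminus \{c\}$. Then $t \cdot_c (t' \cdot_c L) = \{t\} = (t \cdot_c t') \cdot_c L$.

  3. Let us suppose that $t = f(t_1, \ldots, t_n)$ with $n > 0$. Then, following Equation (2):

    $$\begin{aligned} t \cdot_c (t' \cdot_c L) &= f(t_1 \cdot_c (t' \cdot_c L), \ldots, t_n \cdot_c (t' \cdot_c L)) \\ &= f((t_1 \cdot_c t') \cdot_c L, \ldots, (t_n \cdot_c t') \cdot_c L) && \text{(Induction hypothesis)} \\ &= f(t_1 \cdot_c t', \ldots, t_n \cdot_c t') \cdot_c L \\ &= (t \cdot_c t') \cdot_c L. \end{aligned}$$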

Corollary 1

Let $L$, $L'$ and $L''$ be any three tree languages over a graded alphabet $\Sigma$ and let $c$ be a symbol in $\Sigma_0$. Then:

$$L \cdot_c (L' \cdot_c L'') = (L \cdot_c L') \cdot_c L''.$$

However, associativity is not necessarily satisfied if the substitution symbols are different. Finally, the last property shared with the catenation product is that the operation $\cdot_c$ is compatible with inclusion:

Lemma 3

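Let $t$ be a tree over $\Sigma$, let $c$ be a symbol in $\Sigma_0$ and let $L \subset L'$ be two tree languages over $\Sigma$. Then:

$$t \cdot_c L \subset t \cdot_c L'.$$

Proof

By induction over the structure of $t$.

  1. Consider that $t = c$. Then $t \cdot_c L = L \subset L' = t \cdot_c L'$.

  2. Consider that $t \in \Sigma_0 \setminus \{c\}$. Then $t \cdot_c L = \{t\} = t \cdot_c L'$.

  3. Let us suppose that $t = f(t_1, \ldots, t_n)$ with $n > 0$.

    Then $t \cdot_c L = f(t_1 \cdot_c L, \ldots, t_n \cdot_c L)$.
    By induction hypothesis, $t_k \cdot_c L \subset t_k \cdot_c L'$ for any $1 \leq k \leq n$.
    Therefore, $t \cdot_c L \subset f(t_1 \cdot_c L', \ldots, t_n \cdot_c L') = t \cdot_c L'$.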

Corollary 2

Let $L$, $L' \subset L''$ be any three tree languages over $\Sigma$ and let $c$ be a symbol in $\Sigma_0$. Then:

$$L \cdot_c L' \subset L \cdot_c L''.$$
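The associativity for a single substitution symbol, its failure for two distinct symbols, and the compatibility with inclusion can all be observed on small examples. The following sketch relies on an assumed tuple encoding of trees and on hypothetical helper names; the counterexample with `f(a, b)` is chosen for this illustration and is not taken from the paper.

```python
from itertools import product

# Assumed encoding: the tree f(t1, ..., tn) is the tuple ("f", t1, ..., tn).

def tree_product(t, c, lang):
    """t ._c lang: replace each leaf c of t by some tree of lang."""
    if t == (c,):
        return set(lang)
    if len(t) == 1:
        return {t}
    choices = [tree_product(child, c, lang) for child in t[1:]]
    return {(t[0], *combo) for combo in product(*choices)}

def lang_product(lang1, c, lang2):
    """lang1 ._c lang2: union of t ._c lang2 over the trees t of lang1."""
    return set().union(*(tree_product(t, c, lang2) for t in lang1))

a, b, c = ("a",), ("b",), ("c",)
t, u = ("f", a, a), ("g", a)
L = {a, b}

# Same symbol on both products: both groupings give the same language.
same_left = lang_product(tree_product(t, "a", {u}), "a", L)
same_right = tree_product(t, "a", tree_product(u, "a", L))

# Distinct symbols a and b: the two groupings differ.
s = ("f", a, b)
mixed_left = lang_product(tree_product(s, "a", {b}), "b", {c})   # f(c, c)
mixed_right = tree_product(s, "a", tree_product(b, "b", {c}))    # f(c, b)

# Compatibility with inclusion: {b} is included in {b, c}.
monotone = tree_product(t, "a", {b}) <= tree_product(t, "a", {b, c})

print(same_left == same_right, mixed_left == mixed_right, monotone)
```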

The first property not shared with the classical catenation product is that the $c$-product may distribute over other products:

Lemma 4

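Let $t_1$, $t_2$ and $t_3$ be any three trees in $T(\Sigma)$. Let $a$ and $b$ be two distinct symbols in $\Sigma_0$ such that $a$ does not appear in $t_3$. Then:

$$(t_1 \cdot_a t_2) \cdot_b t_3 = (t_1 \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3).$$

Proof

By induction over $t_1$.

  1. If $t_1 = a$, then $(t_1 \cdot_a t_2) \cdot_b t_3 = t_2 \cdot_b t_3 = a \cdot_a (t_2 \cdot_b t_3) = (t_1 \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3)$.

  2. If $t_1 = b$, then, since $a$ does not appear in $t_3$, $(t_1 \cdot_a t_2) \cdot_b t_3 = b \cdot_b t_3 = t_3 = t_3 \cdot_a (t_2 \cdot_b t_3) = (t_1 \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3)$.

  3. If $t_1 = c \in \Sigma_0 \setminus \{a, b\}$, then $(t_1 \cdot_a t_2) \cdot_b t_3 = c = (t_1 \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3)$.

  4. If $t_1 = f(u_1, \ldots, u_n)$ with $n > 0$, then, following Equation (2):

    $$\begin{aligned} (t_1 \cdot_a t_2) \cdot_b t_3 &= f((u_1 \cdot_a t_2) \cdot_b t_3, \ldots, (u_n \cdot_a t_2) \cdot_b t_3) \\ &= f((u_1 \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3), \ldots, (u_n \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3)) && \text{(Induction Hypothesis)} \\ &= (t_1 \cdot_b t_3) \cdot_a (t_2 \cdot_b t_3). \end{aligned}$$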

Corollary 3

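Let $L_1$, $L_2$ and $L_3$ be any three tree languages over $\Sigma$. Let $a$ and $b$ be two distinct symbols in $\Sigma_0$ such that $L_3 \subset T(\Sigma \setminus \{a\})$. Then:

$$(L_1 \cdot_a L_2) \cdot_b L_3 = (L_1 \cdot_b L_3) \cdot_a (L_2 \cdot_b L_3).$$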

In some particular cases, two products commute:

Lemma 5

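Let $t_1$, $t_2$ and $t_3$ be any three trees in $T(\Sigma)$. Let $a$ and $b$ be two distinct symbols in $\Sigma_0$ such that $a$ does not appear in $t_3$ and such that $b$ does not appear in $t_2$. Then:

$$(t_1 \cdot_a t_2) \cdot_b t_3 = (t_1 \cdot_b t_3) \cdot_a t_2.$$

Proof

By induction over $t_1$.

  1. If $t_1 = a$, then, since $b$ does not appear in $t_2$, $(t_1 \cdot_a t_2) \cdot_b t_3 = t_2 \cdot_b t_3 = t_2 = a \cdot_a t_2 = (t_1 \cdot_b t_3) \cdot_a t_2$.

  2. If $t_1 = b$, then, since $a$ does not appear in $t_3$, $(t_1 \cdot_a t_2) \cdot_b t_3 = b \cdot_b t_3 = t_3 = t_3 \cdot_a t_2 = (t_1 \cdot_b t_3) \cdot_a t_2$.

  3. If $t_1 = c \in \Sigma_0 \setminus \{a, b\}$, then $(t_1 \cdot_a t_2) \cdot_b t_3 = c = (t_1 \cdot_b t_3) \cdot_a t_2$.

  4. If $t_1 = f(u_1, \ldots, u_n)$ with $n > 0$ then, following Equation (2):

    $$\begin{aligned} (t_1 \cdot_a t_2) \cdot_b t_3 &= f((u_1 \cdot_a t_2) \cdot_b t_3, \ldots, (u_n \cdot_a t_2) \cdot_b t_3) \\ &= f((u_1 \cdot_b t_3) \cdot_a t_2, \ldots, (u_n \cdot_b t_3) \cdot_a t_2) && \text{(Induction Hypothesis)} \\ &= (t_1 \cdot_b t_3) \cdot_a t_2. \end{aligned}$$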
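The distribution and commutation properties of Lemmas 4 and 5, as well as the role of the side condition on the substitution symbols, can be observed on concrete trees. The sketch below uses an assumed tuple encoding and hypothetical helper names; the trees are chosen for illustration only.

```python
from itertools import product

# Assumed encoding: the tree f(t1, ..., tn) is the tuple ("f", t1, ..., tn).

def tree_product(t, c, lang):
    """t ._c lang: replace each leaf c of t by some tree of lang."""
    if t == (c,):
        return set(lang)
    if len(t) == 1:
        return {t}
    choices = [tree_product(child, c, lang) for child in t[1:]]
    return {(t[0], *combo) for combo in product(*choices)}

def lang_product(lang1, c, lang2):
    """lang1 ._c lang2: union of t ._c lang2 over the trees t of lang1."""
    return set().union(*(tree_product(t, c, lang2) for t in lang1))

a, b, c = ("a",), ("b",), ("c",)
t1, t2, t3 = ("f", a, b), ("g", b), ("h", c)     # a does not appear in t3

# Distribution: (t1 ._a t2) ._b t3 versus (t1 ._b t3) ._a (t2 ._b t3).
dist_left = lang_product(tree_product(t1, "a", {t2}), "b", {t3})
dist_right = lang_product(tree_product(t1, "b", {t3}), "a",
                          tree_product(t2, "b", {t3}))

# Commutation: b does not appear in u2, a does not appear in t3.
u2 = ("g", c)
comm_left = lang_product(tree_product(t1, "a", {u2}), "b", {t3})
comm_right = lang_product(tree_product(t1, "b", {t3}), "a", {u2})

# Without the side condition (the third tree is a itself) distribution fails.
bad_left = lang_product(tree_product(t1, "a", {t2}), "b", {a})
bad_right = lang_product(tree_product(t1, "b", {a}), "a",
                         tree_product(t2, "b", {a}))

print(dist_left == dist_right, comm_left == comm_right, bad_left == bad_right)
```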

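The iterated $c$-product is the operation $^{n_c}$ recursively defined for any integer $n$ by:

$$L^{0_c} = \{c\}, \qquad (3)$$
$$L^{(n+1)_c} = L^{n_c} \cup L \cdot_c L^{n_c}. \qquad (4)$$

The $c$-closure is the operation $^{*_c}$ defined by $L^{*_c} = \bigcup_{n \geq 0} L^{n_c}$. Notice that, unlike the string case, the products may commute with the closure in some cases: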

Lemma 6

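Let $L_1$ and $L_2$ be any two tree languages over $\Sigma$. Let $a$ and $b$ be two distinct symbols in $\Sigma_0$ such that $L_2 \subset T(\Sigma \setminus \{a\})$. Then:

$$L_1^{*_a} \cdot_b L_2 = (L_1 \cdot_b L_2)^{*_a}.$$

Proof

Let us show by recurrence over the integer $n$ that $L_1^{n_a} \cdot_b L_2 = (L_1 \cdot_b L_2)^{n_a}$.

  1. If $n = 0$, then, according to Equation (3): $L_1^{0_a} \cdot_b L_2 = \{a\} \cdot_b L_2 = \{a\} = (L_1 \cdot_b L_2)^{0_a}$.

  2. If $n > 0$, then, following Equation (4):

    $$\begin{aligned} L_1^{n_a} \cdot_b L_2 &= (L_1^{(n-1)_a} \cup L_1 \cdot_a L_1^{(n-1)_a}) \cdot_b L_2 \\ &= (L_1^{(n-1)_a} \cdot_b L_2) \cup ((L_1 \cdot_a L_1^{(n-1)_a}) \cdot_b L_2) && \text{(Lemma 1)} \\ &= (L_1^{(n-1)_a} \cdot_b L_2) \cup ((L_1 \cdot_b L_2) \cdot_a (L_1^{(n-1)_a} \cdot_b L_2)) && \text{(Corollary 3)} \\ &= (L_1 \cdot_b L_2)^{(n-1)_a} \cup ((L_1 \cdot_b L_2) \cdot_a (L_1 \cdot_b L_2)^{(n-1)_a}) && \text{(Induction Hypothesis)} \\ &= (L_1 \cdot_b L_2)^{n_a}. \end{aligned}$$

As a direct consequence, $L_1^{*_a} \cdot_b L_2 = (L_1 \cdot_b L_2)^{*_a}$. ∎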
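The iterated product can be computed for small values of $n$, which also allows the identity used in the proof of Lemma 6 to be checked level by level. The sketch assumes the recursive definition in which each level is the union of the previous level and one more product; the tuple encoding and the helper names are hypothetical.

```python
from itertools import product

# Assumed encoding: the tree f(t1, ..., tn) is the tuple ("f", t1, ..., tn).

def tree_product(t, c, lang):
    """t ._c lang: replace each leaf c of t by some tree of lang."""
    if t == (c,):
        return set(lang)
    if len(t) == 1:
        return {t}
    choices = [tree_product(child, c, lang) for child in t[1:]]
    return {(t[0], *combo) for combo in product(*choices)}

def lang_product(lang1, c, lang2):
    """lang1 ._c lang2: union of t ._c lang2 over the trees t of lang1."""
    return set().union(*(tree_product(t, c, lang2) for t in lang1))

def iterated(lang, c, n):
    """n-th iterated c-product: starts from {c}, then adds one product per step."""
    result = {(c,)}
    for _ in range(n):
        result = result | lang_product(lang, c, result)
    return result

a, b, c = ("a",), ("b",), ("c",)

# Levels 0, 1 and 2 of the iterated a-product of {f(a, a)}.
sizes = [len(iterated({("f", a, a)}, "a", n)) for n in range(3)]
print(sizes)

# Level-by-level check of the identity behind Lemma 6 (a does not occur in L2).
L1 = {("f", a, b)}
L2 = {c, ("h", c)}
checks = [
    lang_product(iterated(L1, "a", n), "b", L2)
    == iterated(lang_product(L1, "b", L2), "a", n)
    for n in range(4)
]
print(checks)
```

The closure itself is an infinite union, so a program can only enumerate finitely many levels of it.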

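A rational expression $E$ over $\Sigma$ is inductively defined by:

$$E = 0, \quad E = f(E_1, \ldots, E_n), \quad E = E_1 + E_2, \quad E = E_1 \cdot_c E_2, \quad E = E_1^{*_c},$$

where $f$ is any symbol in $\Sigma_n$, $c$ is any symbol in $\Sigma_0$ and $E_1, \ldots, E_n$ are any $n$ rational expressions. The language denoted by $E$ is the tree language $L(E)$ inductively defined by:

$$L(0) = \emptyset, \quad L(f(E_1, \ldots, E_n)) = f(L(E_1), \ldots, L(E_n)), \quad L(E_1 + E_2) = L(E_1) \cup L(E_2),$$
$$L(E_1 \cdot_c E_2) = L(E_1) \cdot_c L(E_2), \quad L(E_1^{*_c}) = (L(E_1))^{*_c},$$

where $f$ is any symbol in $\Sigma_n$, $c$ is any symbol in $\Sigma_0$ and $E_1, \ldots, E_n$ are any $n$ rational expressions. In the remainder of this paper, we consider that rational expressions include some variables. Let $X = \{x_1, \ldots, x_k\}$ be a set of $k$ variables. A rational expression $E$ over $(\Sigma, X)$ is inductively defined by:

$$E = 0, \quad E = x_j, \quad E = f(E_1, \ldots, E_n), \quad E = E_1 + E_2, \quad E = E_1 \cdot_c E_2, \quad E = E_1^{*_c},$$

where $f$ is any symbol in $\Sigma_n$, $c$ is any symbol in $\Sigma_0$, $1 \leq j \leq k$ is any integer and $E_1, \ldots, E_n$ are any $n$ rational expressions over $(\Sigma, X)$. The language denoted by an expression with variables needs a context to be computed: indeed, any variable has to be evaluated according to a tree language. Let $\mathcal{L} = (L_1, \ldots, L_k)$ be a $k$-tuple of tree languages over $\Sigma$. The $\mathcal{L}$-language denoted by $E$ is the tree language $L_{\mathcal{L}}(E)$ inductively defined by:

$$L_{\mathcal{L}}(0) = \emptyset, \quad L_{\mathcal{L}}(x_j) = L_j, \quad L_{\mathcal{L}}(f(E_1, \ldots, E_n)) = f(L_{\mathcal{L}}(E_1), \ldots, L_{\mathcal{L}}(E_n)), \quad L_{\mathcal{L}}(E_1 + E_2) = L_{\mathcal{L}}(E_1) \cup L_{\mathcal{L}}(E_2),$$
$$L_{\mathcal{L}}(E_1 \cdot_c E_2) = L_{\mathcal{L}}(E_1) \cdot_c L_{\mathcal{L}}(E_2), \quad L_{\mathcal{L}}(E_1^{*_c}) = (L_{\mathcal{L}}(E_1))^{*_c},$$

where $f$ is any symbol in $\Sigma_n$, $c$ is any symbol in $\Sigma_0$, $1 \leq j \leq k$ is any integer and $E_1, \ldots, E_n$ are any $n$ rational expressions over $(\Sigma, X)$. Two rational expressions $E$ and $F$ with variables are equivalent, denoted by $E \sim F$, if for any tuple $\mathcal{L}$ of languages over $\Sigma$, $L_{\mathcal{L}}(E) = L_{\mathcal{L}}(F)$. Let $\Gamma \subset \Sigma$. Two rational expressions $E$ and $F$ with variables are $\Gamma$-equivalent, denoted by $E \sim_\Gamma F$, if for any tuple $\mathcal{L}$ of languages over $\Gamma$, $L_{\mathcal{L}}(E) = L_{\mathcal{L}}(F)$. By definition,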

(5)
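The evaluation of an expression with variables in a given context can be sketched as a recursive function over a small abstract syntax. Everything in this sketch is an assumption made for illustration: the tagged-tuple encoding of expressions, the zero-based variable indices, the function names, and in particular the `depth` parameter, which truncates the closure to finitely many iterated products so that the result stays finite.

```python
from itertools import product

# Assumed encodings: trees are tuples ("f", t1, ..., tn); expressions are
# ("empty",), ("var", j), ("sym", f, E1, ..., En), ("sum", E1, E2),
# ("prod", c, E1, E2) and ("star", c, E1).

def tree_product(t, c, lang):
    if t == (c,):
        return set(lang)
    if len(t) == 1:
        return {t}
    choices = [tree_product(child, c, lang) for child in t[1:]]
    return {(t[0], *combo) for combo in product(*choices)}

def lang_product(lang1, c, lang2):
    return set().union(*(tree_product(t, c, lang2) for t in lang1))

def iterated(lang, c, n):
    result = {(c,)}
    for _ in range(n):
        result = result | lang_product(lang, c, result)
    return result

def evaluate(e, env, depth=2):
    """Language of e in the context env, with closures cut at `depth` levels."""
    tag = e[0]
    if tag == "empty":
        return set()
    if tag == "var":
        return set(env[e[1]])
    if tag == "sym":
        args = [evaluate(x, env, depth) for x in e[2:]]
        return {(e[1], *combo) for combo in product(*args)}
    if tag == "sum":
        return evaluate(e[1], env, depth) | evaluate(e[2], env, depth)
    if tag == "prod":
        return lang_product(evaluate(e[2], env, depth), e[1],
                            evaluate(e[3], env, depth))
    if tag == "star":
        return iterated(evaluate(e[2], env, depth), e[1], depth)
    raise ValueError(f"unknown expression tag: {tag}")

# f(x, a) + b, where the single variable x is evaluated as the language {c}.
expr = ("sum", ("sym", "f", ("var", 0), ("sym", "a")), ("sym", "b"))
env = ({("c",)},)
print(evaluate(expr, env))

# Truncated closure of f(a, a) with respect to a, one level deep.
star = ("star", "a", ("sym", "f", ("sym", "a"), ("sym", "a")))
print(evaluate(star, (), depth=1))
```

Changing `env` changes the language of `expr`, which is why the equivalence of two expressions with variables quantifies over all tuples of languages.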

Notice that any expression over is also an expression over . However, two equivalent rational expressions over are not necessarily equivalent as rational expressions over . As an example, is equivalent to as expressions over , but not as expressions over :

In the following, we denote by the expression obtained by substituting any symbol by the expression in the expression . Obviously, this transformation is inductively defined as follows:

where is any symbol in , are two variables in , is any symbol in , is any symbol in and are any rational expressions over . This transformation preserves the language in the following case:

Lemma 7

Let be an expression over an alphabet and over a set of variables. Let be a rational expression over . Let be a variable in . Let be a -uple of tree languages such that . Then:

Proof

By induction over the structure of .

  1. If with and , .

  2. If , then . Therefore

  3. If , with , then:

    (Induction Hypothesis)
  4. If , then

    (Induction Hypothesis)
  5. If , then

    (Induction Hypothesis)
  6. If , then

    (Induction Hypothesis)

In the following, we denote by the set of the operators that appear in a rational expression . The previous substitution can be used in order to factorize an expression w.r.t. a variable. However, this operation does not preserve the equivalence; e.g.

Nevertheless, this operation preserves the language if it is based on a restricted alphabet:

Proposition 1

Let be a rational expression over a graded alphabet and over a set of variables. Let be a variable in . Let be the subset defined by . Let be a symbol not in . Then:

Proof

By induction over the structure of .

  1. If , then since , it holds from Equation (5) that .

  2. If , since does not appear in , it holds .

  3. If , then

    (Equation (2))
    (Induction hypothesis)
  4. If , then

    (Lemma 1)
  5. If , then

    (Corollary 3)
  6. If , then

    (Lemma 6)

3 Equations Systems for Tree Languages

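Let $\Sigma$ be an alphabet and $X = \{x_1, \ldots, x_k\}$ be a set of variables. An equation over $(\Sigma, X)$ is an expression $x_j = E_j$, where $1 \leq j \leq k$ is any integer and $E_j$ is a rational expression over $(\Sigma, X)$. An equation system over $(\Sigma, X)$ is a set $\mathcal{X} = \{x_j = E_j \mid 1 \leq j \leq k\}$ of $k$ equations. Let $\mathcal{L} = (L_1, \ldots, L_k)$ be a $k$-tuple of tree languages. The tuple $\mathcal{L}$ is a solution for an equation $(x_j = E_j)$ if $L_j = L_{\mathcal{L}}(E_j)$. The tuple $\mathcal{L}$ is a solution for $\mathcal{X}$ if for any equation $(x_j = E_j)$ in $\mathcal{X}$, $\mathcal{L}$ is a solution of $(x_j = E_j)$.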

Example 1

Let us define the equation system as follows:

The tuple is a solution for the equation , but not of the system .
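The difference between solving one equation and solving the whole system can be illustrated on a small hypothetical system, which is not the one of Example 1. In this sketch, right-hand sides are modelled as Python functions of the tuple of languages rather than as expression trees; this choice, the tuple encoding of trees, and all names are assumptions made for the illustration.

```python
from itertools import product

# Assumed encoding: the tree f(t1, ..., tn) is the tuple ("f", t1, ..., tn).

def apply_symbol(f, *langs):
    """f(L1, ..., Ln): all trees f(u1, ..., un) with ui taken in Li."""
    return {(f, *combo) for combo in product(*langs)}

# Hypothetical system over two variables (indices are zero-based here):
#   x1 = f(x2, x2) + a
#   x2 = b
system = [
    lambda L: apply_symbol("f", L[1], L[1]) | {("a",)},
    lambda L: {("b",)},
]

def solves_equation(j, langs):
    """langs solves the j-th equation if its j-th component equals the
    language denoted by the right-hand side in the context langs."""
    return langs[j] == system[j](langs)

def solves_system(langs):
    return all(solves_equation(j, langs) for j in range(len(system)))

a, b, c = ("a",), ("b",), ("c",)
good = ({("f", b, b), a}, {b})
partial = ({("f", c, c), a}, {c})

print(solves_system(good))
print(solves_equation(0, partial), solves_system(partial))
```

The tuple `partial` satisfies the first equation on its own, because that equation only relates the two components to each other, but it violates the second equation.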

Two systems over the same variables are equivalent if they admit the same solutions. Notice that a system does not necessarily admit a unique solution. As an example, any language is a solution of the system . Obviously,

Proposition 2

If $\mathcal{X}$ only contains equations $x_j = E_j$ with $E_j$ a rational expression without variables, then $(L(E_1), \ldots, L(E_k))$ is the unique solution of $\mathcal{X}$.

Let us now define the operation of substitution, computing an equivalent system.

Definition 1

Let be an equation system. The substitution of in is the system