# Mining Tree-Query Associations in Graphs

## Abstract

New applications of data mining, such as in biology, bioinformatics, or sociology, are faced with large datasets structured as graphs. We introduce a novel class of tree-shaped patterns called tree queries, and present algorithms for mining tree queries and tree-query associations in a large data graph. Novel about our class of patterns is that they can contain constants, and can contain existential nodes which are not counted when determining the number of occurrences of the pattern in the data graph. Our algorithms have a number of provable optimality properties, which are based on the theory of conjunctive database queries. We propose a practical, database-oriented implementation in SQL, and show that the approach works in practice through experiments on data about food webs, protein interactions, and citation analysis.

## 1 Introduction

The problem of mining patterns in graph-structured data has received considerable attention in recent years, as it has many interesting applications in such diverse areas as biology, the life sciences, the World Wide Web, or social sciences. In the present work we introduce a novel class of patterns, called tree queries, and we present algorithms for mining these tree queries and tree-query associations in a large data graph. This article is based on two earlier conference papers [17].

Tree queries are powerful tree-shaped patterns, inspired by conjunctive database queries [18]. In comparison to the kinds of patterns used in most other graph mining approaches, tree queries have some extra features:

• Patterns may have “existential” nodes: any occurrence of the pattern must have a copy of such a node, but existential nodes are not counted when determining the number of occurrences.

• Moreover, patterns may have “parameterized” nodes, labeled by constants, which must map to fixed designated nodes of the data graph.

• An “occurrence” of the pattern in a data graph $G$ is defined as any homomorphism from the pattern into $G$. When counting the number of occurrences, two occurrences that differ only on existential nodes are identified.

Past work in graph mining has dealt with node labels, but only with non-unique ones: such labels are easily simulated by constants, but the converse is not obvious. It is also possible to simulate edge labels using constants. To simulate a node label $a$, add a special node $a$, and express that node $x$ has label $a$ by drawing an edge from $x$ to $a$. For an edge $x \to y$ labeled $b$, introduce an intermediate node $z$ with $x \to z \to y$, and label node $z$ by $b$.
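The simulation of labels by constants can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `encode_labels`, the tagged tuples used for label nodes and intermediate nodes, and the choice to point the labeling edge from a node to its label node are all hypothetical and not taken from our implementation.

```python
def encode_labels(node_labels, labeled_edges):
    """Encode a node- and edge-labeled graph as a plain set of directed edges.

    node_labels:   dict mapping a node to its (non-unique) label
    labeled_edges: list of (source, label, target) triples
    """
    edges = set()
    for node, label in node_labels.items():
        # One special node per label; an edge from a node to the label node
        # expresses that the node carries this label (assumed direction).
        edges.add((node, ("label", label)))
    for index, (source, label, target) in enumerate(labeled_edges):
        middle = ("edge", index)               # fresh intermediate node
        edges.add((source, middle))
        edges.add((middle, target))
        edges.add((middle, ("label", label)))  # label the intermediate node
    return edges


encoded = encode_labels({"a": "protein", "b": "protein"}, [("a", "binds", "b")])
```

Each label becomes a single constant node, so a tree query can refer to a label by using a parameter that is assigned to that node.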

A simple example of a tree query is shown in Figure ?; when applied to a food web (a data graph of organisms, where there is an edge $x \to y$ if $y$ feeds on $x$), it describes all organisms $x$ that compete with organism #8 for some organism as food, that itself feeds on organism #0. This pattern has one existential node, two parameters, and one distinguished node $x$. Figure ? shows another example of a tree query; when applied to a food web, it describes all organisms that have a path of length four beneath them that ends in organism #8.

Effectively, tree queries are what is known in database research as conjunctive queries [9]; these are the queries we could pose to the data graph (stored as a two-column table) in the core fragment of SQL where we do not use aggregates or subqueries, and use only conjunctions of equality comparisons as where-conditions. For example, the pattern of Figure ? amounts to the following SQL query on a table G(from,to):

select distinct G3.to as x
from G G1, G G2, G G3
where G1.from=0 and G1.to=G2.from
and G2.to=8 and G3.from=G2.from
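To see the query at work, the following sketch runs it on a tiny, made-up food web using Python's standard `sqlite3` module. The edge list is hypothetical, and an edge (a, b) is read as "b feeds on a", which is the reading under which the query matches its description. Since `from` and `to` are reserved words in SQL, the column names are quoted in this sketch.

```python
import sqlite3

# Hypothetical toy food web: an edge (a, b) means that b feeds on a.
edges = [(0, 1), (1, 8), (1, 5), (1, 6), (0, 2), (2, 7)]

conn = sqlite3.connect(":memory:")
# "from" and "to" are SQL keywords, so the column names are quoted.
conn.execute('create table G ("from" integer, "to" integer)')
conn.executemany("insert into G values (?, ?)", edges)

query = '''
select distinct G3."to" as x
from G G1, G G2, G G3
where G1."from" = 0 and G1."to" = G2."from"
  and G2."to" = 8 and G3."from" = G2."from"
'''
competitors = sorted(row[0] for row in conn.execute(query))
print(competitors)  # prints [5, 6, 8]
```

Organism #8 itself appears in the answer, because an occurrence is a homomorphism and the distinguished node may therefore coincide with a parameter.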

In the present work we also introduce association rules over tree queries. By mining for tree-query associations we can discover quite subtle properties of the data graph. Figure ? shows a very simple example of an association that our algorithm might find in a social network: a data graph of persons where there is an edge $x \to y$ if $x$ considers $y$ to be a close friend. The tree query on the left matches all pairs of “co-friends”: persons that are friends of a common person (represented by an existential variable). The query on the right matches all co-friends of person #5 (represented by a parameterized node), and pairs all those co-friends to person #5. If the association from the left to the right were discovered with a confidence of $c$, with $0 \le c \le 1$, then this would mean that the pairs retrieved by the right query constitute a fraction $c$ of all pairs retrieved by the left query, which indicates (for nonnegligible $c$) that person #5 plays a special role in the network.

Figure ? shows quite a different, but again simple, example of a tree-query association that our algorithm might discover in a food web. With confidence $c$, this association means that of all organisms that are not on top of the food chain (i.e., they are fed upon by some other organism), a fraction $c$ is actually at least two down in the food chain.

The examples of tree queries and associations we just saw are didactic, but in Section 7 we will see more complicated examples of tree queries and associations mined in real-life datasets.

In this paper we present algorithms for mining tree queries and association rules over tree queries in a large data graph. Some important features of these algorithms are the following:

1. Our algorithms belong to the group of graph mining algorithms where the input is a single large graph, and the task is to discover patterns that occur sufficiently often in the single data graph. We will refer to this group of algorithms as the single graph category. There is also a second category of graph mining algorithms, called the transactional category, which is explained in Section 2.

2. We restrict to patterns that are trees, such as the example in Figure ?. Tree patterns have formed an important special case in the transactional category (Section 2), but have not yet received special attention in the single-graph literature. Note that the data graph that is being mined is not restricted in any way.

3. The tree-query-mining algorithm is incremental in the number of nodes of the pattern. So, our algorithm systematically considers ever larger trees, and can be stopped any time it has run long enough or has produced enough results. Our algorithm does not need any space beyond what is needed to store the mining results. Thanks to the restriction to tree shapes the duplicate-free generation of trees can be done efficiently.

4. For each tree, all conjunctive queries based on that tree are generated in the tree-query-mining algorithm. Here, we work in a levelwise fashion in the sense of Mannila and Toivonen [31].

5. As in classical association rules over itemsets [3], our association rule generation phase comes after the generation of frequent patterns and does not require access to the original dataset.

6. We apply the theory of conjunctive database queries [9] to formally define and to correctly generate association rules over tree queries. The conjunctive-query approach to pattern matching allows for an efficiently checkable notion of frequency, whereas in the subgraph-based approach, determining whether a pattern is frequent is NP-complete (in that approach the frequency of a pattern is the maximal number of disjoint subgraphs isomorphic to the pattern [29]).

7. There is a notion of equivalence among tree queries and association rules over tree queries. We carefully and efficiently avoid the generation of equivalent tree queries and associations, by using and adapting what is known from the theory of conjunctive database queries. Due to the restriction to tree shapes, equivalence and redundancy (which are normally NP-complete) are efficiently checkable.

8. Last but not least, our algorithms naturally suggest a database-oriented implementation in SQL. This is useful for several reasons. First, the number of discovered patterns can be quite large, and it is important to keep them available in a persistent and structured manner, so that they can be browsed easily, and so that association rules can be derived efficiently. Second, we will show how the use of SQL allows us to generate and check large numbers of similar patterns in parallel, taking advantage of the query processing optimizations provided by modern relational database systems. Third, a database-oriented implementation does not require us to move the dataset out of the database before it can be mined. In classical itemset mining, database-oriented implementations have received serious attention [39], but less so in graph mining, a recent exception being an implementation in SQL of the seminal SUBDUE algorithm [8].

The purpose of this paper is to introduce tree queries and tree-query associations and to present algorithms for mining tree queries and tree-query associations. Concrete applications to discover new knowledge about scientific datasets are the topic of current research. Yet, the algorithms are fully implemented and we can already show that our approach works in practice, by showing some concrete results mined from a food web, a protein interactions graph, and a citation graph. We will also give performance results on random data graphs (as a worst-case scenario).

## 2 Related Work

Approaches to graph mining, especially mining for frequent patterns or association rules, can be divided into two major categories which are not to be confused.

1. In transactional graph mining, e.g., [12], the dataset consists of many small data graphs which we call transactions, and the task is to discover patterns that occur at least once in a sufficient number of transactions. (Approaches from machine learning or inductive logic programming usually call the small data graphs “examples” instead of transactions.)

2. In single-graph mining the dataset is a single large data graph, and the task is to discover patterns that occur sufficiently often in the dataset.

Note that single-graph mining is more difficult than transactional mining, in the sense that transactional graph mining can be simulated by single-graph mining, but the converse is not obvious.

Since our approach falls squarely within the single-graph category, we will focus on that category in this section. Most work in this category has been done on frequent pattern mining, and less attention has been spent on association rules. We briefly review the work in this category next:

• Cook and Holder [11] apply in their SUBDUE system the minimum description length (MDL) principle to discover substructures in a labeled data graph. The MDL principle states that the best pattern is the pattern that minimizes the description length of the complete data graph. Hence, in SUBDUE a pattern is evaluated on how well it can compress the entire dataset. The input for the SUBDUE system is a labeled data graph; nodes and edges are labeled with non-unique labels. This is in contrast with the unique labels (‘constants’) in our system. As we already noted, non-unique node labels and edge labels can easily be simulated by constants, but the converse is not obvious. The SUBDUE system only mines patterns, not association rules.

• Ghazizadeh and Chawathe [16] mine in their SEuS system for connected subgraphs in a labeled, directed data graph, as in the SUBDUE system. Instead of generating candidate patterns using the input data graph, SEuS uses a summary of the data graph. This summary gives an upper bound for the support of the patterns, and the user can then select those patterns whose exact support they want to know. SEuS also only mines for frequent patterns and not for associations.

• Vanetik, Gudes, and Shimony [19] propose an Apriori-like [3] algorithm for mining subgraphs from a labeled data graph. The support of a graph pattern is defined as the maximal number of edge-disjoint instances of the pattern in the data graph. By reducing the support counting problem to the maximal independent set problem on graphs, they show that in the worst case, computing the support of a graph pattern is NP-hard. They propose an Apriori-like algorithm to minimize the number of patterns for which the support needs to be computed. The major idea of their approach is using edge-disjoint paths as building blocks instead of items in classical itemset mining. Vanetik, Gudes, and Shimony also only mine for frequent patterns in the data graph.

• Kuramochi and Karypis [29] use the same support measure for graph patterns as Vanetik, Gudes and Shimony [19]. They also note that computing the support of a graph pattern is NP-hard in the worst case, since it can be reduced to finding the maximum independent set (MIS) in a graph. Kuramochi and Karypis quickly compute the support of a graph pattern using approximate MIS algorithms. The number of candidate patterns is restricted using canonical labeling. Like the majority of algorithms, that of Kuramochi and Karypis only mines for frequent patterns.

• Jeh and Widom [24] consider patterns that are, like our tree queries, inspired by conjunctive database queries, and they also emphasize the tree-shaped case. A severe restriction, however, is that their patterns can be matched by single nodes only, rather than by tuples of nodes. Still their work is interesting in that it presents a rather nonstandard approach to graph mining, quite different from the standard incremental, levelwise approach, and in that it incorporates ranking. Jeh and Widom mention association rules as an example of an application of their mining framework.

The related work that was most influential for us is Warmr [12], although it belongs to the transactional category. Based on inductive logic programming, patterns in Warmr also feature existential variables and parameters. While not restricted to tree shapes, the queries in Warmr are restricted in another sense so that only transactional mining can be supported. Association rules in Warmr are defined in a naive manner through pattern extension, rather than being founded upon the theory of conjunctive query containment. The Warmr system is also Prolog-oriented rather than database-oriented; we believe a database orientation is fundamental to mining single large data graphs, and it allows a more uniform and parallel treatment of parameter instantiations, as we will show in this paper. Finally, Warmr does not seriously attempt to avoid the generation of duplicates. Yet, Warmr remains a pathbreaking work, which did not receive sufficient follow-up in the data mining community at large. We hope our present work represents an improvement in this respect. Many of the improvements we make to Warmr were already envisaged (but without concrete algorithms) in 2002 by Goethals and the second author [18].

Finally, we note that parameterized conjunctive database queries have been used in data mining quite early, e.g., [39], but then in the setting of “data mining query languages”, where a single such query serves to specify a family of patterns to be mined or queried for, rather than the mining for such queries themselves, let alone associations among them.

## 3 Problem Statement

In this section we define some concepts formally. In the appendix an overview of all notations used in this paper is given.

We assume a set $U$ of data constants from which the nodes of the data graph to be mined will be taken.

### 3.1 Graph-theoretic concepts

Let $N$ be any finite set of nodes; nodes can be any data objects such as numbers or strings. For our purposes, we define a (directed) graph on $N$ as a subset of $N^2$, i.e., as a finite set of ordered pairs of nodes. These pairs are called edges. We assume familiarity with the notion of a tree as a special kind of graph, and with standard graph-theoretic concepts such as root of a tree; children, descendants, parent, and ancestors of a node; and path in a graph. Any good algorithms textbook will supply the necessary background.

In this paper all trees we consider are rooted and unordered, unless stated otherwise.

### 3.2 Tree Pattern

Tree patterns. A parameterized tree pattern $P$ is a tree whose nodes are called variables, and where additionally:

• Some variables may be marked as being existential;

• Some other variables may be marked as parameters;

• The variables of $P$ that are neither existential nor parameters are called distinguished.

We will denote the set of existential variables by $\Pi$, the set of parameters by $\Sigma$, and the set of distinguished variables by $\Delta$. To make clear that these sets belong to some parameterized tree pattern $P$ we will use a subscript as in $\Pi_P$ or $\Sigma_P$.

A parameter assignment $\alpha$, for a parameterized tree pattern $P$, is a mapping $\Sigma \to U$ which assigns data constants to the parameters.

An instantiated tree pattern is a pair $(P, \alpha)$, with $P$ a parameterized tree pattern and $\alpha$ a parameter assignment for $P$. We will also denote this by $P^{\alpha}$.

When depicting parameterized tree patterns, existential nodes are indicated by labeling them with the symbol ‘$\exists$’ and parameters are indicated by labeling them with the symbol ‘$\sigma$’. When depicting instantiated tree patterns, parameters are indicated by directly writing down their parameter assignment.

Figure ? shows an illustration.

Matching. Recall that a homomorphism from a graph $G_1$ to a graph $G_2$ is a mapping $f$ from the nodes of $G_1$ to the nodes of $G_2$ that preserves edges, i.e., if $(x, y) \in G_1$ then $(f(x), f(y)) \in G_2$. We now define a matching of an instantiated tree pattern $P^{\alpha}$ in a data graph $G$ as a homomorphism $f$ from the underlying tree of $P$ to $G$, with the constraint that for any parameter $z$, if $\alpha(z) = a$, then $f(z)$ must be the node $a$. We denote the set $\{ f|_{\Delta} \mid f \text{ is a matching of } P^{\alpha} \text{ in } G \}$ by $P^{\alpha}(G)$.

Frequency of a tree pattern. The frequency of an instantiated tree pattern $P^{\alpha}$ in a data graph $G$ is formally defined as the cardinality of $P^{\alpha}(G)$. So, we count the number of matchings of $P^{\alpha}$ in $G$, with the important provision that we identify any two matchings that agree on the distinguished variables. Indeed, two matchings that differ only on the existential nodes need not be distinguished, as this is precisely the intended semantics of existential nodes. Note that we do not need to worry about the parameters, as all matchings agree on those by definition. For a given threshold $k$ (a natural number) we say that $P^{\alpha}$ is $k$-frequent if its frequency is at least $k$. Often the threshold is understood implicitly, and then we talk simply about “frequent” patterns and denote the threshold by minsup.
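The definitions of matching and frequency can be made concrete with a brute-force sketch. It is meant purely as an illustration and is exponential in the number of variables; it is not the algorithm developed in this paper. The function names, the representation of a pattern as a list of tree edges, and the toy food web are assumptions made for this example, and the sketch assumes a pattern with at least one edge.

```python
from itertools import product


def matchings(tree_edges, assignment, graph_edges):
    """Yield every matching: a homomorphism from the pattern tree into the
    data graph that sends each parameter to its assigned constant."""
    graph = set(graph_edges)
    variables = sorted({v for edge in tree_edges for v in edge})
    nodes = sorted({n for edge in graph for n in edge})
    free = [v for v in variables if v not in assignment]
    for values in product(nodes, repeat=len(free)):
        h = dict(assignment)
        h.update(zip(free, values))
        if all((h[a], h[b]) in graph for a, b in tree_edges):
            yield h


def frequency(tree_edges, existential, assignment, graph_edges):
    """Count matchings up to agreement on the distinguished variables."""
    variables = {v for edge in tree_edges for v in edge}
    distinguished = sorted(variables - set(existential) - set(assignment))
    return len({tuple(h[v] for v in distinguished)
                for h in matchings(tree_edges, assignment, graph_edges)})


# Hypothetical food web: an edge (a, b) means that b feeds on a.
web = [(0, 1), (0, 2), (1, 8), (2, 8), (1, 5), (2, 5), (2, 7)]
# Pattern: p0 -> z, z -> p8, z -> x, with z existential and p0, p8 parameters.
tree = [("p0", "z"), ("z", "p8"), ("z", "x")]
alpha = {"p0": 0, "p8": 8}

print(len(list(matchings(tree, alpha, web))))  # prints 5
print(frequency(tree, {"z"}, alpha, web))      # prints 3
```

In this toy graph there are five matchings but only three distinct values for the distinguished node, which is exactly the effect of identifying matchings that differ only on the existential node.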

### 3.3 Tree Query

Tree queries. A parameterized tree query $Q$ is a pair $(H, P)$ where:

1. $P$ is a parameterized tree pattern, called the body of $Q$;

2. $H$ is a tuple of distinguished variables and parameters coming from $P$. All distinguished variables of $P$ must appear at least once in $H$. We call $H$ the head of $Q$.

A parameter assignment for $Q$ is simply a parameter assignment for its body, and an instantiated tree query is then again a pair $(Q, \alpha)$ with $Q$ a parameterized tree query and $\alpha$ a parameter assignment for $Q$. We will again also denote this by $Q^{\alpha}$.

When depicting tree queries, the head is given above a horizontal line, and the body below it. Two illustrations are given in Figure ?.

Frequency of a tree query. The frequency of an instantiated tree query $Q^{\alpha}$ in a data graph $G$ is defined as the frequency of the body $P^{\alpha}$ in $G$. When $G$ is understood, we denote the frequency by $\mathit{freq}(Q^{\alpha})$. For a given threshold $k$ (a natural number) we say that $Q^{\alpha}$ is $k$-frequent if its frequency is at least $k$. Again, this threshold is often understood implicitly, and then we talk simply about “frequent” queries and denote the threshold by minsup.

Containment of tree queries. An important step towards our formal definition of tree-query association is the notion of containment among queries. Since queries are parameterized, a variation of the classical notion of containment [9] is needed in that we now need to specify a parameter correspondence.

First, we define the answer set of an instantiated tree query $Q^{\alpha}$, with $Q = (H, P)$, in a data graph $G$ as follows:

$$Q^{\alpha}(G) := \{ f(H) \mid f \text{ is a matching of } P^{\alpha} \text{ in } G \}.$$

Consider two parameterized tree queries $Q_1$ and $Q_2$, with $Q_i = (H_i, P_i)$ for $i = 1, 2$, and write $\Sigma_i$ for the set of parameters of $Q_i$. A parameter correspondence from $Q_1$ to $Q_2$ is any mapping $\rho : \Sigma_1 \to \Sigma_2$. We then say that a parameterized tree query $Q_2$ is $\rho$-contained in a parameterized tree query $Q_1$, if for every $\alpha_2$, a parameter assignment for $Q_2$, we have $Q_2^{\alpha_2}(G) \subseteq Q_1^{\alpha_2 \circ \rho}(G)$ for all data graphs $G$. In shorthand notation we write this as $Q_2 \subseteq_{\rho} Q_1$.

Containment as just defined is a semantical property, referring to all possible data graphs, and it is not immediately clear how one could decide this property syntactically. The required syntactical notion for this is that of $\rho$-containment mapping, which we next define in two steps. For the tree queries $Q_1$ and $Q_2$ as above, and $\rho$ a parameter correspondence from $Q_1$ to $Q_2$:

1. A $\rho$-containment mapping from $P_1$ to $P_2$ is a homomorphism $f$ from the underlying tree of $P_1$ to the underlying tree of $P_2$, with the properties:

   1. $f$ maps the distinguished nodes of $P_1$ to distinguished nodes or parameters of $P_2$; and

   2. $f|_{\Sigma_1} = \rho$, i.e., for each $z \in \Sigma_1$ we have $f(z) = \rho(z)$.

2. Finally, a $\rho$-containment mapping from $Q_1$ to $Q_2$ is a $\rho$-containment mapping $f$ from $P_1$ to $P_2$ such that $f(H_1) = H_2$.

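For later use, we note that containment mappings compose. Let $P_1$, $P_2$ and $P_3$ be parameterized tree patterns with parameter sets $\Sigma_1$, $\Sigma_2$, $\Sigma_3$ and sets of distinguished nodes $\Delta_1$, $\Delta_2$, $\Delta_3$. If $f$ is a $\rho_1$-containment mapping from $P_1$ to $P_2$ and $g$ is a $\rho_2$-containment mapping from $P_2$ to $P_3$, then $g \circ f$ is a $(\rho_2 \circ \rho_1)$-containment mapping from $P_1$ to $P_3$.

We will show that:

1. $g \circ f$ is a homomorphism;

2. $g \circ f$ maps distinguished nodes of $P_1$ to distinguished nodes or parameters of $P_3$; and

3. $(g \circ f)|_{\Sigma_1} = \rho_2 \circ \rho_1$.

Clearly $g \circ f$ is a homomorphism since both $f$ and $g$ are homomorphisms, and a composition of homomorphisms is a homomorphism.

Consider an $x \in \Delta_1$; then there are two possibilities for $f(x)$:

1. $f(x) = y$, with $y \in \Delta_2$. Then we know, since $g$ is a $\rho_2$-containment mapping, that $g(y)$ is either a distinguished node in $\Delta_3$, or a parameter in $\Sigma_3$.

2. $f(x) = y$, with $y \in \Sigma_2$. Then we know, since $g|_{\Sigma_2} = \rho_2$, that $g(y) = \rho_2(y)$, with $\rho_2(y) \in \Sigma_3$.

Hence, we can conclude that $g \circ f$ maps distinguished nodes of $P_1$ to distinguished nodes or parameters of $P_3$.

For each $z \in \Sigma_1$, we have $g(f(z)) = g(\rho_1(z)) = \rho_2(\rho_1(z))$. Hence, $(g \circ f)|_{\Sigma_1} = \rho_2 \circ \rho_1$.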

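From the theory of conjunctive database queries [9] we can derive the following: for parameterized tree queries $Q_1 = (H_1, P_1)$ and $Q_2 = (H_2, P_2)$ and a parameter correspondence $\rho$ from $Q_1$ to $Q_2$, we have $Q_2 \subseteq_{\rho} Q_1$ if and only if there exists a $\rho$-containment mapping from $Q_1$ to $Q_2$.

Let us start with the ‘only if’ direction. We first introduce the concept of a freezing of a parameterized tree query $Q = (H, P)$. Recall that $U$ is the set of data constants from which the nodes of the data graph to be mined will be taken. A freezing of $Q$ is then a one-to-one mapping $\beta$ from the nodes of $P$ to $U$. We denote by $\beta(P)$ the data graph constructed from $P$ by replacing each node $x$ of $P$ by $\beta(x)$, and we denote by $\beta(H)$ the tuple constructed from $H$ by replacing each node $x$ in $H$ by the data constant $\beta(x)$.

For example, consider the parameterized tree query in Figure ?. Figure ? shows $\beta(P)$ and $\beta(H)$ for one such freezing $\beta$.

We can now continue with the proof of the ‘only if’ direction. Consider a freezing $\beta$ from the nodes of $P_2$ to $U$. Note that $\alpha_2 := \beta|_{\Sigma_2}$ is a parameter assignment for $Q_2$, and $\beta(H_2) \in Q_2^{\alpha_2}(\beta(P_2))$. Since $Q_2 \subseteq_{\rho} Q_1$, also $\beta(H_2) \in Q_1^{\alpha_2 \circ \rho}(\beta(P_2))$. Hence, there must be a matching $h$ of $P_1^{\alpha_2 \circ \rho}$ in $\beta(P_2)$ such that $h(H_1) = \beta(H_2)$. Now consider the function $f = \beta^{-1} \circ h$. We show that $f$ is a $\rho$-containment mapping from $Q_1$ to $Q_2$:

1. Clearly, $f$ is a homomorphism from $P_1$ to $P_2$ since $h$ is a homomorphism and $\beta^{-1}$ is an isomorphism. Also the following properties hold for $f$:

   1. $f$ maps distinguished nodes of $P_1$ to distinguished nodes or parameters of $P_2$, since $f(H_1) = H_2$ (as shown in (2)), every distinguished node of $P_1$ appears in $H_1$, and $H_2$ contains only distinguished nodes and parameters; and

   2. for each $z \in \Sigma_1$: $f(z) = \beta^{-1}(h(z)) = \beta^{-1}(\alpha_2(\rho(z))) = \beta^{-1}(\beta(\rho(z))) = \rho(z)$, hence $f|_{\Sigma_1} = \rho$.

2. $f(H_1) = \beta^{-1}(h(H_1)) = \beta^{-1}(\beta(H_2)) = H_2$.

Hence, we conclude that $f$ is a $\rho$-containment mapping from $Q_1$ to $Q_2$.

Let us then look at the ‘if’ direction. Let $f$ be the $\rho$-containment mapping from $Q_1$ to $Q_2$. Consider an arbitrary parameter assignment $\alpha_2$ for $Q_2$. We must prove that for every data graph $G$, if $t \in Q_2^{\alpha_2}(G)$, then also $t \in Q_1^{\alpha_2 \circ \rho}(G)$. Consider such an arbitrary data graph $G$. Since $t \in Q_2^{\alpha_2}(G)$, we know that there exists a matching $h$ of $P_2^{\alpha_2}$ in $G$ such that $h(H_2) = t$. Now consider the function $g = h \circ f$. We show that $g$ is a matching of $P_1^{\alpha_2 \circ \rho}$ in $G$ and that $g(H_1) = t$:

1. $g$ is a homomorphism since both $h$ and $f$ are homomorphisms; and

2. for each $z \in \Sigma_1$ we have $g(z) = h(f(z)) = h(\rho(z)) = \alpha_2(\rho(z))$.

So, $g$ is indeed a matching of $P_1^{\alpha_2 \circ \rho}$ in $G$. Finally, we observe that $g(H_1) = h(f(H_1)) = h(H_2) = t$, as desired.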

Checking for a containment mapping is evidently computable, and although the problem for general database conjunctive queries is NP-complete, our restriction to tree shapes allows for efficient checking, as we will see later.
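To illustrate why the check is evidently computable, here is a naive sketch that simply enumerates all candidate mappings between two small tree queries. It is exponential and is not the efficient procedure referred to above. The dictionary representation of a query and the function name `containment_mappings` are assumptions made for this example; the two queries are modeled on the co-friends example from the introduction.

```python
from itertools import product


def containment_mappings(q1, q2, rho):
    """Yield every rho-containment mapping from query q1 to query q2.

    A query is a dict with keys 'edges' (tree edges), 'existential' and
    'parameters' (sets of variables) and 'head' (a tuple of variables).
    rho maps each parameter of q1 to a parameter of q2.
    """
    vars1 = sorted({v for edge in q1["edges"] for v in edge})
    vars2 = sorted({v for edge in q2["edges"] for v in edge})
    edges2 = set(q2["edges"])
    distinguished1 = set(vars1) - q1["existential"] - q1["parameters"]
    free = [v for v in vars1 if v not in q1["parameters"]]
    for values in product(vars2, repeat=len(free)):
        f = dict(rho)                  # f agrees with rho on the parameters
        f.update(zip(free, values))
        if not all((f[a], f[b]) in edges2 for a, b in q1["edges"]):
            continue                   # not a homomorphism
        if any(f[x] in q2["existential"] for x in distinguished1):
            continue                   # distinguished node sent to an existential one
        if tuple(f[v] for v in q1["head"]) != tuple(q2["head"]):
            continue                   # the head of q1 is not mapped onto the head of q2
        yield f


# Left query: pairs of co-friends (x1, x2) of a common, existential person z.
lhs = {"edges": [("z", "x1"), ("z", "x2")], "existential": {"z"},
       "parameters": set(), "head": ("x1", "x2")}
# Right query: co-friends x of the parameter s, paired with s.
rhs = {"edges": [("z", "x"), ("z", "s")], "existential": {"z"},
       "parameters": {"s"}, "head": ("x", "s")}

found = list(containment_mappings(lhs, rhs, {}))
print(found)
```

The single mapping found sends `x1` to `x`, `x2` to the parameter `s`, and `z` to `z`, which witnesses that the right query is contained in the left one.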

### 3.4 Tree-Query Association

Association rules. A parameterized association rule (pAR) is of the form $Q_1 \Rightarrow_{\rho} Q_2$, with $Q_1$ and $Q_2$ parameterized tree queries and $\rho$ a parameter correspondence from $Q_1$ to $Q_2$. We call a pAR legal if $Q_2 \subseteq_{\rho} Q_1$. We call $Q_1$ the left-hand side (lhs), and $Q_2$ the right-hand side (rhs). A parameter assignment $\alpha$, for a pAR, is a mapping $\Sigma_2 \to U$ which assigns data constants to the parameters of the rhs. An instantiated association rule (iAR) is a pair $(Q_1 \Rightarrow_{\rho} Q_2, \alpha)$, with $Q_1 \Rightarrow_{\rho} Q_2$ a pAR and $\alpha$ a parameter assignment for it. Note that while $\alpha$ is only defined on the rhs, we can also apply it to the lhs by using $\rho$ first.

Confidence. The confidence of an iAR in a data graph $G$ is defined as the frequency of $Q_2^{\alpha}$ divided by the frequency of $Q_1^{\alpha \circ \rho}$. If the AR is legal, we know that the answer set of $Q_2^{\alpha}$ is a subset of the answer set of $Q_1^{\alpha \circ \rho}$, and hence the confidence equals precisely the proportion that the $Q_2^{\alpha}$ answer set takes up in the $Q_1^{\alpha \circ \rho}$ answer set. Thus, our notions of a legal pAR and confidence are very intuitive and natural.

For a given threshold $c$ (a rational number, $0 \le c \le 1$) we say that the iAR is $c$-confident in $G$ if its confidence in $G$ is at least $c$. Often the threshold is understood implicitly, and then we talk simply about “confident” iARs and denote the threshold by minconf.

Furthermore, the iAR is called frequent in $G$ if $Q_2^{\alpha}$ is frequent in $G$. Note that if the iAR is legal and frequent, then also $Q_1^{\alpha \circ \rho}$ is frequent, since the rhs is $\rho$-contained in the lhs.
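As a small worked example of confidence, the following sketch computes the answer sets of both sides of the co-friends association on a made-up friendship graph. The graph, the dictionary representation of queries and the function names are assumptions for illustration only. The sketch uses answer-set sizes, which coincide with the frequencies because every distinguished variable occurs in the head and parameters are fixed by the assignment. It raises `ZeroDivisionError` when the lhs has no answers.

```python
from fractions import Fraction
from itertools import product


def answer_set(query, assignment, graph_edges):
    """Set of head tuples over all matchings of the instantiated body."""
    graph = set(graph_edges)
    variables = sorted({v for edge in query["edges"] for v in edge})
    nodes = sorted({n for edge in graph for n in edge})
    free = [v for v in variables if v not in assignment]
    answers = set()
    for values in product(nodes, repeat=len(free)):
        h = dict(assignment)
        h.update(zip(free, values))
        if all((h[a], h[b]) in graph for a, b in query["edges"]):
            answers.add(tuple(h[v] for v in query["head"]))
    return answers


def confidence(lhs, rhs, rho, alpha, graph_edges):
    """Confidence of the instantiated rule lhs => rhs.

    alpha assigns constants to the parameters of rhs; the parameters of lhs
    are instantiated by applying rho first.
    """
    lhs_alpha = {z: alpha[rho[z]] for z in rho}
    rhs_answers = answer_set(rhs, alpha, graph_edges)
    lhs_answers = answer_set(lhs, lhs_alpha, graph_edges)
    return Fraction(len(rhs_answers), len(lhs_answers))


# Hypothetical friendship graph: an edge (a, b) means a considers b a close friend.
friends = [(1, 5), (1, 2), (3, 5), (3, 4)]
lhs = {"edges": [("z", "x1"), ("z", "x2")], "head": ("x1", "x2")}
rhs = {"edges": [("z", "x"), ("z", "s")], "head": ("x", "s")}

c = confidence(lhs, rhs, {}, {"s": 5}, friends)
print(c)  # prints 3/7
```

Here three of the seven co-friend pairs have person #5 as their second component, so the association holds with confidence 3/7.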

### 3.5 Mining Problems

We are now finally ready to define the graph mining problems we want to solve.

#### Mining Tree Queries

Input:

A data graph $G$; a threshold minsup.

Output:

All frequent instantiated tree queries $Q^{\alpha}$.

In theory, however, there are infinitely many $k$-frequent tree queries, and even if we set an upper bound on the size of the patterns, there may be exponentially many. As an extreme example, if $G$ is the complete graph on the set of nodes $\{1, \dots, n\}$, and $k \le n$, then any instantiated pattern with all parameters assigned to values in $\{1, \dots, n\}$, and with at least one distinguished variable, is frequent.

Hence, in practice, we want an algorithm that runs incrementally, and that can be stopped any time it has run long enough or has produced enough results. We introduce such an algorithm in Section 4.

#### Association Rule Mining

Input:

A data graph $G$; a threshold minsup; a parameterized tree query $Q_1$; and a threshold minconf.

Output:

All iARs $(Q_1 \Rightarrow_{\rho} Q_2, \alpha)$ that are legal, frequent and confident in $G$.

In theory, however, there are infinitely many legal, frequent and confident association rules for a fixed lhs, and even if we set an upper bound on the size of the rhs, there may be exponentially many. Hence, in practice, we want an algorithm that runs incrementally, and that can be stopped any time it has run long enough or has produced enough results. We introduce such an algorithm in Section 5.

## 4 Mining Tree Queries

In this section we present an algorithm for mining frequent instantiated tree queries in a large data graph. But first we show that we do not need to tackle the problem in its full generality.

### 4.1 Problem Reduction

In this subsection we show that, without loss of generality, we can focus on parameterized tree queries that are ‘pure’.

Pure tree queries. To define this formally, assume that all possible variables (nodes of tree patterns) have been arranged in some fixed but arbitrary order. We then call a parameterized tree query $Q = (H, P)$ pure when $H$ consists of the enumeration, in order and without repetitions, of all the distinguished variables of $P$. In particular $H$ cannot contain parameters. We call $H$ the pure head for $P$. As an illustration, the parameterized tree query in Figure ? is pure, while the parameterized tree query in Figure ? is not pure.

A parameterized tree query that is not pure can always be rewritten to a parameterized tree query that is pure, in such a way that all instantiations of the impure query correspond to instantiations of the pure query, with the same frequency. Indeed, take a parameterized tree query $Q = (H, P)$. We can purify $Q$ by removing all parameters and repetitions of distinguished variables from $H$, and sorting $H$ by the order on the variables. An illustration of this is given in Figure ?.

We can conclude that it is sufficient to only consider pure instantiated tree queries. As a consequence, rather than mining tree queries, it suffices to mine for tree patterns, because the frequency of a query is nothing other than the frequency of its body, i.e., a pattern. An illustration is given in Figure ?.
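The purification step is simple enough to sketch directly. The function name `purify`, the representation of a head as a tuple of variable names, and the example order on variables are hypothetical.

```python
def purify(head, parameters, variable_order):
    """Return the pure head: the distinguished variables of the head,
    without repetitions, sorted by the fixed order on variables."""
    rank = {v: i for i, v in enumerate(variable_order)}
    distinguished = {v for v in head if v not in parameters}
    return tuple(sorted(distinguished, key=rank.__getitem__))


impure_head = ("x2", "s", "x1", "x2")   # contains a parameter and a repetition
pure_head = purify(impure_head, {"s"}, ["x1", "x2", "s", "z"])
print(pure_head)  # prints ('x1', 'x2')
```

The body is left untouched, so the frequency, which depends only on the body, does not change.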

### 4.2 Overall Approach

An overall outline of our tree-query mining algorithm is the following:

Outer loop:

Generate, incrementally, all possible trees of increasing sizes. Avoid trees that are isomorphic to previously generated ones.

Inner loop:

For each tree $T$, generate all instantiated tree patterns based on $T$, and test their frequency.

The algorithm is incremental in the number of nodes of the pattern. We generate canonically ordered rooted trees of increasing sizes, avoiding the generation of isomorphic duplicates. It is well known how to do this efficiently [37]. Note that this generation of trees is in no way “levelwise” [31]. Indeed, under the way we count pattern occurrences, a subgraph of a pattern might be less frequent than the pattern itself (this was already pointed out by Kuramochi and Karypis [29]). So, our algorithm systematically considers ever larger trees, and can be stopped any time it has run long enough or has produced enough results. Our algorithm does not need any space beyond what is needed to store the mining results. The outer loop of our algorithm will be explained in more detail in Section 4.3.

For each tree, all conjunctive queries based on that tree are generated. Here, we do work in a levelwise fashion. This aspect of our algorithm has clear similarities with “query flocks” [39]. A query flock is a user-specified conjunctive query, in which some constants are left unspecified and viewed as parameters. A levelwise algorithm was proposed for mining all instantiations of the parameters under which the resulting query returns enough answers. We push that approach further by also mining the query flocks themselves. Consequently, the specialization relation on queries used to guide the levelwise search is quite different in our approach. The inner loop of our algorithm will be explained in more detail in Section 4.4.

A query based on some tree may be equivalent to a query based on a previously seen tree. Furthermore, two queries based on the same tree may be equivalent. We carefully and efficiently avoid the counting of equivalent queries, by using and adapting what is known from the theory of conjunctive database queries. This will be discussed in Section 4.5.

### 4.3 Outer Loop

In the outer loop we generate all possible trees of increasing sizes and we avoid trees that are isomorphic to previously generated ones. In fact, it is well known how to do this [37]. What these procedures typically do is generate trees that are canonically ordered in the following sense. Given an (unordered) tree $T$, we can order the children of every node in some way, and call this an ordering of $T$. For example, Figure ? shows two orderings of the same tree. From the different orderings of a tree $T$, we want to uniquely select one, to be the canonical ordering of $T$. For each possible ordering of $T$, we can write down the level sequence of the resulting tree. This is actually a string representation of the resulting tree. This level sequence is as follows: if the tree has $n$ nodes then this is a sequence of $n$ numbers, where the $i$th number is the depth of the $i$th node in preorder. Here, the depth of the root is 0, the depth of its children is 1, and so on. The canonical ordering of $T$ is then the ordering of $T$ that yields the lexicographically maximal level sequence among all possible orderings of $T$.

For example, in Figure ?, the left one is the canonical one.
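Level sequences and the canonical ordering can be sketched as follows. This only computes the canonical level sequence of one given tree; it is not the tree generation procedure of [37]. The child-list representation and the function names are assumptions for this example. The sketch relies on the observation that the maximal sequence is obtained by recursively placing subtrees with larger level sequences first.

```python
def level_sequence(children, root, depth=0):
    """Depths of the nodes in preorder, for the ordering given by the child lists."""
    sequence = [depth]
    for child in children.get(root, []):
        sequence.extend(level_sequence(children, child, depth + 1))
    return sequence


def canonical_level_sequence(children, root, depth=0):
    """Lexicographically maximal level sequence over all orderings of the tree."""
    subsequences = sorted(
        (canonical_level_sequence(children, child, depth + 1)
         for child in children.get(root, [])),
        reverse=True,
    )
    sequence = [depth]
    for subsequence in subsequences:
        sequence.extend(subsequence)
    return sequence


# A root r with children a and b, where b has one child c.
tree = {"r": ["a", "b"], "b": ["c"]}
print(level_sequence(tree, "r"))            # prints [0, 1, 1, 2]
print(canonical_level_sequence(tree, "r"))  # prints [0, 1, 2, 1]
```

Two unordered trees are isomorphic exactly when their canonical level sequences coincide, which is what makes the sequence usable for duplicate-free generation.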

### 4.4 Inner Loop

Let $G$ be the data graph being mined, and let $N$ be its set of nodes. In this section, we fix a tree $T$, and we want to find all instantiated tree patterns based on $T$ whose frequency in $G$ is at least minsup.

This task lends itself naturally to a levelwise approach [31]. A natural choice for the specialization relation is suggested by an alternative notation for the patterns under consideration. Concretely, since the underlying tree is fixed, any parameterized tree pattern based on $T$ is characterized by two parameters:

1. The set $\Pi$ of existential nodes;

2. The set $\Sigma$ of parameters.

Note that $\Pi$ and $\Sigma$ are disjoint.

Thus, a parameterized tree pattern $P$ is completely characterized by the pair $(\Pi, \Sigma)$. An instantiation $P^{\alpha}$ of $P$ is then represented by the triple $(\Pi, \Sigma, \alpha)$. For two parameterized tree patterns $P_1 = (\Pi_1, \Sigma_1)$ and $P_2 = (\Pi_2, \Sigma_2)$ we now say that $P_2$ specializes $P_1$ if $\Pi_1 \subseteq \Pi_2$ and $\Sigma_1 \subseteq \Sigma_2$. We also say that $P_1$ generalizes $P_2$.

Parent An immediate generalization of a tree pattern is called a parent. Formally, let and be parameterized tree patterns based on . We say that is a parent of if:

• and for some node ; or

• and for some node .
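As an illustration, the parent relation can be sketched on pairs of sets. This sketch assumes, as the levelwise description suggests, that a pattern is specialized by adding a node to the set of existential nodes or to the set of parameters, so that a parent is obtained by removing one node from one of the two sets. The function names are hypothetical.

```python
def parents(existential, parameters):
    """Return the immediate generalizations of the pattern (existential, parameters)."""
    result = []
    # Drop one existential node ...
    for node in sorted(existential):
        result.append((existential - {node}, parameters))
    # ... or drop one parameter.
    for node in sorted(parameters):
        result.append((existential, parameters - {node}))
    return result


def specializes(pattern, other):
    """True if `pattern` has at least the existential nodes and parameters of `other`."""
    (pi1, sigma1), (pi2, sigma2) = pattern, other
    return pi1 >= pi2 and sigma1 >= sigma2


pattern = (frozenset({2}), frozenset({3}))
example_parents = parents(*pattern)
```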

From the following lemma, it follows that specialized patterns have a lower frequency, as expected for a specialization relation:

We will show that by defining an injection .
Since is a parent of , we know that where is either an existential node or a parameter of . Note that each is of the form for some matching of . For each in , we fix arbitrarily . Now we define . To see that is an injection, let and suppose that . In other words, . In particular, , as desired.
Hence, we can conclude that and that .

The above lemma suggests the following definition of specialization among instantiated tree patterns: we say that is a specialization of if the parameterized tree pattern is a specialization of the parameterized tree pattern , and .

Intuitively, the previous lemma then expresses that the frequency of an instantiated tree pattern is always at most the frequency of any of its instantiated parents.

#### Candidate generation

Candidate pattern A candidate pattern is an instantiated tree pattern whose frequency is not yet determined, but all of whose generalizations are known to be frequent.

Using the specialization relation and the definition of a candidate pattern, we now explain how the levelwise search for frequent instantiated tree patterns proceeds.

Levelwise search We start with the most general instantiated tree pattern , and we progressively consider more specific patterns. The search has the typical property that, in each new iteration, new candidate patterns are generated; the frequency of all newly discovered candidate patterns is determined, and the process repeats.

There are many different instantiations to consider for each parameterized tree pattern. Hence, to generate candidate patterns in an efficient manner, we propose the use of candidacy tables and frequency tables. These candidacy and frequency tables allow us to generate all frequent instantiations for a particular parameterized tree pattern in parallel. A frequency table contains all frequent instantiations for a particular parameterized tree pattern.

Formally, for any parameterized pattern , we define:

Technically, the table has columns for the different parameters, plus a column freq. Note that when , i.e., has no parameters, this is a single-column, single-row table containing just the frequency of . This still makes sense and can be interpreted as boolean values; for example, if contains the empty tuple, then the pattern is frequent; if the table is empty, the pattern is not frequent. Of course in practice, all frequency tables for parameterless patterns can be combined into a single table. All frequency tables are kept in a relational database.

The following crucial lemma shows these tables can be populated efficiently.

For the ‘only-if’ direction: By definition of a candidacy table, if , then all generalizations of are frequent. In particular, for all parents of , we know that is frequent, since parents are generalizations.

For the ‘if’ direction, we must show that all generalizations of are frequent. Consider such a generalization . Let us denote the parent relation by . Then there is a sequence of parent patterns: . And we have:
. The last inequality is given by (i) or (ii), the other inequalities are given by Lemma ?.

The Join Lemma has its name because, viewing the tables as relational database tables, it can be phrased as follows:

Each candidacy table can be computed by taking the natural join of its parent frequency tables.

The only exception is when and is a singleton; this is the initial iteration of the search process, when there are no constants in the parent tables to start from. In that case, we define as the table with a single column , holding all nodes of the data graph being mined.
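The join can be tried out on a toy database. The sketch below uses Python's `sqlite3` module; the table names, column names, and frequency values are hypothetical, and it assumes that the `freq` column of each parent table is projected away first, so that the natural join matches on parameter columns only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical frequency tables of three parents of a pattern with parameters x2, x3.
    CREATE TABLE freq_parent_a (x2 TEXT, freq INTEGER);           -- parameter x3 dropped
    CREATE TABLE freq_parent_b (x3 TEXT, freq INTEGER);           -- parameter x2 dropped
    CREATE TABLE freq_parent_c (x2 TEXT, x3 TEXT, freq INTEGER);  -- one existential node fewer
    INSERT INTO freq_parent_a VALUES ('a', 7), ('b', 5);
    INSERT INTO freq_parent_b VALUES ('c', 6), ('d', 4);
    INSERT INTO freq_parent_c VALUES ('a', 'c', 5), ('a', 'd', 4), ('b', 'e', 4);
""")

# Natural join of the parent tables, restricted to their parameter columns.
candidates = conn.execute("""
    SELECT x2, x3
    FROM (SELECT x2 FROM freq_parent_a)
         NATURAL JOIN (SELECT x3 FROM freq_parent_b)
         NATURAL JOIN (SELECT x2, x3 FROM freq_parent_c)
    ORDER BY x2, x3
""").fetchall()
```

In this toy instance the pair (b, e) is not a candidate, because e does not occur in the frequency table of the parent with the single parameter x3.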

#### Frequency counting using SQL

The search process starts by determining the frequency of the underlying tree ; indeed, formally this amounts to computing . Similarly, for each parameterized tree pattern with , all we can do is determine its frequency, except that here, we do this only on condition that its parent patterns are frequent.

We have seen above that, if the frequency tables are viewed as relational database tables, we can compute each candidacy table by a single database query, using the Join Lemma. Now suppose the data graph that is being mined is stored in the relational database system as well, in the form of a table G(from,to). Then also each frequency table can be computed by a single SQL query.

Indeed, in the cases where this simply amounts to formulating the pattern in SQL, and determining its count (eliminating duplicates). Since our patterns are in fact conjunctive queries (or datalog rules) known from database research [2], they can easily be translated into SQL:

• The FROM-clause consists of all table references of the form G as Gij, for all edges in .

• The WHERE-clause consists of all equalities of the form Gij.from = Gik.from, as well as equalities of the form Gij.to = Gjh.from.

• The SELECT-clause is of the form SELECT DISTINCT and consists of all column references of the form Gij.to when is a distinguished node in , plus one reference of the form G1k.from if the root node is distinguished.

The SQL query for the tree in Figure ? with and is as follows:

E = SELECT G12.from, G23.to, G24.to
FROM G as G12, G as G23, G as G24
WHERE G12.to = G23.from AND G12.to = G24.from
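The translation rules can be sketched as a small query generator. This is an illustrative sketch under several assumptions: the function `tree_pattern_sql` and the alias scheme `G1_2` are hypothetical, the pattern is given as a list of (parent, child) edges with at least one edge and at least one distinguished node, and the columns `from` and `to` are quoted because they are reserved words in SQL. The query is run with Python's `sqlite3` module on a toy graph.

```python
import sqlite3


def tree_pattern_sql(edges, distinguished, root=1):
    """Build a SELECT DISTINCT query for a tree pattern given as (parent, child) edges."""
    alias = {(i, j): f"G{i}_{j}" for (i, j) in edges}
    incoming = {j: (i, j) for (i, j) in edges}  # the unique edge into each non-root node
    root_edges = [e for e in edges if e[0] == root]

    # FROM-clause: one copy of the edge table G per edge of the pattern.
    from_clause = ", ".join(f"G AS {alias[e]}" for e in edges)

    conditions = []
    # All edges leaving the root must start in the same data node.
    for e in root_edges[1:]:
        conditions.append(f'{alias[root_edges[0]]}."from" = {alias[e]}."from"')
    # An edge (j, h) must start where the edge into j ends.
    for (j, h) in edges:
        if j in incoming:
            conditions.append(f'{alias[incoming[j]]}."to" = {alias[(j, h)]}."from"')

    # SELECT-clause: one column per distinguished node.
    columns = []
    for node in distinguished:
        if node == root:
            columns.append(f'{alias[root_edges[0]]}."from" AS x{node}')
        else:
            columns.append(f'{alias[incoming[node]]}."to" AS x{node}')

    sql = f"SELECT DISTINCT {', '.join(columns)} FROM {from_clause}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE G ("from" TEXT, "to" TEXT)')
conn.executemany("INSERT INTO G VALUES (?, ?)", [("a", "b"), ("b", "c"), ("b", "d")])

# Tree with edges 1->2, 2->3, 2->4; node 2 is existential, the others distinguished.
query = tree_pattern_sql([(1, 2), (2, 3), (2, 4)], distinguished=[1, 3, 4])
answers = conn.execute(query).fetchall()
frequency = len(answers)  # number of distinct answers
```

On this toy graph the pattern has four distinct answers, since nodes 3 and 4 can each be mapped independently to c or d.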

But also when , we can compute by a single SQL query. Note that we thus compute the frequency of a large number of instantiated tree patterns in parallel! We proceed as follows:

1. We formulate the pattern in SQL; call the resulting expression E.

2. We then take the natural join of E and the candidacy table, group by the parameter columns, and count each group.

The join with the candidacy table ensures that only candidate patterns are counted.
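The two steps can be sketched with Python's `sqlite3` module. This is a minimal sketch under stated assumptions: the tree has edges 1->2, 2->3 and 2->4, node 2 is existential and node 3 is the only parameter; the table name `cantab`, the column names `x1`, `x3`, `x4`, the toy graph, and the threshold are all hypothetical. The `HAVING` clause is an added assumption that keeps only the instantiations reaching the threshold.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE G ("from" TEXT, "to" TEXT)')
conn.executemany(
    "INSERT INTO G VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("b", "d"), ("e", "b"), ("f", "g"), ("g", "c")],
)

# Hypothetical candidacy table for the single parameter x3.
conn.execute("CREATE TABLE cantab (x3 TEXT)")
conn.executemany("INSERT INTO cantab VALUES (?)", [("c",), ("d",)])

# Step 1: the pattern in SQL, with the parameter node selected as a column.
E = '''
    SELECT DISTINCT G12."from" AS x1, G23."to" AS x3, G24."to" AS x4
    FROM G AS G12, G AS G23, G AS G24
    WHERE G12."to" = G23."from" AND G12."to" = G24."from"
'''

# Step 2: join with the candidacy table, group by the parameter, count each group.
minsup = 5
freqtab = conn.execute(
    f'''
    SELECT x3, COUNT(*) AS freq
    FROM ({E}) NATURAL JOIN cantab
    GROUP BY x3
    HAVING COUNT(*) >= ?
    ORDER BY x3
    ''',
    (minsup,),
).fetchall()
```

Here the instantiation with parameter value c has five distinct answers and is kept, while the one with value d has only four and is dropped.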

For instance, the SQL query to compute the frequency table for the tree in Figure ?, with and , with E as above, is as follows:

It goes without saying that, whenever the frequency table of a tree pattern is found to be empty, the search for more specialized patterns is pruned at that point.

#### The algorithm

Putting everything together so far, the algorithm is given in Algorithm ?. In outline it is a double Apriori algorithm [3], where the sets form one dimension of itemsets, and the sets another. A graphical illustration of the algorithm is given in Figure ?. In this illustration we use tries (or prefix-trees) to store the itemsets. A trie [5] is commonly used in implementations of the Apriori algorithm.
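The two-dimensional levelwise structure can be illustrated by a heavily simplified skeleton. It is not a rendering of the algorithm itself: it ignores instantiations and the candidacy and frequency tables, and it replaces frequency counting by a hypothetical callback `is_frequent`. It only shows how pairs of an existential set and a parameter set are enumerated level by level, and how a pair is considered only when all of its parents are frequent.

```python
def parents(pattern):
    """Immediate generalizations: drop one existential node or one parameter."""
    pi, sigma = pattern
    return [(pi - {x}, sigma) for x in pi] + [(pi, sigma - {y}) for y in sigma]


def levelwise_search(nodes, is_frequent):
    """Enumerate frequent (existential, parameter) pairs level by level."""
    nodes = frozenset(nodes)
    level = {(frozenset(), frozenset())}  # most general pattern
    frequent = set()
    while level:
        found = {p for p in level if is_frequent(p)}
        frequent |= found
        next_level = set()
        for pi, sigma in found:
            for x in nodes - pi - sigma:
                # Specialize by making x existential, or by making x a parameter.
                for candidate in ((pi | {x}, sigma), (pi, sigma | {x})):
                    if all(q in frequent for q in parents(candidate)):
                        next_level.add(candidate)
        level = next_level
    return frequent


# Toy oracle (an assumption for illustration): patterns with at most one
# existential node or parameter in total are frequent.
result = levelwise_search({1, 2}, lambda p: len(p[0]) + len(p[1]) <= 1)
```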

#### Example run

In this Section we give an example run of the proposed algorithm in Algorithm ?. Consider the example data graph in Figure ?; the unordered rooted tree in Figure ?; and let the minimum support threshold be .

The example run then looks as follows:

### 4.5Equivalence among Tree Patterns

In this Section we make a number of modifications to the algorithm described so far, so as to avoid duplicate work.

As an example of duplicate work, consider the parameterized tree pattern from the example run in Section ? ( and ):

and the parameterized tree pattern :

Clearly, and have the same answer set for all data graphs , up to renaming of the distinguished variables ( by ). However, these patterns have different underlying trees, and hence Algorithm ? will compute the answer set for both patterns (line ). The answer set of is computed before the answer set of , since our algorithm is incremental in the number of nodes of . Hence, we can conclude that our algorithm performs some duplicate work which we want to avoid.

Another example of duplicate work our algorithm performs: Consider the parameterized tree pattern from the example run in Section ? ( and ):

and the parameterized tree pattern also from the example run in Section ? ( and ):

As one can see in Section ?, these two parameterized patterns have the same instantiations for all data graphs , up to renaming of the parameters ( by ), and for each instantiation, the same answer set for all data graphs , up to renaming of the distinguished variables ( by ). However, when we look at the outline of our algorithm in Algorithm ?, we see that for both patterns the candidacy and frequency tables are computed between line and line . Hence, we can conclude again that our algorithm performs duplicate work that we want to avoid.

In the rest of this Section we formalize this duplicate work and modify the algorithm described so far to avoid it.

#### Equivalency

Intuitively we call two parameterized tree patterns equivalent if they have the same answer sets and the same parameter assignments for all data graphs , up to renaming of the parameters and the distinguished variables. For instance, we call the parameterized tree patterns and from above equivalent, as are the tree patterns and from above.

To define equivalent parameterized tree patterns formally we introduce the notion of -equivalence.

-Equivalence Let and be two parameterized tree patterns and a parameter correspondence from to (recall Section 3.3). We define an answer set correspondence from to as any mapping . Furthermore, assume that and are bijections. We then say that and are -equivalent, denoted by , if for all data graphs , and all parameter assignments , we have , where denotes the set .

For example, consider the two parameterized tree patterns in Figure ?, and let be as follows:

and let be as follows:

The two parameterized tree patterns are clearly -equivalent, as are the three parameterized tree patterns shown in Figure ? with an empty parameter correspondence and the identity.

The parameter correspondence is a bijection in the definition of -equivalence, since intuitively we want equivalent parameterized tree patterns to have essentially the same set of instantiations. Hence it is necessary that the tree patterns have the same number of parameters. Intuitively we also want equivalent tree patterns to have the same answer sets up to renaming of the distinguished variables. That is the reason why an answer set correspondence is introduced that is a bijection.

We then define equivalent parameterized tree patterns as follows:

Equivalent parameterized tree patterns We call two parameterized tree patterns and equivalent if is -equivalent with for some bijective parameter correspondence and some bijective answer set correspondence .

Note that there can exist more than one parameter correspondence and more than one answer set correspondence for which the two parameterized tree patterns are -equivalent. An illustration of this is given in Figure ?. Let , , and be as follows: is the identity; is the identity and

Then the two tree patterns in Figure ? are clearly -equivalent and -equivalent.

Equivalence as just defined is a semantic property, referring to all possible data graphs, and it is not immediately clear how one could decide this property syntactically. The required syntactic notion is given by the following Lemma and Corollary.

Let us start with the if direction. We need to prove that for every parameter assignment for , and every data graph that . We know that since . We may rewrite this as: since is an enumeration of .

We also know that for every parameter assignment for since . Now take . We then have . Again since is an enumeration of we may rewrite this as: . Hence we can conclude that .

Let us then look at the only-if direction. To prove that , we will show that for every parameter assignment for , and every data graph , we have . Since , we have , and hence clearly .

To prove that , we will show that for every parameter assignment for , and every data graph , we have . We know that for every parameter assignment for , we have . Now take . We then have , hence , hence clearly .

Of course, we want to avoid that our algorithm considers some parameterized tree pattern if it is equivalent to an earlier considered parameterized tree pattern . Since our algorithm generates trees in increasing sizes, there are two cases to consider:

Case A:

has fewer nodes than .

Case B:

and have the same number of nodes.

Armed with the above Lemma and Corollary, we can now analyze the above two cases.

#### Case A: Redundancy checking

Let us start by defining the notion of a redundancy.

Redundant subtree A redundant subtree , is a subtree of a parameterized tree pattern , such that removing from yields a parameterized tree pattern that is equivalent with .

For example, the first two parameterized tree patterns in Figure ? indeed contain a redundant subtree.

The following lemma shows that two parameterized tree patterns with different numbers of nodes can only be equivalent if the larger one contains redundant subtrees:

Since and are equivalent we know from Corollary ? that the following exist:

1. an answer set correspondence that is a bijection;

2. a parameter correspondence that is a bijection;

3. a -containment mapping ; and

4. a -containment mapping .

with the pure head for . Since the number of nodes of is smaller than the number of nodes of , we know that some subtrees of are not in the range of . We will prove that these subtrees are redundant subtrees, by showing that and are equivalent.

Since the containment mappings and exist, we know that in particular the following containment mappings will exist:

1. , a -containment mapping from to , and

2. , a -containment mapping from to .

with and as above.

Let us now look at the following mappings: and . By Lemma ?, and are identity-containment mappings.

Using Corollary ? we can now conclude that and are (identity, identity)-equivalent and thus is a redundant subtree.

From this lemma it follows that Case A can only happen if contains redundant subtrees. Hence, if we can avoid redundancies, Case A will never occur.

The following lemma provides us with an efficient check for redundancies.

Before we prove this Lemma, let us see some examples. For instance the parameterized tree patterns in Figure ? and Figure ? contain a linear chain of existential nodes that is redundant. In both tree patterns this linear chain is rooted in . Another example of such a redundancy is given in Figure ?. Here the linear chain is rooted in . Note that when we remove the linear chain rooted in , we have a new linear chain rooted in that is redundant.

Let us refer to a subtree as described in the lemma as an “eliminable path”. An eliminable path is clearly redundant, so we only need to prove the only-if direction. Let be a redundant subtree of that is maximal, in the sense that it is not the subtree of another redundant subtree. Then following Corollary ?, there must be a -containment mapping from to with and bijections and the pure head for . All distinguished variables of must be in , since is a bijection. Also all parameters of must be in , since is also a bijection. So consists entirely of existential nodes.

Furthermore, note that must fix the root of , since the height of is at least that of .

Any iteration of is a -containment mapping from to by Lemma ?. Moreover, each induces a permutation on the set of distinguished variables and parameters. Since is finite, there are only a finite number of possible permutations of , namely . Hence, there will be an iteration and an iteration such that . Hence, is the identity on , because

There are now two possible cases.

1. itself is a linear chain. Let us then look at the parent of in . Again there are two possibilities:

1. : Since is a -containment mapping from to and is redundant, we know that must be mapped to another subtree of , , that is at least as deep as . Hence, is an eliminable path. An illustration is given in Figure ?.

2. : Then can only be an existential node. We now have two possibilities:

1. is the only subtree of . We will show that the subtree , rooted in is redundant as well. Clearly we have the following containment relations:

• , a -containment mapping from to ; and

• , a -containment mapping from to .

with and as above. By Corollary ?, is then a redundant subtree. This is in contradiction with the assumption that is maximal. Hence, it is impossible that has only one subtree and is existential. An illustration of this is given in Figure ?.

2. has more than one subtree. Consider such another subtree . We will show that all subtrees of consist entirely of existential nodes. Suppose a node is not an existential node. We then know that . However, since is a homomorphism and is an ancestor of , must be . But this is in contradiction with the assumption that . So must consist entirely of existential nodes. Hence this brings us to the second case where is not a linear chain. An illustration is given in Figure ?.

2. is not a linear chain. An easy induction on the height of shows that any non-linear tree consisting entirely of existential nodes must contain an eliminable path. If the height of is , there is an eliminable path of a single node: just choose one of the children of the root. If the height of is , consider the subtree of the root of with the smallest height, at most . Then there are two possibilities: if is a linear chain, we have found our eliminable path; if is not a linear chain, we know by induction that contains an eliminable path. Hence, , and thus also , contains an eliminable path as desired.

As we have seen in Section 4.4, our algorithm introduces existential nodes levelwise, one by one. This makes the redundancy test provided by the Redundancy Lemma particularly easy to perform. Indeed, if is a parameterized tree pattern that we already know has no redundancies, and we make one additional node existential, then it suffices to test whether thus becomes part of a subtree as in the Redundancy Lemma. If so, we will prune the entire search at .
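A check for eliminable paths might look as follows. This sketch assumes, following the examples and the proof above, that an eliminable path is a subtree forming a linear chain of existential nodes whose parent has another subtree that is at least as deep. For simplicity it scans the whole pattern rather than only the neighbourhood of the newly existential node. The representation (a `children` dictionary and a set of existential nodes) and all function names are hypothetical.

```python
def height(children, node):
    """Number of nodes on the longest downward path starting at `node`."""
    return 1 + max((height(children, c) for c in children.get(node, [])), default=0)


def is_existential_chain(children, existential, node):
    """True if the subtree at `node` is a linear chain of existential nodes."""
    while True:
        if node not in existential:
            return False
        kids = children.get(node, [])
        if len(kids) > 1:
            return False
        if not kids:
            return True
        node = kids[0]


def has_eliminable_path(children, existential):
    """True if some existential chain has a sibling subtree at least as deep."""
    for kids in children.values():
        for child in kids:
            if not is_existential_chain(children, existential, child):
                continue
            chain_height = height(children, child)
            if any(height(children, s) >= chain_height for s in kids if s != child):
                return True
    return False


# Root 1 has children 2 and 3; node 3 has a single child 4.
tree = {1: [2, 3], 3: [4]}
leaf_case = has_eliminable_path(tree, existential={2})      # leaf 2 can be folded onto 3
chain_case = has_eliminable_path(tree, existential={3, 4})  # sibling 2 is too shallow
```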

#### Case B: Canonical forms

We may now assume that and do not contain redundancies, for if they did, they would have been dismissed already.

Let us start by defining isomorphic parameterized tree patterns.

Isomorphic Parameterized Tree Patterns We call two parameterized tree patterns and isomorphic if there exists a homomorphism that is a bijection and that maps distinguished nodes to distinguished nodes, parameters to parameters, and existential nodes to existential nodes. We call an isomorphism. Since we are working with trees, is also a homomorphism.
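One standard way to test this kind of isomorphism, given here as an illustration rather than as the method used in the algorithm, is to compute an order-independent encoding of each pattern in which every node carries its kind, and to compare the encodings. The representation is hypothetical: a `children` dictionary, a `kind` dictionary with values "D" (distinguished), "P" (parameter) and "E" (existential), and a root node.

```python
def encode(children, kind, node):
    """Order-independent string encoding of the subtree rooted at `node`."""
    # Sorting the child encodings makes the result independent of child order.
    parts = sorted(encode(children, kind, c) for c in children.get(node, []))
    return "(" + kind[node] + "".join(parts) + ")"


def isomorphic(pattern1, pattern2):
    """True if a kind-preserving bijective homomorphism exists between the patterns."""
    (children1, kind1, root1), (children2, kind2, root2) = pattern1, pattern2
    return encode(children1, kind1, root1) == encode(children2, kind2, root2)


# Same shape and node kinds, but different node names and child order.
p1 = ({1: [2, 3], 3: [4]}, {1: "D", 2: "E", 3: "P", 4: "D"}, 1)
p2 = ({1: [5, 6], 5: [7]}, {1: "D", 5: "P", 6: "E", 7: "D"}, 1)
# Same shape as p1, but the existential node and the parameter are swapped.
p3 = ({1: [2, 3], 3: [4]}, {1: "D", 2: "P", 3: "E", 4: "D"}, 1)

same = isomorphic(p1, p2)
different = isomorphic(p1, p3)
```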

For example, the two parameterized tree patterns in Figure ? are indeed isomorphic with as follows:

Clearly, we have the following:

Using Corollary ? we have to show that the following exist:

1. a bijective answer set correspondence ;

2. a bijective parameter correspondence ;

3. a -containment mapping ; and

4. a -containment mapping .

with the pure head for .

Since and are isomorphic, there exists a homomorphism that is a bijection and that maps distinguished nodes to distinguished nodes, parameters to parameters and existential nodes to existential nodes.

We now take and . Then and are bijections since is a bijection.

For we will show that is a -containment mapping from to , with the pure head for :

• is a homomorphism;

• maps distinguished nodes to distinguished nodes and ;

• maps parameters to parameters and