Factorised Representations of Query Results

A preliminary version has been submitted for publication on March 1, 2011.

Dan Olteanu and Jakub Závodný
Computing Laboratory, University of Oxford
Wolfson Building, Parks Road, OX1 3QD, Oxford, UK

Query tractability has been traditionally defined as a function of input database and query sizes, or of both input and output sizes, where the query result is represented as a bag of tuples. In this report, we introduce a framework that allows us to investigate tractability beyond this setting. The key insight is that, although the cardinality of a query result can be exponential, its structure can be very regular and thus factorisable into a nested representation whose size is only polynomial in the size of both the input database and query.

For a given query result, there may be several equivalent representations, and we quantify the regularity of the result by its readability, which is the minimum over all its representations of the maximum number of occurrences of any tuple in that representation. We give a characterisation of select-project-join queries based on the bounds on readability of their results for any input database. We complement it with an algorithm that can find asymptotically optimal upper bounds and corresponding factorised representations.

1 Introduction

This paper studies properties related to the representation of results of select-project-join queries under bag semantics. In approaching this challenge, we depart from the standard flat representation of query results as bags of tuples and consider nested representations of query results that can be exponentially more succinct than a mere enumeration of the result tuples. The relationship between a flat representation and a nested, or factorised, representation is on a par with the relationship between logic functions in disjunctive normal form and their equivalent nested forms obtained by algebraic factorisation. When compared to flat representations of query results, factorised representations are both succinct and informative.

Cust
ckey  name
1     Joe
2     Dan
3     Li
4     Mo

Ord
ckey  okey  date
1     1     1995
1     2     1996
2     3     1994
2     4     1993
3     5     1995
3     6     1996

Item
okey  disc
1     0.1
1     0.2
3     0.4
3     0.1
4     0.4
5     0.1

Figure 1: A TPC-H-like database.

Example 1.

Consider a simplified TPC-H scenario with customers, orders, and discounted line items, as depicted in Figure 1. Each tuple is annotated with an identifier. The query reports all customers together with their orders and line items per order. A flat representation of the result is presented below:

ckey name okey date disc
1 Joe 1 1995 0.1
1 Joe 1 1995 0.2
2 Dan 3 1994 0.4
2 Dan 3 1994 0.1
2 Dan 4 1993 0.4
3 Li 5 1995 0.1

For each result tuple, the identifiers of tuples that contributed to it are shown. For instance, the input tuples with identifiers , , and contribute to the first result tuple. Our factorised representation is based on an algebraic factorisation of a polynomial that encodes the result. This encoding is constructed as follows. Each result tuple is annotated with a product of identifiers of tuples contributing to it. The whole result is then a sum of such products. For this example, the sum of products of identifiers is:

An equivalent nested expression would be:

A factorised representation of the result is an extension of this nested expression with values from the result tuples:

To correctly interpret this representation as a relation, we also need a mapping of identifiers to schemas. For instance, the identifiers to are mapped to , which serves as schema for tuples , , and .
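The construction of the sum of products and of its nested form can be replayed on the data of Figure 1. The following Python sketch is only an illustration: it assumes that the tuples of Cust, Ord, and Item carry the identifiers c1 to c4, o1 to o6, and i1 to i6 in the row order of Figure 1. These identifier names and the two helper functions are assumptions made for this sketch.

```python
# Figure 1 data; identifiers c1.., o1.., i1.. are assumed names (row order).
CUST = [(1, "Joe"), (2, "Dan"), (3, "Li"), (4, "Mo")]
ORD = [(1, 1, 1995), (1, 2, 1996), (2, 3, 1994),
       (2, 4, 1993), (3, 5, 1995), (3, 6, 1996)]
ITEM = [(1, 0.1), (1, 0.2), (3, 0.4), (3, 0.1), (4, 0.4), (5, 0.1)]


def flat_polynomial():
    """Sum of products: one product of identifiers per result tuple."""
    terms = []
    for c, (ckey, _name) in enumerate(CUST, 1):
        for o, (o_ckey, okey, _date) in enumerate(ORD, 1):
            for i, (i_okey, _disc) in enumerate(ITEM, 1):
                if ckey == o_ckey and okey == i_okey:
                    terms.append(f"c{c}*o{o}*i{i}")
    return " + ".join(terms)


def nested_polynomial():
    """Group line items under their order and orders under their customer."""
    per_customer = []
    for c, (ckey, _name) in enumerate(CUST, 1):
        per_order = []
        for o, (o_ckey, okey, _date) in enumerate(ORD, 1):
            items = [f"i{i}" for i, (i_okey, _d) in enumerate(ITEM, 1)
                     if ckey == o_ckey and okey == i_okey]
            if items:
                per_order.append(f"o{o}*({' + '.join(items)})")
        if per_order:
            per_customer.append(f"c{c}*({' + '.join(per_order)})")
    return " + ".join(per_customer)


print(flat_polynomial())
print(nested_polynomial())
```

Under these assumptions, the customer identifier of Dan occurs three times in the flat expression and once in the nested one.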

We can easily recover the result tuples from the factorised representation with polynomial delay, i.e., the delay between two successive tuples is polynomial in the size of the representation. For this, consider the parse tree of the representation. The inner nodes stand for product or sum, and the leaves for identifiers with tuples. A result tuple is a concatenation of the tuples at the leaves after choosing one child for each sum and all children for each product. We assume here that from a user perspective, iterating over the result with small delay is more important than presenting the whole result at once.
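The parse-tree traversal described above can be sketched in Python as follows. The encoding of f-representations as nested tuples, the helper names, and the identifiers c1, o1, i1, i2 are assumptions made for illustration. The sketch is a naive lazy enumeration and makes no attempt to match the delay and space guarantees discussed in the text.

```python
# Hypothetical encoding of an f-representation as nested tuples:
#   ("leaf", identifier, values), ("sum", [children]), ("prod", [children]).

def leaf(ident, **values):
    return ("leaf", ident, values)


def enumerate_tuples(node):
    """Lazily yield (identifiers, values) for every tuple encoded by node."""
    kind = node[0]
    if kind == "leaf":
        yield (node[1],), dict(node[2])
    elif kind == "sum":
        # Choose one child of a sum.
        for child in node[1]:
            yield from enumerate_tuples(child)
    else:
        # Take all children of a product and concatenate their tuples.
        yield from _combine(node[1])


def _combine(children):
    if not children:
        yield (), {}
        return
    for ids, values in enumerate_tuples(children[0]):
        for rest_ids, rest_values in _combine(children[1:]):
            yield ids + rest_ids, {**values, **rest_values}


# Joe's part of Example 1, with assumed identifiers c1, o1, i1, i2.
joe = ("prod", [
    leaf("c1", ckey=1, name="Joe"),
    leaf("o1", okey=1, date=1995),
    ("sum", [leaf("i1", disc=0.1), leaf("i2", disc=0.2)]),
])

for ids, values in enumerate_tuples(joe):
    print(ids, values)
```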

Factorised representations can be more informative than flat representations in that they better explain the result and spell out the extent to which certain input fields contribute to result tuples either individually or in groups with other fields. This enables a shift in the presentation of the result from a tuple-by-tuple view to a kernel view, in which commonalities across result tuples are made explicit by exploiting the factorised representation. We can depict it graphically as its parse tree or textually as a serialisation of this tree in tabular form.

Example 2.

The textual presentation of our factorised representation in Example 1 could be the left one below:

ckey  name  okey  date  disc
1     Joe   1     1995  0.1
                        0.2
2     Dan   3     1994  0.4
                        0.1
            4     1993  0.4
3     Li    5     1995  0.1

------------
name     items
------------
Joe      LCD
Dan   x  LED
Li
------------
Mo       BW
------------

It is easy to see that two discounted line items (with discount 0.1 and 0.2) are for the same order 1 of customer Joe.

Consider now the following factorised representation

where to identify suppliers, and to identify items. This representation encodes that Joe, Dan, and Li supply both LCD and LED TV sets, and Mo supplies BW TV sets. A textual presentation of this result could be the right one above. The blocks between the horizontal lines encode tuples obtained by combining any of the names with any of the items. This relational product is suggested by the x symbol between the blocks. (We skip the details on the mapping between the parse trees of factorised expressions and their tabular presentations.)

In the factorised representation and in contrast to its equivalent flat representation , each identifier only occurs once. We seek good factorised representations of a query result in which each identifier occurs a small number of times. The maximum number of occurrences of any identifier in a representation, or in any of its equivalent representations, defines the readability of that representation. Readability implies bounds on the representation size. In our example, the size of the factorised representation is at most linear in the size of the input database, since its readability is one.

Our study of readability is with respect to tuple identifiers and aligns well with query evaluation under bag semantics. This is different from readability with respect to values. For instance, has readability one, yet a value may occur several times in the tuples of , e.g., the discount value of 0.1. Studying readability with respect to values is especially relevant to query evaluation under set semantics.

2 Contributions

The main contributions of this paper are as follows.

  • We introduce factorised representations, a succinct and complete representation system for (results of queries in) relational databases. In contrast to the standard tabular representation of a bag of tuples, factorised representations can be exponentially more succinct by factoring out commonalities across tuples. They also allow for an intuitive presentation, whereby commonalities across tuples are made explicit.

  • We give lower and upper bounds on the readability of basic queries with equality or inequality joins.

The following holds for select-project-join queries with equality joins.

  • We introduce factorisation trees that define generic classes of factorised representations for query results. Such trees are statically inferred from the query and are independent of the database instance. A factorised representation modelled on has the nesting structure of for any input database.

  • We give a tight characterisation of queries based on their readability with respect to factorisation trees. For any query , we can find a rational number such that the readability of is at most for any database , while for any factorisation tree there exist databases for which the factorisation of modelled on has at least occurrences of some identifier.

  • For any query , we present an algorithm that iterates over the factorisation trees of and finds an optimal one . Given , we present a second algorithm that computes in time for any database a factorised representation of with readability at most and at most occurrences of identifiers.

  • Our characterisation captures as a special case the known class of hierarchical non-repeating queries [dalvi07efficient] that have readability one [OH2008]. We also show that non-hierarchical non-repeating queries have readability for arbitrarily large databases .

Section 10 shows how to extend the above results to selections that contain equalities with constants. Proofs are deferred to the appendix.

3 Related Work

Our study has strong connections to work on readability of Boolean functions, provenance and probabilistic databases, streamed query evaluation, syntactic characterisations of queries with polynomial time combined complexity or polynomial output size, and selectivity estimation in relational engines. The present work is nevertheless unique in its use of succinct nested representations of query results.

The notion of readability is borrowed from earlier work on Boolean functions, e.g., [Golumbic06a, Golumbic08, Elbassioni09]. As in our case, a formula $\phi$ is read-$m$ if each variable appears at most $m$ times in $\phi$, and the readability of a formula or a function $\phi$ is the smallest number $m$ such that there is a read-$m$ formula equivalent to $\phi$. Checking whether a monotone function in disjunctive normal form has readability one can be done in time linear in both the number of terms and number of variables [Golumbic08]. This problem is open for , and already hard for or for and monotone nested functions [Elbassioni09]. This strand of work differs from ours in two key points. Firstly, we only consider algebraic, and not Boolean, equivalence; in particular, idempotence ($x \cdot x = x$) is not considered since a reduction in the arity of any product in the representation would violate the mapping between tuple fields and schemas. Secondly, we only consider functions/formulas arising as results of queries, and classify queries based on worst-case analysis of the readability of their results.

The hierarchical property [dalvi07efficient] of queries plays a central role in studies with seemingly disparate focus, including the present one, probabilistic databases, and streamed query evaluation. Our characterisation of query readability essentially revolves around how far the query is from its hierarchical subqueries. We show that, within the class of queries without repeating relation symbols, the readability of any non-hierarchical query is dependent on the size of the input database, while for any hierarchical query, the readability is always one. This latter result draws on earlier work in the context of probabilistic databases [OH2008, OHK2009, FinkOlteanu:ICDT:2011], where read-once polynomials over random variables are useful since their exact probability can be computed in polynomial time. Read- functions for are of no use in probabilistic databases, since probability computation for such functions over random variables is #P-hard [Vadhan2001]. In our case, however, readability polynomial in the sizes of the input database and query is acceptable, since it means that the size of the result representation is polynomial, too.

Mirroring the dichotomies in the probabilistic and query readability contexts, it has been recently shown that the hierarchical property divides queries that can be evaluated in one pass from those that cannot in the finite cursor machine model of computation [Grohe:TCS:2009]. In this model, queries are evaluated by first sorting each relation, followed by one pass over each relation. It would be interesting to investigate the relationship between the readability of a query and the number of passes necessary in this model to evaluate it.

Our study fits naturally in the context of provenance management [Green:PODS:2007]. Indeed, the polynomials over tuple identifiers discussed in Example 1 are provenance polynomials and nested representations are algebraic factorisations of such polynomials. In this sense, our work contributes a characterisation of queries by readability and size of their provenance polynomials.

Earlier work in incomplete databases has introduced a representation system called world-set decompositions [OKA08gWSD] to represent succinctly sets of possible worlds. Such decompositions can be seen as factorised representations whose structure is a product of sums of products.

There exist characterisations of conjunctive queries with polynomial time combined complexity [AHV95]. The bulk of such characterisations is for various classes of Boolean queries under set semantics. In this context, even simple non-Boolean conjunctive queries such as a product of $n$ relations would require evaluation time exponential in $n$. Our approach exposes the simplicity of this query, since its readability is one and the smallest factorised representation of its result has linear size only and can be computed in linear time. Factorised representations could thus lead to larger classes of tractable queries.

Finally, there has been work on deriving bounds on the cardinality of query results in terms of structural properties of queries [Gottlob99, AGM08, Gottlob09a]. Our work uses the results in [AGM08] and quantifies how much they can be improved due to factorised representations.

4 Preliminaries

Databases. We consider relational databases as collections of annotated relation instances, as in Example 1. Each relation instance is a bag of tuples in which each tuple is annotated by an identifier. We denote by the set of identifiers in , by the schema of , and call the pair its signature.

The size of a relation instance is the number of tuples in , denoted by . The number of distinct tuples in is denoted by . The size of a database is the total number of tuples in all relations of .

Remark 1.

For the purpose of analysing the complexity of our algorithms, we assume that the tuples in the input database are of constant size. In many scenarios, this is however not realistic since even the encodings of the tuple identifiers must have size at least logarithmic in . If the maximal size of a tuple in is , the time complexity increases by an additional factor or similar, depending on the exact computation model used.

Queries. We consider conjunctive or select-project-join queries written in relational algebra but with evaluation under bag semantics. Such queries have the form , where are relations, is a conjunction of equalities of the form with attributes and , and is a list of attributes of relations to . The size of the query is the total number of relations and attributes in .

Let be a query and be a database containing a relation instance of the correct schema for each relation in . The result of the query on the database is a relation instance whose tuples are exactly those for which and . The tuple is annotated by , where is the identifier of in .

Every query can be brought into an equivalent form where all relations as well as all their attributes are distinct. To recover the original query from the rewritten one , we keep a function that maps the relations in to relations in , and the attributes of in to those of in . For technical reasons, we will only consider the rewritten queries in further text, the mapping will carry the information about different relation symbols representing the same relation. If a query has two relations with the same mapping , then is repeating; otherwise, is non-repeating.

For any attribute $A$, let $\mathcal{A}$ be its equivalence class, that is, the set of all attributes that are transitively equal to $A$ in the selection condition of the query, and let $r(A)$ be the set of relations that have attributes in $\mathcal{A}$.

A query is hierarchical if for any two attributes $A$ and $B$, either $r(A) \subseteq r(B)$, or $r(A) \supseteq r(B)$, or $r(A) \cap r(B) = \emptyset$. (Footnote: The original definition [dalvi07efficient] does not consider the output attributes when checking the hierarchical property.)

Example 3.

The query from Example 1 in the introduction is non-repeating and not hierarchical.

Consider the relations , , and over schemas , , and respectively. The query is not hierarchical (independently of the set ), since , , but . The query , equivalent to , is hierarchical, since .
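The hierarchical property can be tested mechanically from the schemas and the join equalities. The following Python sketch is a minimal illustration under assumptions of our own: attributes are written as (relation, attribute) pairs, equivalence classes are computed with a small union-find, and the two sample queries at the end are hypothetical and are not the queries of Example 3.

```python
from itertools import combinations


def attribute_classes(schemas, equalities):
    """Equivalence classes of qualified attributes (relation, attribute)."""
    parent = {(rel, attr): (rel, attr)
              for rel, attrs in schemas.items() for attr in attrs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right in equalities:
        parent[find(left)] = find(right)
    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())


def is_hierarchical(schemas, equalities):
    """Check containment or disjointness of the relation sets of all classes."""
    relation_sets = [{rel for rel, _attr in cls}
                     for cls in attribute_classes(schemas, equalities)]
    return all(x <= y or y <= x or not (x & y)
               for x, y in combinations(relation_sets, 2))


# Hypothetical query R(A), S(A, B), T(B) joined on A and on B.
chain = {"R": ["A"], "S": ["A", "B"], "T": ["B"]}
chain_eq = [(("R", "A"), ("S", "A")), (("S", "B"), ("T", "B"))]
print(is_hierarchical(chain, chain_eq))

# Hypothetical query R(A, B), S(A, C) joined on A.
star = {"R": ["A", "B"], "S": ["A", "C"]}
star_eq = [(("R", "A"), ("S", "A"))]
print(is_hierarchical(star, star_eq))
```

In the first query the classes of A and B share the relation S without either relation set containing the other, so the check fails. In the second query every pair of relation sets is nested or disjoint.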

5 Factorised Representations

In this section we formalise the notion of factorised representations, their algebraic equivalence, and readability. We also give tight bounds on the readability of certain factorised representations that are used in the next sections to derive bounds on the readability of query results.

Definition 1.

A factorised representation $E$, or f-representation for short, over a set Sign of signatures is

  • $E_1 + \ldots + E_n$, where $E_1$ to $E_n$ are f-representations over Sign, or

  • $E_1 \cdot \ldots \cdot E_n$, where $E_1$ to $E_n$ are f-representations over $\text{Sign}_1$ to $\text{Sign}_n$, respectively, and these signatures form a disjoint cover of Sign, or

  • $\langle x : t \rangle$, where $x \in I$ and $t$ is a tuple over schema $S$, and Sign $= \{(I, S)\}$.

The polynomial of $E$ is $E$ without tuples on identifiers. The size of (the polynomial of) $E$ is the total number of occurrences of identifiers in $E$.

Two examples of f-representations are given in Section 1. A relational database can have several algebraically equivalent f-representations, in the sense that these f-representations represent the same tuples and polynomials. Syntactically, we define equivalence of f-representations as follows.

Definition 2.

Two f-representations are equivalent if one can be obtained from the other using distributivity of product over sum and commutativity of product and sum.

Each f-representation has an equivalent flat f-representation, which is a sum of products. A product defines the tuple over schema , which is a concatenation of tuples to , and is annotated by the product .

Definition 3.

The relation encoded by an f-representation consists of all tuples defined by the products in the flat f-representation equivalent to .
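Definitions 2 and 3 suggest a simple way to compare two polynomials: expand both into sums of products using distributivity and compare the resulting multisets of products. The Python sketch below illustrates this under our own assumptions: a polynomial is an identifier string or a ("sum", [...]) or ("prod", [...]) node, and the sample identifiers x1, x2, y1, y2, y3 are hypothetical. The expansion can be exponentially larger than its input, so this is a didactic check rather than an efficient procedure.

```python
from collections import Counter
from itertools import product


def expand(node):
    """Flat polynomial of node as a multiset of products (sorted tuples)."""
    if isinstance(node, str):
        return Counter({(node,): 1})
    kind, children = node
    if kind == "sum":
        total = Counter()
        for child in children:
            total += expand(child)
        return total
    # Product: distribute over the expansions of all children.
    result = Counter({(): 1})
    for child in children:
        step = Counter()
        for (m1, k1), (m2, k2) in product(result.items(), expand(child).items()):
            step[tuple(sorted(m1 + m2))] += k1 * k2
        result = step
    return result


nested = ("sum", [("prod", ["x1", ("sum", ["y1", "y2"])]),
                  ("prod", ["x2", "y3"])])
flat = ("sum", [("prod", ["x1", "y1"]),
                ("prod", ["y2", "x1"]),
                ("prod", ["x2", "y3"])])

print(expand(nested) == expand(flat))
```

No idempotence is applied: a product that occurs twice in the expansion is counted twice, in line with bag semantics.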

Since flat f-representations are standard relational databases annotated with identifiers, it means that any relational database can be encoded as an f-representation. This property is called completeness.

Proposition 1.

Factorised representations form a complete representation system for relational data.

In particular, this means that there are f-representations of the result of any query in a relational database.

Definition 4.

Let be a query, and be a database. An f-representation encodes the result if its equivalent flat f-representation contains exactly those products for which , and is the identifier of for all .

The signature set of consists of the signatures for each query relation , such that is the set of identifiers of the relation instance in corresponding to , and is the schema of in restricted to the attributes in .

Flat f-representations can be exponentially less succinct than equivalent nested f-representations, where the exponent is the size of the schema.

Proposition 2.

Any flat representation equivalent to the f-representation over the signatures and has size .
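The gap between nested and flat sizes can be made concrete with a small computation. The Python sketch below assumes the familiar product of n two-term sums, (x1 + y1) * ... * (xn + yn). This expression, the identifier names, and the helper functions are assumptions chosen for illustration and may differ from the exact expression of Proposition 2.

```python
from itertools import product


def nested_size(n):
    """Identifier occurrences in (x1 + y1) * ... * (xn + yn)."""
    return 2 * n


def flat_monomials(n):
    """All products obtained by picking x_i or y_i from each of the n sums."""
    return list(product(*[(f"x{i}", f"y{i}") for i in range(1, n + 1)]))


def flat_size(n):
    """Identifier occurrences in the equivalent sum of products."""
    return sum(len(monomial) for monomial in flat_monomials(n))


for n in (1, 2, 4, 8):
    print(n, nested_size(n), flat_size(n))
```

The flat form has 2^n products of n identifiers each, while the nested form has 2n identifier occurrences.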

In addition to completeness and succinctness, f-representations allow for efficient enumeration of their tuples.

Proposition 3.

The tuples of an f-representation can be enumerated with delay and space.

Besides the size, a key measure of succinctness of f-representations is their readability. We extend this notion to query results for any input database in Section 7.

Definition 5.

An f-representation $E$ is read-$k$ if the maximum number of occurrences of any identifier in $E$ is $k$. The readability of $E$ is the smallest number $k$ such that there is a read-$k$ f-representation equivalent to $E$.

Since the readability of $E$ is the same as that of its polynomial, we will use polynomials of f-representations when reasoning about their readability.

Example 4.

In Example 1, the flat sum-of-products polynomial is read-3 and its nested equivalent is read-1. They are equivalent and hence both have readability one.
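Counting identifier occurrences in a given polynomial is straightforward, as the following Python sketch shows. It reuses a hypothetical encoding (identifier strings and ("sum", [...]) or ("prod", [...]) nodes) and assumes the identifiers c1 to c3, o1 to o5, and i1 to i6 for the tuples that contribute to the result of Example 1. Note that the sketch only determines for which k a particular expression is read-k. Computing the readability itself would require a search over all equivalent expressions, which is not attempted here.

```python
from collections import Counter


def occurrences(node):
    """Number of occurrences of each identifier in a polynomial."""
    if isinstance(node, str):
        return Counter([node])
    _kind, children = node
    total = Counter()
    for child in children:
        total += occurrences(child)
    return total


def read_number(node):
    """The k for which this particular expression is read-k."""
    return max(occurrences(node).values())


def size(node):
    """Total number of occurrences of identifiers."""
    return sum(occurrences(node).values())


# Example 1 with assumed identifier names.
flat = ("sum", [("prod", list(t)) for t in [
    ("c1", "o1", "i1"), ("c1", "o1", "i2"), ("c2", "o3", "i3"),
    ("c2", "o3", "i4"), ("c2", "o4", "i5"), ("c3", "o5", "i6")]])
nested = ("sum", [
    ("prod", ["c1", "o1", ("sum", ["i1", "i2"])]),
    ("prod", ["c2", ("sum", [("prod", ["o3", ("sum", ["i3", "i4"])]),
                             ("prod", ["o4", "i5"])])]),
    ("prod", ["c3", "o5", "i6"]),
])

print(read_number(flat), size(flat))
print(read_number(nested), size(nested))
```

In both cases the size is bounded by the read number times the number of distinct identifiers, which is the immediate bound mentioned in the text.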

Given the readability and the number of distinct identifiers of a polynomial, we can immediately derive an upper bound on its size. A better upper bound can be obtained by taking into account the (possibly different) number of occurrences of each identifier. However, for polynomials of query results, the bound is often dominated by the readability .

In Section 7, we define classes of queries that admit polynomials of low readability, such as constant readability. We next give examples of polynomials with readability depending polynomially on the number of identifiers.

Lemma 1.

The polynomial has readability .

Lemma 1 can be generalised as follows.

Theorem 1.

The readability of the polynomial is .

If we drop the set of identifiers , the readability becomes one. However, if we restrict the relationship between the remaining identifiers, the readability increases again.

Theorem 2.

The readability of the polynomial is
and .

The polynomials and are relevant here due to their connection to queries: is the polynomial of the query , where and the schemas of , , and are , , and respectively, on the database where , and are full relations with and . Also, is the polynomial of the disequality query . If is replaced by in , the lower and upper bounds on readability on this new polynomial still hold, and we obtain the result of an inequality query.

A lower bound of on the readability of is already known even in the case when Boolean factorisation is allowed [Golumbic06a].

6 Factorisation Trees

Figure 2: F-trees for the query in Example 5.

We next introduce a generic class of factorised representations for query results, constructed using so-called factorisation trees, whose nesting structure and readability properties can be described statically from the query only. We present an algorithm that, given a factorisation tree of a query , and an input database , computes a factorised representation of , whose nesting structure is that defined by . Factorisation trees are used in Section 7 to obtain bounds on the readability of queries.

Definition 6.

A factorisation tree (f-tree) for a query is a rooted unordered forest , where

  • there is a one-to-one mapping between inner nodes in and equivalence classes of attributes of ,

  • there is a one-to-one mapping between leaf nodes in and relations in , and

  • the attributes of each relation only appear in the ancestors of its leaf.

Example 5.

Consider the relations , , , and over schemas , , , and respectively, and the query with . Figure 2 depicts two f-trees for .

Consider now the query with . Figure 7 shows two f-trees for as well as a partial tree that cannot be extended to an f-tree since the attributes and of lie in different branches.

Each f-tree for is a recipe for producing an f-representation of the result for any database . For a given query and database , this f-representation is called the -factorisation of and is denoted by . Figure 3 gives a recursive function that computes the -factorisation of . A more detailed implementation of this function, including an analysis of its time and space complexity, is given in Section 9.

Figure 3: The -factorisation of a query result is computed as , where is the constant true (an empty conjunction). For a relation in , is the corresponding relation instance in the input database .

The function recurses on the structure of . The parameter is a conjunction of equality conditions that are collected while traversing the f-tree top-down. Initially, is an empty conjunction . In case is a forest , we return the f-representation defined by the product of f-representations of each tree in . If is a single tree with root and children , we return the f-representation of a sum over all possible domain values of the attributes in of the f-representations of the children . To compute these, for each possible value we simply recurse on , appending to the equality condition . Finally, in case is a leaf , we return a sum of f-representations for result tuples in , that is, only those tuples that satisfy . (When evaluating the selection with on , we only consider the equalities on attributes of .) In the f-representation we only include attributes from ’s projection list, along with the tuple identifier.
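The recursion just described can be sketched in Python. This is a naive illustration under our own assumptions, not the implementation analysed in Section 9: an f-tree is a list of trees, an inner node is ("class", attributes, children) with attributes given as (relation, attribute) pairs, a leaf is ("rel", name), the conditions collected so far are kept in a dictionary gamma, and empty sums are represented by None. Projection is ignored, so every leaf keeps all attributes of its tuple. The relations R(A, B) and S(B, C) and their identifiers are hypothetical.

```python
def factorise(forest, db, gamma=None):
    """Product over the trees of a forest; None stands for an empty result."""
    gamma = gamma or {}
    factors = []
    for tree in forest:
        rep = factorise_tree(tree, db, gamma)
        if rep is None:          # an empty factor makes the product empty
            return None
        factors.append(rep)
    return factors[0] if len(factors) == 1 else ("prod", factors)


def factorise_tree(tree, db, gamma):
    if tree[0] == "rel":
        # Leaf: sum of the tuples of the relation that satisfy gamma.
        name = tree[1]
        leaves = [("leaf", ident, row) for ident, row in db[name]
                  if all(row[attr] == value
                         for (rel, attr), value in gamma.items() if rel == name)]
        return ("sum", leaves) if leaves else None
    # Inner node: sum over the values of its attribute class.
    _kind, attrs, children = tree
    domain = sorted({row[attr] for rel, attr in attrs for _id, row in db[rel]})
    summands = []
    for value in domain:
        extended = {**gamma, **{qualified: value for qualified in attrs}}
        rep = factorise(children, db, extended)
        if rep is not None:
            summands.append(rep)
    return ("sum", summands) if summands else None


def identifiers(rep):
    """All identifier occurrences in an f-representation, in order."""
    if rep[0] == "leaf":
        return [rep[1]]
    return [ident for child in rep[1] for ident in identifiers(child)]


# Hypothetical query joining R(A, B) and S(B, C) on B; B is the root class.
f_tree = [("class", [("R", "B"), ("S", "B")], [
    ("class", [("R", "A")], [("rel", "R")]),
    ("class", [("S", "C")], [("rel", "S")]),
])]
db = {
    "R": [("r1", {"A": 1, "B": 1}), ("r2", {"A": 2, "B": 1}), ("r3", {"A": 1, "B": 2})],
    "S": [("s1", {"B": 1, "C": 1}), ("s2", {"B": 1, "C": 2}), ("s3", {"B": 3, "C": 1})],
}
rep = factorise(f_tree, db)
print(identifiers(rep))
```

For this input the result encodes the four joined tuples for B = 1 as a product of two sums, and each contributing identifier occurs once. Tuples without a join partner (r3 and s3) do not appear.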

The symbolic products and sums in Figure 3 are of course expanded out to produce a valid f-representation. However, we will often keep the sums symbolic, abbreviate to and write instead of for the expression generated by the leaves. The condition can be inferred from the position in the expression, so we can still recover the original representation and write out the sums explicitly. Such an abbreviated form is independent of the database and conveniently reveals the structure of any -factorisation.

1 1 1
1 2 2
2 1 2
2 2 1

1 1 1
1 1 2
1 2 1
2 1 1

1 2
2 1
2 2

1 1
2 1
2 2

Figure 4: Database used in Example 6.

Example 6.

Consider the query from Example 5 and the f-trees from Figure 2. For any database, the left f-tree yields

while the right f-tree yields

both in abbreviated form. A procedure to produce the explicit form of is shown in Figure 5.

For the particular database given in Figure 4, the f-representations and yield the polynomials

They are equivalent to each other and to the polynomial of the flat f-representation of ,

Whereas is read-, both and are read-.

foreach value Dom do output sum of

Figure 5: A procedure for producing -factorisations in explicit form. The abbreviated form is . is the left f-tree in Figure 2.
Remark 2.

For any query , consider the f-tree in which the nodes labelled by the attribute classes all lie on a single path, and the leaves labelled by the relations are all attached to the lowest node in that path. Such a tree produces the -factorisation in which we sum over all values of all attributes and for each combination of values we output the product over all relations of the sums of tuples which have the given values. If all the tuples in the input relations are distinct, the -factorisation is just a sum of products, that is, the flat f-representation of the result.

Thus, for a non-branching tree we obtain a flat representation of . The more branching the tree has, the more factorised the -factorisation of is.

The correctness of our construction for a general query and database is established by the following result.

Proposition 4.

For any f-tree of a query and any database , is an f-representation of .

We next introduce definitions concerning f-trees for later use. Consider an f-tree of a query . An inner node of is relevant to a relation if it contains an attribute of . For a relation , let be the set of inner nodes appearing on the path from the leaf to its root in , be the set of nodes relevant to , and . For example, in the left f-tree of Figure 2, and . In the right f-tree, , yet . In fact, there is no f-tree for the query in Example 5 such that for each relation . This is because the query is not hierarchical.

Proposition 5.

A query is hierarchical iff it has an f-tree such that for each relation .

The left two trees shown in Figure 7 are f-trees of a hierarchical query. The first f-tree satisfies the condition in Proposition 5, whereas the second does not.

7 Readability of Query Results

The readability of a query on a database is the readability of any f-representation of , that is, the minimal possible such that there exists a read- representation of .

In this section we give upper bounds on the readability of arbitrary select-project-join queries with equality joins in terms of the cardinality of the database . We then show that these bounds are asymptotically tight with respect to statically chosen f-trees. By this we mean that for any query , if we choose an f-tree , there exist arbitrarily large database instances for which the -factorisation of is read- with asymptotically close to our upper bound. In the next section we give algorithms to compute these bounds. We conclude the section with a dichotomy: In the class of non-repeating queries, hierarchical queries are the only queries whose readability for any database is 1 and hence independent of the size of the database.

A key result for all subsequent estimates of readability is the following lemma that states the exact number of occurrences of any identifier of a tuple in the -factorisation of as a function of the f-tree , the query , and the database .

Let be a relation of , denote by the condition the conjunction of equalities of the attributes of to corresponding values in , and denote . In the -factorisation of , multiple occurrences of the same identifier from arise from the summations over the values of attributes from . Lemma 2 quantifies how many different choices of such values in the summations thus yield a given identifier from . Recall that the projection attributes do not influence the cardinality of the query result and hence the number of occurrences of its identifiers, since we consider bag semantics.

Lemma 2.

The number of occurrences of the identifier of a tuple from in the -factorisation of is

For example, for the left f-tree in Figure 2, all identifiers in , , and occur once, whereas any identifier of may occur as many times as distinct values in , , and . For the leftmost f-tree in Figure 7, all identifiers in all relations occur once, since no relation has non-relevant nodes.

Lemma 2 represents an effective tool to further estimate the readability and size of -factorisations. Our results build upon existing bounds for query result sizes and yield readability bounds which can be inferred statically from the query. Lemma 2 can potentially also be coupled with estimates on selectivities and various assumptions on attribute-value correlations [Muralikrishna:SIGMOD:1988, Poosala:VLDB:1997, Getoor:SIGMOD:2001, Re:Cardinality:2010] to infer database-specific estimates on the readability.

7.1 Upper Bounds

Let be a database, let be a query, let be an f-tree of , and let be a relation in . Denote , by the condition restricted to the attributes of , by the query , and by the database obtained by projecting each relation in onto the attributes of .

Lemma 3.

The number of occurrences of any identifier from in the -factorisation of is at most .


By Lemma 2, the number of occurrences of is equal to

from which we obtain the desired bound by straightforward estimates:

The number of distinct tuples in an equi-join query such as can be estimated in terms of the database size using the results in [AGM08]. Intuitively, if we can cover all attributes of the query by some of its relations, then is at most the product of the sizes of these relations, which is in turn at most . This corresponds to an edge cover of size in the hypergraph of . The following result strengthens this idea by lifting covers to a weighted version.

Definition 7.

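For an equi-join query $Q$ over relations $R_1, \ldots, R_n$, the fractional edge cover number $\rho^*(Q)$ is the cost of an optimal solution to the linear program with variables $x_{R_1}, \ldots, x_{R_n}$,

\[
\begin{aligned}
\text{minimise} \quad & \sum_{i} x_{R_i} \\
\text{subject to} \quad & \sum_{i \,:\, R_i \text{ has an attribute in the class of } A} x_{R_i} \ge 1 \quad \text{for all attributes } A, \\
& x_{R_i} \ge 0 \quad \text{for all } i.
\end{aligned}
\]

Lemma 4 ([AGM08]).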

For any equi-join query and for any database , we have .

Together with Lemma 3, this yields the following bound.

Corollary 1.

The number of occurrences of any identifier from in the -factorisation of is at most .


By Lemma 3, the number of occurrences of in the -factorisation of is bounded above by . By Lemma 4, this is bounded above by , which is equal to . ∎

Corollary 1 gives an upper bound on the number of occurrences of identifiers from each relation. Let be the maximal number of relations which can contain the same identifier, that is, the maximal number of relations in mapping to the same relation name by . Defining to be the maximal possible over all relations from , we obtain an upper bound on the readability of the -factorisation of .

Corollary 2.

The -factorisation of is at most read-.

By considering the -factorisation with lowest readability, we obtain an upper bound on the readability of . Let be the minimal possible over all f-trees for .

Corollary 3.

For any query and any database , the readability of is at most .

Since , the readability of is at most .

Example 7.

For the query in Example 5 and the left f-tree in Figure 2, the relation is the only one with a non-empty query , where the condition is . Since the other relations have empty covers (thus of cost zero), we conclude that their identifiers occur at most once in the query result. We can cover with any subset of , , and . A minimal edge cover can be any of the relations, and the number of occurrences of any identifier of is thus linear in the size of that relation. The fractional edge cover number is also 1 and we obtain the same bound.

For the right f-tree in Figure 2, both and have non-empty queries and defining their non-relevant sub-query of : , where is . The attributes and can be covered by or by . A minimal cover thus has size 1. The minimal fractional edge cover also has cost 1.

Now consider a different query over the relations , , and , given by , with .

Figure 6: F-trees and for the query in Example 7.

Consider the left f-tree shown in Figure 6. For the relation , we have , and hence the restricted query will be . We need at least two of the relations to cover all attributes of , so the edge cover number is 2. However, in the fractional edge cover linear program, we can assign to each relation the value . The covering conditions at each attribute are satisfied, since each attribute belongs to two of the relations. The total cost of this solution is only . It is in fact the optimal solution, so . It is easily seen that (since can be covered either by or , and can be covered by either or ) and (since has no attributes), so . We obtain the upper bound on the number of occurrences of identifiers from , and hence on the readability of any -factorisation.

Note however that in the right f-tree in Figure 6, each of , , and is covered by only one of its relations, and hence . Any -factorisation will therefore have readability at most linear in .

In fact, no f-tree for has , so is in this sense optimal and .
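The gap between the edge cover number and its fractional relaxation can be checked mechanically on small instances. The following is a minimal sketch, not the algorithm of this report: it assumes a triangle-shaped hypergraph (three hypothetical relations `R`, `S`, `T`, each sharing one attribute with each of the others) as a stand-in instance, and it solves the linear program exactly by enumerating the vertices of the feasible region, which is only practical for a handful of relations. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination; None if A is singular."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def fractional_edge_cover(relations):
    """relations: dict name -> set of attributes. Returns (optimal cost, weights)."""
    names = sorted(relations)
    attrs = sorted(set().union(*relations.values()))
    n = len(names)
    # Every constraint has the form row . x >= rhs.
    rows = [([1 if a in relations[r] else 0 for r in names], 1) for a in attrs]
    rows += [([1 if i == j else 0 for j in range(n)], 0) for i in range(n)]
    best = None
    # An optimum is attained at a vertex, i.e. a feasible point at which
    # n linearly independent constraints hold with equality.
    for tight in combinations(rows, n):
        x = solve_square([row for row, _ in tight], [rhs for _, rhs in tight])
        if x is None:
            continue
        if all(sum(c * v for c, v in zip(row, x)) >= rhs for row, rhs in rows):
            cost = sum(x)
            if best is None or cost < best[0]:
                best = (cost, dict(zip(names, x)))
    return best


def edge_cover_number(relations):
    """Smallest number of relations whose attributes cover all attributes."""
    attrs = set().union(*relations.values())
    for k in range(len(relations) + 1):
        for chosen in combinations(relations, k):
            if set().union(*(relations[r] for r in chosen)) >= attrs:
                return k


# Assumed illustrative instance: each attribute belongs to exactly two relations.
triangle = {"R": {"A", "B"}, "S": {"B", "C"}, "T": {"A", "C"}}
cost, weights = fractional_edge_cover(triangle)
print(cost, weights, edge_cover_number(triangle))
```

On this assumed instance the integral cover needs two relations, while the fractional program spreads weight evenly over all three and achieves a strictly smaller cost. For anything beyond toy sizes a proper LP solver would be the natural replacement for the vertex enumeration.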

7.2 Lower Bounds

We also show that the obtained bounds on the numbers of occurrences of identifiers are essentially tight. For any query and any f-tree , we construct arbitrarily large databases for which the number of occurrences of some symbol is asymptotically as large as the upper bound.

The expression for the number of occurrences of an identifier, given in Lemma 2, states the size of a specific query result. As a first attempt to construct a small database with a large result for the query , we pick attribute classes of and let each of them attain different values. If each relation has attributes from at most one of these classes, each relation in will have size at most , while the result of will have size .

This corresponds to an independent set of nodes in the hypergraph of . We can again strengthen this result by lifting independent sets to a weighted version. Since the edge cover and the independent set problems are dual when written as linear programs, this lower bound meets the upper bound from the previous subsection. The following result, derived from results in [AGM08], forms the basis of our argument.

Lemma 5.

For any equi-join query , there exist arbitrarily large databases such that .
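The construction behind a lower bound of this kind can be sketched as follows. Given a fractional independent set (nonnegative weights on attribute classes whose sum within each relation is at most one), every attribute class receives a domain whose size is the target relation size raised to its weight, and every relation is the full product of its attribute domains. The code below is an illustration under assumptions, not the proof of the lemma: the triangle-shaped instance, the helper names, and the parameters `base` and `scale` (with target relation size `base ** scale`) are all hypothetical, and each weight times `scale` is assumed to be an integer.

```python
from fractions import Fraction
from itertools import product


def product_database(relations, weights, base, scale):
    """Build each relation as the full product of its attribute domains.

    Attribute a gets base ** (weights[a] * scale) values, so a relation whose
    weights sum to at most 1 has at most N = base ** scale tuples.
    """
    for attrs in relations.values():
        assert sum(weights[a] for a in attrs) <= 1, "not a fractional independent set"
    for w in weights.values():
        assert (Fraction(w) * scale).denominator == 1, "weight * scale must be an integer"
    size = {a: base ** int(Fraction(w) * scale) for a, w in weights.items()}
    return {name: set(product(*(range(size[a]) for a in attrs)))
            for name, attrs in relations.items()}


def natural_join(db, relations):
    """Join all relations on equal attribute names; returns a list of dicts."""
    result = [{}]
    for name, attrs in relations.items():
        extended = []
        for partial in result:
            for row in db[name]:
                candidate = dict(zip(attrs, row))
                # Keep the row only if it agrees with the attributes bound so far.
                if all(partial.get(a, v) == v for a, v in candidate.items()):
                    extended.append({**partial, **candidate})
        result = extended
    return result


# Assumed illustrative instance: weight 1/2 on each attribute of a triangle.
relations = {"R": ("A", "B"), "S": ("B", "C"), "T": ("A", "C")}
weights = {a: Fraction(1, 2) for a in "ABC"}
base, scale = 4, 2                      # target relation size N = 4 ** 2 = 16
db = product_database(relations, weights, base, scale)
joined = natural_join(db, relations)
print({name: len(rows) for name, rows in db.items()}, len(joined))
```

Each relation of this assumed instance has 16 tuples while the join has 64, that is, the relation size raised to the total weight 3/2 of the fractional independent set. Increasing `base` yields arbitrarily large databases with the same exponent.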

Now let be a query, let be an f-tree of and let be a relation in . Define , and as before. We can apply Lemma 5 to the expression from Lemma 2 to infer lower bounds for numbers of occurrences of identifiers in the -factorisation of .

Lemma 6.

There exist arbitrarily large databases such that each identifier from occurs in the -factorisation of at least times.

We now lift the result of Lemma 6 from the identifiers from to all identifiers in the -factorisation of .

Corollary 4.

There exist arbitrarily large databases such that the -factorisation of is at least read-.

Finally, by minimising over all f-trees , we find a lower bound on readability with respect to statically chosen f-trees.

Corollary 5.

Let be a query. For any f-tree of there exist arbitrarily large databases for which the -factorisation of is at least read-.

Example 8.

Let us continue Example 7. For the left f-tree in Figure 2, an independent set of attributes covering the relations , , and of the query is . Since only has one attribute, this is also the largest independent set, and the fractional relaxation of the maximum independent set problem also has optimal cost 1.

For the right f-tree in Figure 2 the situation is similar. A maximum independent set of attributes covering the relations and of the queries and is and has size 1.

The situation is more interesting for the query . Recall that for the left f-tree in Figure 6, , its attribute classes being . The maximum independent set for has size 1, since any two of its attribute classes are relevant to a common relation. However, the fractional relaxation of the maximum independent set problem allows us to increase the optimal cost to . In this relaxation, we want to assign nonnegative rational values to the attribute classes, so that the sum of values in each relation is at most one. By assigning to each attribute class the value , the sum of values in each relation is equal to one, and the total cost of this solution is . This is used in the proof of Lemma 6 to construct databases for which the identifiers from appear at least times in the -factorisation of , which shows that the upper bound from Example 7 is asymptotically tight.

Since all f-trees for have , the results in this subsection show that for any such f-tree we can find databases for which the readability of the -factorisation of is at least linear in .

7.3 Characterisation of Queries by Readability

For a fixed query, the obtained upper and lower bounds meet asymptotically. Thus our parameter completely characterises queries by their readability with respect to statically chosen f-trees.

Theorem 3.

Fix a query . For any database , the readability of is , while for any f-tree of , there exist arbitrarily large databases for which the -factorisation of is read-.

Theorem 3 subsumes the case of hierarchical queries.

Corollary 6.

Fix a query . If is hierarchical, the readability of for any database is bounded by a constant. If is non-hierarchical, for any f-tree of there exist arbitrarily large databases such that the -factorisation of is read-.

For non-repeating queries, the following result extends the above dichotomy to the case of readability irrespective of f-trees.

Theorem 4.

Fix a non-repeating query . If is hierarchical, then the readability of is 1 for any database . If is non-hierarchical, then there exist arbitrarily large databases such that the readability of is .
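The hierarchical property that drives this dichotomy is easy to test once each attribute class is mapped to the set of relations it occurs in. The sketch below assumes the usual formulation of the property (for any two attribute classes, their sets of relations are either disjoint or one contains the other) and ignores any special treatment of head attributes; the function name and the two sample inputs are hypothetical.

```python
from itertools import combinations


def is_hierarchical(attribute_relations):
    """attribute_relations: dict attribute class -> set of relation names using it."""
    for a, b in combinations(attribute_relations.values(), 2):
        # The two sets must be nested or disjoint.
        if not (a <= b or b <= a or a.isdisjoint(b)):
            return False
    return True


# Assumed sample inputs: a nested pattern and a triangle-shaped pattern.
nested = {"A": {"R", "S"}, "B": {"R"}, "C": {"T"}}
triangle = {"A": {"R", "T"}, "B": {"R", "S"}, "C": {"S", "T"}}
print(is_hierarchical(nested), is_hierarchical(triangle))
```

In the nested pattern every pair of sets is either contained in one another or disjoint, whereas in the triangle-shaped pattern any two attribute classes share exactly one relation without containment.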

Figure 7: Left to right: Two f-trees and a tree which cannot be extended to an f-tree, used in Example 9.

8 Algorithms for Query Characterisation

Given a query , we show how to compute the parameter characterising the upper bound on readability. We give an algorithm that iterates over all f-trees of to find one with minimum . We further prune the space of possible f-trees to avoid suboptimal choices.

The following lemma facilitates the search for optimal f-trees. Intuitively, since the parameter depends on the costs of fractional covers of for the relations of , and since is the restriction of to the attributes of , by shrinking the sets , the fractional cover number of and hence the parameter can only decrease.

Lemma 7.

If and are f-trees for a query , and