Functional Aggregate Queries with Additive Inequalities

Abstract

Motivated by fundamental applications in databases and relational machine learning, we formulate and study the problem of answering functional aggregate queries (FAQ) in which some of the input factors are defined by a collection of additive inequalities between variables. We refer to these queries as FAQ-AI for short.

To answer FAQ-AI in the Boolean semiring, we define relaxed tree decompositions and relaxed submodular and fractional hypertree width parameters. We show that an extension of the InsideOut algorithm using Chazelle’s geometric data structure for solving the semigroup range search problem can answer Boolean FAQ-AI in time given by these new width parameters. This new algorithm achieves lower complexity than known solutions for FAQ-AI. It also recovers some known results in database query answering.

Our second contribution is a relaxation of the set of polymatroids that gives rise to the counting version of the submodular width, denoted by #subw. This new width is sandwiched between the submodular and the fractional hypertree widths. Any FAQ and FAQ-AI over one semiring can be answered in time proportional to #subw and respectively to the relaxed version of #subw.

We present three applications of our FAQ-AI framework to relational machine learning: k-means clustering, training linear support vector machines, and training models using non-polynomial loss. These optimization problems can be solved over a database asymptotically faster than computing the join of the database relations.

1 Introduction

We consider the problem of computing functional aggregate queries with inequality joins, or FAQ-AI queries for short. This is a fundamental computational problem that goes beyond databases: core computation for supervised and unsupervised machine learning can be formulated in FAQ-AI.

Inequalities occur naturally in scenarios involving temporal and spatial relationships between objects in databases. In a retail scenario (e.g., TPC-H), we would like to compute the revenue generated by a customer’s orders whose dates closely precede the ship dates of their lineitems. In streaming scenarios, we would like to detect patterns of events whose time stamps follow a particular order [12]. In spatial data management scenarios, we would like to retrieve objects whose coordinates are within a multi-dimensional range or in close proximity of other objects [27]. The evaluation of Core XPath queries over XML documents amounts to the evaluation of conjunctive queries with inequalities expressing tree relationships in the pre/post plane [16].

1.1 Motivating examples

A key insight of this article is that the efficient computation of inequality joins can reduce the computational complexity of supervised and unsupervised machine learning.

Example 1.1.

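The k-means algorithm divides the input dataset $G$ into $k$ clusters of similar data points [20]. Each cluster $G_i$ has a mean $\mu_i$, which is chosen according to the following optimization (similarity is defined here with respect to the $\ell_2$ norm):

$$\min_{(G_1, \ldots, G_k)} \sum_{i=1}^{k} \sum_{x \in G_i} \|x - \mu_i\|_2^2 \tag{1}$$

Let $\mu_{i,l}$ be the $l$'th component of mean vector $\mu_i$. For a data point $x \in G$, the function $c_{ij}(x)$ computes the difference between the squares of the $\ell_2$-distances from $x$ to $\mu_i$ and from $x$ to $\mu_j$:

$$c_{ij}(x) = \sum_{l} \left[ (x_l - \mu_{i,l})^2 - (x_l - \mu_{j,l})^2 \right] = \sum_{l} \left[ \mu_{i,l}^2 - \mu_{j,l}^2 - 2 x_l (\mu_{i,l} - \mu_{j,l}) \right]$$

A data point $x$ is closest to mean $\mu_i$ from the set of $k$ means iff $c_{ij}(x) \le 0$ for all $j \in \{1, \ldots, k\}$.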

To compute the mean vector , we need to compute the sum of values for each dimension over . If the dataset is the join of database relations over schemas , we can formulate this sum computation as a datalog-like query with aggregates [17]:

The above notation means that the answer to query is the sum of over all tuples satisfying the conjunction on the right-hand side. Section 4 gives further queries necessary to compute the k-means. As we show in this article, such queries with aggregates and inequalities can be computed asymptotically faster than the join defining . ∎
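The reason the closest-mean test fits the additive-inequality form can be checked numerically: the difference of squared distances splits into one term per dimension. The following is a minimal sketch under that reading of Example 1.1; the function names `sq_dist`, `per_dim_terms`, and `closer_to_i` are hypothetical and not taken from the article.

```python
def sq_dist(x, mu):
    """Squared l2-distance between a point and a mean vector."""
    return sum((xl - ml) ** 2 for xl, ml in zip(x, mu))


def per_dim_terms(x, mu_i, mu_j):
    """One additive term per dimension l:
    (x_l - mu_i_l)^2 - (x_l - mu_j_l)^2
      = mu_i_l^2 - mu_j_l^2 - 2 * x_l * (mu_i_l - mu_j_l).
    Each term depends on a single coordinate of x."""
    return [mi * mi - mj * mj - 2 * xl * (mi - mj)
            for xl, mi, mj in zip(x, mu_i, mu_j)]


def closer_to_i(x, mu_i, mu_j):
    """True iff x is at least as close to mu_i as to mu_j,
    decided by a sum of univariate terms compared against 0."""
    return sum(per_dim_terms(x, mu_i, mu_j)) <= 0


x, mu_i, mu_j = (1, 2), (0, 0), (4, 4)
# The per-dimension terms add up to the difference of squared distances.
assert sum(per_dim_terms(x, mu_i, mu_j)) == sq_dist(x, mu_i) - sq_dist(x, mu_j)
```

Because every term involves only one coordinate, the test "x is closest to mean i" is a conjunction of inequalities of the shape "sum of univariate functions is at most 0", one per competing mean.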

Simple queries can already highlight the limitations of state-of-the-art evaluation techniques, as shown next.

Example 1.2.

State-of-the-art techniques take time to compute the following query over relations of size :

Examples 3.9 and 3.19 show how to compute and its counting version in time using the techniques introduced in this article.∎

1.2 The FAQ-AI problem

One way to answer the above queries is to view them as functional aggregate queries (FAQ) [4] formulated in sum-product form over some semiring. We therefore briefly introduce FAQ over a single semiring.

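We first establish notation. For any positive integer $n$, let $[n] = \{1, \ldots, n\}$. For $i \in [n]$, let $X_i$ denote a variable/attribute, and $x_i$ denote a value in the discrete domain $\mathsf{Dom}(X_i)$ of $X_i$. For any $K \subseteq [n]$, define $X_K = (X_i)_{i \in K}$, $x_K = (x_i)_{i \in K} \in \prod_{i \in K} \mathsf{Dom}(X_i)$. That is, $X_K$ is a tuple of variables and $x_K$ is a tuple of values for these variables.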

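Consider a semiring $(D, \oplus, \otimes)$ and a multi-hypergraph $\mathcal H = (\mathcal V = [n], \mathcal E)$. To each edge $K \in \mathcal E$ we associate a function $R_K : \prod_{i \in K} \mathsf{Dom}(X_i) \to D$ called a factor. An FAQ query over one semiring with free variables $F \subseteq [n]$ has the form:

$$Q(x_F) = \bigoplus_{x_{[n] \setminus F}} \ \bigotimes_{K \in \mathcal E} R_K(x_K). \tag{2}$$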

Under the Boolean semiring $(\{\text{true}, \text{false}\}, \vee, \wedge)$, the query (2) becomes a conjunctive query: The factors $R_K$ represent input relations, where $R_K(x_K) = \text{true}$ iff $x_K \in R_K$, with some notational overloading. Under the sum-product semiring, the query (2) counts the number of tuples in the join result for each tuple $x_F$, where the factors $R_K$ are indicator functions $R_K(x_K) = \mathbf 1_{x_K \in R_K}$. (The notation $\mathbf 1_E$ denotes the indicator function of the event $E$ in the semiring $(D, \oplus, \otimes)$: $\mathbf 1_E = \mathbf 1$ if $E$ holds, and $\mathbf 0$ otherwise.) To aggregate over some input variable, say $X_k$, we can designate an identity factor $R_k(x_k) = x_k$.
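To make the sum-product form concrete, the sketch below evaluates a query of the form (2) without free variables by brute force, once over the Boolean semiring and once over the counting (sum-product) semiring. It is only an illustration of the semantics, not the InsideOut algorithm; the names `BOOLEAN`, `COUNTING`, `indicator`, and `faq_no_free_vars` are hypothetical, and a semiring is modelled as a tuple (zero, one, add, mul).

```python
from itertools import product

# A semiring as (zero, one, add, mul); both instances are illustrative.
BOOLEAN = (False, True, lambda a, b: a or b, lambda a, b: a and b)
COUNTING = (0, 1, lambda a, b: a + b, lambda a, b: a * b)


def indicator(relation, semiring):
    """Factor that maps a tuple to the semiring's one if it is in the relation."""
    zero, one, _, _ = semiring
    return lambda t: one if t in relation else zero


def faq_no_free_vars(semiring, domains, factors):
    """Brute-force evaluation: aggregate, over all assignments, the product of
    all factors. `factors` is a list of (variable_indices, function)."""
    zero, one, add, mul = semiring
    total = zero
    for values in product(*domains):
        term = one
        for var_idx, fn in factors:
            term = mul(term, fn(tuple(values[i] for i in var_idx)))
        total = add(total, term)
    return total


# Path join R(a, b), S(b, c) over variables (a, b, c) = indices (0, 1, 2).
R = {(1, 2), (2, 3)}
S = {(2, 3), (2, 4), (3, 1)}
dom = [range(1, 5)] * 3
count = faq_no_free_vars(
    COUNTING, dom,
    [((0, 1), indicator(R, COUNTING)), ((1, 2), indicator(S, COUNTING))])
exists = faq_no_free_vars(
    BOOLEAN, dom,
    [((0, 1), indicator(R, BOOLEAN)), ((1, 2), indicator(S, BOOLEAN))])
```

The same loop answers a conjunctive query or a counting query depending only on which semiring is passed in, which is the point of the FAQ formulation.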

Throughout the article, we assume the query size to be a constant and state runtimes in data complexity. It is known [4] that over an arbitrary semiring, the query (2) can be answered in time $O(N^{\textsf{fhtw}} \log N)$, where $N$ is the size of the largest factor $R_K$, fhtw denotes the fractional hypertree width of the query [15], and $Q$ has no free variables. If $Q$ has free variables, fhtw becomes FAQ-width instead [4]. Over the Boolean semiring, the time can be lowered to $\tilde O(N^{\textsf{subw}})$ [6], where subw is the submodular width [28] and $\tilde O$ hides a polylogarithmic factor in $N$.

Motivated by the examples in Section 1.1, we formulate a class of FAQ queries called FAQ-AI:

Definition 1.3 (FAQ-AI).

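Given a hyperedge multiset $\mathcal E$ that is partitioned into two multisets $\mathcal E = \mathcal E_s \cup \mathcal E_\ell$, where $s$ stands for “skeleton” and $\ell$ stands for “ligament”, the input to a query from the FAQ-AI class is the following:

  1. To each hyperedge $K \in \mathcal E_s$, there corresponds a function $R_K : \prod_{i \in K} \mathsf{Dom}(X_i) \to D$, as in the FAQ case.

  2. To each hyperedge $S \in \mathcal E_\ell$, there correspond $|S|$ functions $\theta^S_v : \mathsf{Dom}(X_v) \to \mathbb R$, one for every variable $v \in S$.

The output of the FAQ-AI query is the following:

$$Q(x_F) = \bigoplus_{x_{[n] \setminus F}} \left( \bigotimes_{K \in \mathcal E_s} R_K(x_K) \right) \otimes \left( \bigotimes_{S \in \mathcal E_\ell} \mathbf 1_{\sum_{v \in S} \theta^S_v(x_v) \le 0} \right). \tag{3}$$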

The summation is over tuples . The (uni-variate) functions can be user-defined functions, e.g., , or binary predicates with one key in and a numeric value, e.g., a table salary(employee_id, salary_value) where employee_id is a key. The only requirement we impose is that, given , the value can be accessed/computed in -time.

If there are no ligament edges, then we get back the FAQ formulation (2).

Example 1.4.

The queries in Section 1.1 are instances of (3):

(4)

is over the sum-product semiring. can be over any semiring: Example 3.9 discusses the case of the Boolean semiring while Example 3.19 discusses the sum-product semiring. ∎

1.3 Our contributions

To answer FAQ queries of the form (2), currently there are two dominant width parameters: fractional hypertree width (fhtw [15]) and submodular width (subw [28]).1 It is known that $\textsf{subw} \le \textsf{fhtw}$ for any query, and in the Boolean semiring we can answer (2) in $\tilde O(N^{\textsf{subw}})$-time [6, 28]. For non-Boolean semirings, the best known algorithm, called InsideOut [4, 5], evaluates (2) in time $\tilde O(N^{\textsf{fhtw}})$. For queries with free variables, fhtw is replaced by the more general notion of FAQ-width (faqw) [4]; however, for brevity we discuss the non-free variable case here.

Following [5], both width parameters subw and fhtw can be defined via two constraint sets: the first is the set TD of all tree decompositions of the query hypergraph $\mathcal H$, and the second is the set $\Gamma_n$ of polymatroids on the vertices of $\mathcal H$. The widths subw and fhtw are then defined as maximin and respectively minimax optimization problems on the domain pair TD and $\Gamma_n$, subject to “edge domination” constraints for $\Gamma_n$. Section 2 presents these notions and other related preliminary concepts in detail.

Our contributions include the following:

Answering FAQ-AI over the Boolean semiring. On the Boolean semiring, one way to answer query (3) is to apply the PANDA algorithm [28], using edge domination constraints on the skeleton edges and the set TD of all tree decompositions of the query hypergraph. However, we can do better. In Section 3.2 we define a new notion of tree decomposition: relaxed tree decomposition, in which the ligament hyperedges only have to be covered by pairs of adjacent TD bags. Then, we present a variant of the InsideOut algorithm running on these relaxed TDs using Chazelle’s classic geometric data structure [9] for solving the semigroup range search problem. We show that our InsideOut variant meets the “relaxed fhtw” runtime, which is the analog of fhtw on relaxed TDs. The PANDA algorithm can use the InsideOut variant as a blackbox to meet the “relaxed subw” runtime. The relaxed widths are never larger than their non-relaxed counterparts, and are strictly smaller for some classes of queries, which means our algorithms yield asymptotic improvements over existing ones.

Answering FAQ over an arbitrary semiring. Next, to set the stage for answering FAQ-AI over an arbitrary semiring, in Section 3.3 we revisit FAQ over a non-Boolean semiring, where no known algorithm can achieve the subw-runtime. Here, we relax the set of polymatroids to a superset of relaxed polymatroids. Then, by adapting the subw definition to relaxed polymatroids, we obtain a new width parameter called “sharp submodular width” (#subw). We show how a variant of PANDA, called #PANDA, can achieve a runtime of $\tilde O(N^{\#\textsf{subw}})$ for evaluating FAQ over an arbitrary semiring. We prove that $\textsf{subw} \le \#\textsf{subw} \le \textsf{fhtw}$, and that there are classes of queries for which #subw is unboundedly smaller than fhtw.

Answering FAQ-AI over an arbitrary semiring. Getting back to FAQ-AI, we apply the #subw result under both relaxations, relaxed TDs and relaxed polymatroids, to obtain a new width parameter called the relaxed #subw. We show that the new variants of PANDA and InsideOut can achieve the relaxed #subw runtime. We also show that there are queries for which relaxed #subw is essentially the best we can hope for, modulo k-sum-hardness.

Applications to relational machine learning. Equipped with the algorithms for answering FAQ-AI, in Section 4 we return to relational machine learning applications over training datasets defined by feature extraction queries over relational databases. We show how one can train linear SVM, k-means, and ML models using Huber/hinge loss functions without completely materializing the output of the feature extraction queries. In particular, this shows that for these important classes of ML models, one can sometimes train models in time sub-linear in the size of the training dataset.

1.4 Related work

Appendix A revisits two prior results on the evaluation of queries with inequalities through FAQ-AI lenses: Core XPath queries over XML documents [14] and inequality joins over tuple-independent probabilistic databases [32]. Throughout the article, we contrast our new width notions with fhtw and subw and our new algorithm #PANDA with the state-of-the-art algorithms PANDA and InsideOut for FAQ and FAQ-AI queries.

Prior seminal work considers the containment and minimization problem for queries with inequalities [23]. The efficient evaluation of such queries continues to receive attention in the database community [22]. There is a large body of work on queries with disequalities (not-equal), which are at times referred to as inequalities. Queries with disequalities are a proper subclass of FAQ-AI (since $x \neq y$ can be represented as $x < y \vee x > y$). Prior works [24, 3] present several results for this proper subclass that are stronger than our general results for FAQ-AI in this work. In particular, for queries with disequalities it suffices to consider tree decompositions only for “skeleton” edges (ignoring “ligament” edges which, in this case, are the disequalities) [24, 3], whereas for the more general FAQ-AI we need to consider “relaxed” tree decompositions (see Def. 3.3).

Section 4 reviews relevant works on machine learning.

2 Preliminaries

We assume without loss of generality that the semiring operations $\oplus$ and $\otimes$ can be performed in $O(1)$-time. (When the assumption does not hold, for the set semiring for instance, we can multiply the claimed runtime by the real operation’s runtime.)

2.1 Tree decompositions and polymatroids

We briefly define tree decompositions, fhtw and subw parameters. We refer the reader to the recent survey by Gottlob et al. [13] for more details and historical contexts. In what follows, the hypergraph should be thought of as the hypergraph of the input query, although the notions of tree decomposition and width parameters are defined independently of queries.

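A tree decomposition of a hypergraph $\mathcal H = (\mathcal V, \mathcal E)$ is a pair $(T, \chi)$, where $T$ is a tree and $\chi : V(T) \to 2^{\mathcal V}$ maps each node $t$ of the tree to a subset $\chi(t)$ of vertices such that

  1. every hyperedge $S \in \mathcal E$ is a subset of some $\chi(t)$, $t \in V(T)$ (i.e. every edge is covered by some bag),

  2. for every vertex $v \in \mathcal V$, the set $\{t \mid v \in \chi(t)\}$ is a non-empty (connected) sub-tree of $T$. This is called the running intersection property.

The sets $\chi(t)$ are called the bags of the tree decomposition.

Let $\textsf{TD}(\mathcal H)$ denote the set of all tree decompositions of $\mathcal H$. When $\mathcal H$ is clear from context, we use TD for brevity.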

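To define width parameters, we use the polymatroid characterization from Abo Khamis et al. [6]. A function $f : 2^{\mathcal V} \to \mathbb R_+$ is called a (non-negative) set function on $\mathcal V$. A set function $f$ on $\mathcal V$ is modular if $f(S) = \sum_{v \in S} f(\{v\})$ for all $S \subseteq \mathcal V$, monotone if $f(X) \le f(Y)$ whenever $X \subseteq Y$, and submodular if $f(X \cup Y) + f(X \cap Y) \le f(X) + f(Y)$ for all $X, Y \subseteq \mathcal V$. A monotone, submodular set function $h : 2^{\mathcal V} \to \mathbb R_+$ with $h(\emptyset) = 0$ is called a polymatroid. Let $\Gamma_n$ denote the set of all polymatroids on $\mathcal V = [n]$.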
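For small ground sets, the three polymatroid conditions (value 0 on the empty set, monotonicity, submodularity) can be checked exhaustively. The sketch below does this by enumerating all pairs of subsets; it is exponential in the number of vertices and meant only to make the definition tangible. The names `subsets` and `is_polymatroid` are hypothetical.

```python
from itertools import combinations


def subsets(ground):
    """Yield every subset of the ground set as a frozenset."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_polymatroid(h, ground, eps=1e-9):
    """Check h(empty) = 0, monotonicity, and submodularity by brute force."""
    all_sets = list(subsets(ground))
    if abs(h(frozenset())) > eps:
        return False
    for X in all_sets:
        for Y in all_sets:
            # Monotone: X subset of Y implies h(X) <= h(Y).
            if X <= Y and h(X) > h(Y) + eps:
                return False
            # Submodular: h(X u Y) + h(X n Y) <= h(X) + h(Y).
            if h(X | Y) + h(X & Y) > h(X) + h(Y) + eps:
                return False
    return True


ground = {1, 2, 3}
rank_like = lambda S: min(len(S), 2)   # rank function of a uniform matroid
squared = lambda S: len(S) ** 2        # monotone but not submodular
```

Here `rank_like` passes all three checks, while `squared` fails submodularity already on two disjoint singletons.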

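Given $\mathcal H = (\mathcal V, \mathcal E)$, define the set of edge dominated set functions:

$$\textsf{ED} := \{ h \mid h : 2^{\mathcal V} \to \mathbb R_+,\ h(S) \le 1 \ \ \forall S \in \mathcal E \}. \tag{5}$$

We next define the fractional hypertree width and the submodular width of a given hypergraph $\mathcal H$:

$$\textsf{fhtw}(\mathcal H) := \min_{(T,\chi) \in \textsf{TD}} \ \max_{h \in \textsf{ED} \cap \Gamma_n} \ \max_{t \in V(T)} h(\chi(t)), \tag{6}$$
$$\textsf{subw}(\mathcal H) := \max_{h \in \textsf{ED} \cap \Gamma_n} \ \min_{(T,\chi) \in \textsf{TD}} \ \max_{t \in V(T)} h(\chi(t)). \tag{7}$$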

It is known [28] that $\textsf{subw}(\mathcal H) \le \textsf{fhtw}(\mathcal H)$, and there are classes of hypergraphs with bounded subw and unbounded fhtw. Furthermore, fhtw is strictly less than other width notions such as (generalized) hypertree width and tree width.

Remark 2.1.

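Prior to Abo Khamis et al. [6], the commonly used definition of $\textsf{fhtw}(\mathcal H)$ is [15]

$$\textsf{fhtw}(\mathcal H) := \min_{(T,\chi) \in \textsf{TD}} \ \max_{t \in V(T)} \rho^*_{\mathcal E}(\chi(t)),$$

where $\rho^*_{\mathcal E}(B)$ is the fractional edge cover number of a vertex set $B$ using the hyperedge set $\mathcal E$. It is straightforward to show, using linear programming duality [6], that

$$\max_{h \in \textsf{ED} \cap \Gamma_n} h(B) = \rho^*_{\mathcal E}(B) \quad \text{for every } B \subseteq \mathcal V, \tag{8}$$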

proving the equivalence of the two definitions. However, the characterization (6) has two primary advantages: (i) it exposes the minimax / maximin duality between fhtw and subw, and more importantly (ii) it makes it completely straightforward to relax the definitions by replacing the constraints by other applicable constraints, as shall be shown in later sections.∎
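As a small worked instance of this equivalence, consider the triangle hypergraph (an illustrative choice, not an example from the article). Its three vertices form a clique, so every tree decomposition has a bag containing all of them, and both characterizations give the value 3/2:

```latex
% Triangle hypergraph: V = {1,2,3}, E = {{1,2},{2,3},{1,3}}.
% Every tree decomposition has a bag equal to V, so it suffices to look at B = V.
\begin{align*}
\rho^*_{\mathcal E}(\mathcal V)
  &= \min\Bigl\{ \lambda_{12} + \lambda_{23} + \lambda_{13} \;:\;
       \lambda \ge 0,\ \text{every vertex is covered with total weight} \ge 1 \Bigr\}
   = \tfrac{3}{2}
  && \bigl(\lambda_{12} = \lambda_{23} = \lambda_{13} = \tfrac12\bigr), \\
h(S) &= \tfrac{|S|}{2} \ \text{is a polymatroid with } h(\{i,j\}) = 1 \le 1
  \ \text{and}\ h(\mathcal V) = \tfrac{3}{2}.
\end{align*}
% Summing the three vertex-cover constraints gives 2(\lambda_{12}+\lambda_{23}+\lambda_{13}) \ge 3,
% so 3/2 is optimal for the cover; the edge-dominated polymatroid h attains the same value.
```

The fractional edge cover is the primal side and the edge-dominated polymatroid is the dual witness, so for the triangle the two definitions agree on 3/2.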

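Definition 2.2 ($F$-connex tree decomposition [7, 35]).

Given a hypergraph $\mathcal H = (\mathcal V, \mathcal E)$ and a set $F \subseteq \mathcal V$, a tree decomposition $(T, \chi)$ of $\mathcal H$ is $F$-connex if there is a subset $V' \subseteq V(T)$ that forms a connected subtree of $T$ and satisfies $\bigcup_{t \in V'} \chi(t) = F$. (Note that $V'$ could be empty.)

We use $\textsf{TD}_F$ to denote the set of all $F$-connex tree decompositions of $\mathcal H$. (Note that when $F = \emptyset$, $\textsf{TD}_F = \textsf{TD}$.)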

2.2 InsideOut and PANDA

To answer the FAQ query (2), we need a model for the representation of the input factors $R_K$. The support of the function $R_K$ is the set of tuples $x_K$ such that $R_K(x_K) \neq \mathbf 0$. We use $|R_K|$ to denote the size of its support. For example, if $R_K$ represents an input relation, then $|R_K|$ is the number of tuples in $R_K$. In practice, there often are factors with infinite support, e.g., $R_K$ represents a built-in function in a database, an arithmetic operator, or a comparison operator as in (3). To deal with this more general setting, the edge set $\mathcal E$ is partitioned into two sets $\mathcal E = \mathcal E_{\not\infty} \cup \mathcal E_\infty$, where $|R_K|$ is finite for all $K \in \mathcal E_{\not\infty}$ and $|R_K| = \infty$ for all $K \in \mathcal E_\infty$. For simplicity, we often state runtimes of algorithms in terms of the “input size” $N := \max_{K \in \mathcal E_{\not\infty}} |R_K|$. Moreover, we use $|Q|$ to denote the output size of $Q$. We always assume that $F \subseteq \bigcup_{K \in \mathcal E_{\not\infty}} K$; otherwise the output size could be infinite.

InsideOut [4, 5] To answer (2), the InsideOut algorithm works by eliminating variables, along with an idea called the “indicator projection”. Its runtime is described by the FAQ-width of the query, a slight generalization of fhtw. For one semiring, we can define by applying Definition (6) over a restricted set of tree decompositions and edge dominated polymatroids. In particular, let denote the set of free variables in (2), and recall from Definition 2.2. Then,

(9)
(10)
(11)

Note that when and (i.e. ). A simple result from Abo Khamis et al. [4] is the following: (Recall that throughout the article we assume the query size to be a constant and state runtimes in data complexity.)

Proposition 2.3 ([4]).

InsideOut answers query (2) in time .

To solve the FAQ-AI (3), we can apply Proposition 2.3 with $\mathcal E_\infty = \mathcal E_\ell$ since all ligament factors are infinite. But this is suboptimal; later, we show a new InsideOut variant that is polynomially better.

PANDA [6]. For the Boolean semiring, i.e., when the FAQ query (2) is of the form

(12)

we can do much better than Proposition 2.3. When , Marx [28] showed that (12) can be answered in time . The PANDA algorithm [6] generalizes Marx’s result to deal with general degree constraints, and to meet precisely the -runtime. In fact, PANDA works with queries such as (12) with free variables as well. In the context of this article, we can define the following notion of submodular FAQ-width in a natural way:

(13)

Then, the results from Abo Khamis et al. [6] imply:

Proposition 2.4 ([6]).

PANDA answers query (12) in time .

These results only work for the Boolean semiring. Section 3 introduces a variant of PANDA, called #PANDA, that also works for non-Boolean semirings.

2.3 Semigroup range searching

Orthogonal range counting (and searching) is a classic and ubiquitous problem in computational geometry [11]: given a set of points in a -dimensional space, build a data structure that, given any -dimensional rectangle, can efficiently return the number of enclosed points. More generally, there is the semigroup range searching problem [9], where each point of the input points also has a weight , where is a semigroup.2 The problem is: given a -dimensional rectangle , compute .

Classic results by Chazelle [9] show that there are data structures for semigroup range searching which can be constructed in time , and answer rectangular queries in -time. Also, this is almost the best we can hope for [10]. There are more recent improvements to Chazelle’s result (see, e.g., Chan et al. [8]), but they are minor (at most a factor), as the original results were already very close to matching the lower bound.

Most of these range search/counting problems can be reduced to the dominance range searching problem (on semigroups), where the query is represented by a point $q$, and the objective is to return $\bigoplus_{p \,:\, q \preceq p} w(p)$, the aggregate of the weights of all input points $p$ that dominate $q$. Here, $\preceq$ denotes the “dominance” relation (coordinate-wise $\le$). We can think of $q$ as the lower-corner of an infinite rectangle query.
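To illustrate what a dominance query computes, here is a minimal two-dimensional, offline sketch: all queries are known in advance, points are swept in decreasing first coordinate, and a Fenwick tree over the second coordinate aggregates weights. This is a simple stand-in, not Chazelle's data structure; it uses integer addition as the semigroup operation, and the names `Fenwick` and `dominance_sums` are hypothetical.

```python
from bisect import bisect_left


class Fenwick:
    """Prefix sums over positions 0..n-1 (only additions, no subtraction)."""

    def __init__(self, n):
        self.tree = [0] * (n + 1)

    def add(self, i, w):
        i += 1
        while i < len(self.tree):
            self.tree[i] += w
            i += i & -i

    def prefix(self, count):
        """Sum of the weights stored at positions [0, count)."""
        s = 0
        while count > 0:
            s += self.tree[count]
            count -= count & -count
        return s


def dominance_sums(points, weights, queries):
    """For each query q, sum the weights of points p with q[0] <= p[0]
    and q[1] <= p[1], i.e. the points that dominate q."""
    ys = sorted({y for _, y in points})
    fen = Fenwick(len(ys))
    by_x = sorted(range(len(points)), key=lambda i: -points[i][0])
    q_by_x = sorted(range(len(queries)), key=lambda j: -queries[j][0])
    answers = [0] * len(queries)
    k = 0
    for j in q_by_x:
        qx, qy = queries[j]
        # Insert every point whose first coordinate is at least qx.
        while k < len(by_x) and points[by_x[k]][0] >= qx:
            i = by_x[k]
            rank = bisect_left(ys, points[i][1])
            fen.add(len(ys) - 1 - rank, weights[i])  # reversed: large y first
            k += 1
        # Number of distinct y-values >= qy; they occupy the first positions.
        answers[j] = fen.prefix(len(ys) - bisect_left(ys, qy))
    return answers


pts = [(1, 1), (2, 3), (3, 2), (4, 4)]
wts = [1, 10, 100, 1000]
res = dominance_sums(pts, wts, [(2, 2), (0, 0), (3, 3), (5, 0)])
```

Storing the second coordinate in reversed order lets each query be answered with a single prefix aggregate, so no inverse operation is needed, which matches the semigroup setting.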

3 Relaxed tree decompositions and relaxed polymatroids

3.1 Connection to semigroup range searching

We always assume that ; otherwise the output size could be infinite. We start with a special case of (3) in which the skeleton part contains only two hyperedges and . Consider the aggregate query of the form

(14)

where and are two input functions/relations over variable sets and , respectively. We prove the following simple but important lemma:

Lemma 3.1.

Let , and . For , query (14) can be answered in time .

Proof.

If there is a hyperedge for which , then in a -time pre-processing step we can “absorb” the factor into the factor , by replacing with . A similar absorption can be done with . Hence, without loss of generality we can assume that and for all . Furthermore, we only need to show that we can compute (14) for , because after is computed, we can marginalize away variables in -time.

Abusing notation somewhat, for each and each , define the function by

(15)

Fix a tuple such that . A tuple is said to be -adjacent if . We show how to compute the following sum in poly-logarithmic time:

(16)
(17)

where the inner sum ranges only over tuples which are -adjacent; non-adjacent tuples contribute .

Now, for the fixed and for each define the following -dimensional points:

We write to say that is dominated by coordinate-wise: . Assign to each point a “weight” of . Now, taking (17),

(18)
(19)

The expression thus computes, for a given “query point” , the weighted sum over all points that dominate the query point. This is precisely the dominance range counting problem, which—modulo a -preprocessing step—can be solved in time [9], as reviewed in Section 2.3.

To conclude the proof, note that (14) can be written as (assuming as is the case in Lemma 3.1)

where the outer sum ranges over tuples in . ∎
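The simplest instance of the lemma has two relations that share a join key and a single additive inequality between them. In that case the range structure degenerates to a sorted list per join key and one binary search per tuple. The sketch below counts the qualifying pairs this way; it assumes the inequality has the shape u(r) + v(s) <= 0, and the names `count_join_with_inequality`, `key_r`, `key_s`, `u`, and `v` are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def count_join_with_inequality(R, S, key_r, key_s, u, v):
    """Count pairs (r, s) with key_r(r) == key_s(s) and u(r) + v(s) <= 0
    in O(N log N) time instead of enumerating the join."""
    buckets = defaultdict(list)
    for s in S:
        buckets[key_s(s)].append(v(s))
    for vals in buckets.values():
        vals.sort()
    count = 0
    for r in R:
        vals = buckets.get(key_r(r))
        if vals:
            # u(r) + v(s) <= 0  is equivalent to  v(s) <= -u(r).
            count += bisect_right(vals, -u(r))
    return count


R = [(1, "k"), (5, "k"), (2, "m")]
S = [("k", -3), ("k", -1), ("m", -2), ("k", -6)]
fast = count_join_with_inequality(
    R, S, key_r=lambda r: r[1], key_s=lambda s: s[0],
    u=lambda r: r[0], v=lambda s: s[1])
brute = sum(1 for r in R for s in S if r[1] == s[0] and r[0] + s[1] <= 0)
```

With more than one inequality, each tuple of one relation becomes a point with one coordinate per inequality, and the binary search is replaced by a multi-dimensional dominance query.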

Example 3.2.

Let be a binary relation. Suppose we want to count the number of tuples satisfying . By setting , , , the problem can be reduced to the form (14) with , . We can thus compute this count in time .∎

3.2 Relaxed tree decompositions

Equipped with this basic case, we can now proceed to solve the general setting of (3). To this end, we define a new width parameter.

Definition 3.3 (Relaxed tree decomposition).

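Let $\mathcal H = (\mathcal V, \mathcal E)$ denote a multi-hypergraph whose edge multiset is partitioned into $\mathcal E = \mathcal E_s \cup \mathcal E_\ell$. A relaxed tree decomposition of $\mathcal H$ (with respect to the partition $\mathcal E_s \cup \mathcal E_\ell$) is a pair $(T, \chi)$, where $T = (V(T), E(T))$ is a tree, and $\chi : V(T) \to 2^{\mathcal V}$ satisfies the following properties:

  • The running intersection property holds: for each vertex $v \in \mathcal V$ the set $\{t \in V(T) \mid v \in \chi(t)\}$ is a connected subtree in $T$.

  • Every “skeleton” edge $S \in \mathcal E_s$ is covered by some bag $\chi(t)$, $t \in V(T)$.

  • Every “ligament” edge $S \in \mathcal E_\ell$ is covered by the union of two adjacent bags $\chi(s)$ and $\chi(t)$, i.e. $S \subseteq \chi(s) \cup \chi(t)$, where $\{s, t\} \in E(T)$.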

Let denote the set of all relaxed tree decompositions of (with respect to the skeleton-ligament partition). When is clear from context we use for the sake of brevity. Given , let denote the set of all relaxed -connex tree decompositions of .
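The three conditions of Definition 3.3 translate directly into a validity check. The sketch below assumes the given tree edges really form a tree, and, as a labeled assumption, also accepts a ligament edge that already fits inside a single bag. The function name `is_relaxed_tree_decomposition` and the input encoding (a dict of bags plus a list of tree edges) are hypothetical.

```python
def is_relaxed_tree_decomposition(bags, tree_edges, skeleton, ligament):
    """bags: dict node -> set of vertices; tree_edges: list of (node, node).
    skeleton / ligament: lists of hyperedges (iterables of vertices)."""
    adj = {t: set() for t in bags}
    for s, t in tree_edges:
        adj[s].add(t)
        adj[t].add(s)

    # Running intersection: the nodes holding a vertex form a connected subtree.
    vertices = set().union(*bags.values())
    for v in vertices:
        holders = {t for t, bag in bags.items() if v in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            t = stack.pop()
            for nb in adj[t]:
                if nb in holders and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if seen != holders:
            return False

    # Skeleton edges must fit inside one bag.
    for e in skeleton:
        if not any(set(e) <= bag for bag in bags.values()):
            return False

    # Ligament edges must fit inside the union of two adjacent bags
    # (assumption: a single covering bag is accepted as well).
    for e in ligament:
        in_one = any(set(e) <= bag for bag in bags.values())
        in_two = any(set(e) <= bags[s] | bags[t] for s, t in tree_edges)
        if not (in_one or in_two):
            return False
    return True


bags = {"t1": {"a", "b"}, "t2": {"b", "c"}}
edges = [("t1", "t2")]
skeleton = [("a", "b"), ("b", "c")]
ligament = [("a", "c")]
```

In this toy instance the two bags form a relaxed tree decomposition, but not an ordinary one if the edge on a and c had to be covered by a single bag.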

FAQ-AI on a general semiring

We use relaxed TDs in conjunction with Lemma 3.1 to answer FAQ-AI with a relaxed notion of faqw. In particular, the relaxed width parameters of are defined in exactly the same way as the usual width parameters defined in Section 2, except we allow the TDs to range over relaxed ones.

Definition 3.4 (Relaxed faqw).

Let be an FAQ-AI query (3), and be its hypergraph. Furthermore, let denote the set of hyperedges for which . Then, the relaxed FAQ-width of is defined by

(20)

When , collapses to which is the relaxed fhtw for FAQ-AI without free variables:

(21)

A relaxed tree decomposition of is optimal if its width is equal to , i.e.,

Theorem 3.5.

Any FAQ-AI query of the form (3) on any semiring can be answered in time , where is the maximum number of additive inequalities covered by a pair of adjacent bags in an optimal relaxed tree decomposition.3

Proof.

We first consider the case of no free variables because this case captures the key idea. Fix an optimal relaxed tree decomposition . We first compute, for each bag of the tree decomposition, a factor such that

(22)
(23)

To define the factors , we need the notion of indicator projection [5, 4]. For a given and such that , the indicator projection of onto the bag is a function defined by

(24)

Recall from Definition 3.3 that every is covered by at least one bag for . Fix an arbitrary coverage assignment , where is covered by the bag . Then, the factors are defined by:

(25)

It is easy to verify that (23) holds. Using a worst-case optimal join algorithm [30, 31, 39] we can compute (25) in time

(26)

Over all , our runtime is bounded by , where

(27)

The support of each factor has size bounded by .

Next we compute (23) in time . We will make use of the fact that is a relaxed TD. Fix an arbitrary root of the tree decomposition ; following InsideOut, we compute (23) by eliminating variables from the leaves of up to the root. Without loss of generality, we assume that the tree decomposition is non-redundant, i.e., no bag is a subset of another in the tree decomposition (otherwise the contained bag factor can be “absorbed” into the containee bag factor). Let be any leaf of , be its parent, where and . Now write (23) as follows:

(28)

The third equality uses the semiring’s distributive law. (Note that implies that thanks to Definition 3.3 and the fact that is the only neighbor of .) Lemma 3.1 implies that we can compute the sub-query from (28) in the allotted time. The above step eliminates all variables in . Repeatedly applying the above step yields the desired output .

When the query has free variables, the algorithm proceeds similarly to the case of an FAQ with free variables [4]. ∎

Example 3.6.

Given three binary relations and , consider a query that counts the number of tuples that satisfy:

(29)

The query has and . Let . Note that . In fact, any of the previously known algorithms, e.g. [4, 5], would take time to answer . However, this query has , and by Theorem 3.5, it can be answered in time . (Note that here .) An optimal relaxed tree decomposition is shown in Figure 1.∎

Figure 1: An optimal relaxed tree decomposition for the query in Example 3.6. Ligament edges are dashed. Each skeleton edge is held in one bag.

We next give a couple of simple lower and upper bounds for . The upper bound shows that, effectively is the best we can hope for, if the FAQ-AI query is arbitrary. The lower bound shows that, while the relaxed tree decomposition idea can improve the runtime by a polynomial factor, it cannot improve the runtime over straightforwardly applying InsideOut (over non-relaxed tree decompositions) by more than a polynomial factor.

Proposition 3.7.

For any positive integer $k$, there exists an FAQ-AI query of the form (3) for which , and it cannot be answered in time , modulo $k$-sum hardness.

Proof.

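It is widely assumed [33, 26] that $O(N^{\lceil k/2 \rceil})$ is the best runtime for $k$-sum, which is the following problem: given $k$ number sets $S_1, \ldots, S_k$ of maximum size $N$, determine whether there is a tuple $(t_1, \ldots, t_k) \in S_1 \times \cdots \times S_k$ such that $\sum_{i \in [k]} t_i = 0$. We can reduce $k$-sum to our problem: Consider the query over the Boolean semiring: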

(30)

The answer to is true iff there is a tuple such that . The reduction shows that our query (30) is $k$-sum-hard. For this query, . ∎
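For reference, the k-sum problem itself can be solved by the textbook meet-in-the-middle approach, which enumerates the sums of the first half of the sets and probes them with the negated sums of the second half. The sketch assumes the standard target of zero and expected constant-time hashing; the function name `k_sum_exists` is hypothetical, and this is the classical baseline rather than anything specific to the article.

```python
from itertools import product


def k_sum_exists(sets):
    """Decide whether one element can be picked from each set so that the
    picked elements sum to zero (meet in the middle)."""
    k = len(sets)
    half = (k + 1) // 2  # ceil(k / 2) sets go to the left side
    # All sums over the left half: at most N^ceil(k/2) values.
    left_sums = {sum(t) for t in product(*sets[:half])}
    # Probe with the negated sums over the right half.
    return any(-sum(t) in left_sums for t in product(*sets[half:]))


yes_instance = [[1, 2], [3, 4], [-5, 10]]   # 1 + 4 + (-5) = 0
no_instance = [[1, 2], [3, 4], [10, 20]]
```

The number of enumerated partial sums is governed by the larger half, which is where the exponent of roughly k/2 in the conjectured bound comes from.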

Proposition 3.8.

For any FAQ-AI query of the form (3), we have ; in particular, when has no free variables .

Proof.

Let denote a relaxed tree decomposition of with fractional hypertree width . Construct a new (non-relaxed) tree decomposition for as follows. Each vertex in is also a vertex in with . Moreover, to each edge there corresponds an additional vertex in whose bag is . As for the edge set of , for each edge , there are two corresponding edges in , namely and . We can verify that