Answering Conjunctive Queries under Updates

This is the full version of the conference contribution [?].
We consider the task of enumerating and counting answers to $k$-ary conjunctive queries against relational databases that may be updated by inserting or deleting tuples.
We exhibit a new notion of q-hierarchical conjunctive queries and show that these can be maintained efficiently in the following sense. During a linear time preprocessing phase, we can build a data structure that enables constant delay enumeration of the query results; and when the database is updated, we can update the data structure and restart the enumeration phase within constant time. For the special case of self-join free conjunctive queries we obtain a dichotomy: if a query is not q-hierarchical, then query enumeration with sublinear delay and sublinear update time (and arbitrary preprocessing time) is impossible.
For answering Boolean conjunctive queries and for the more general problem of counting the number of solutions of $k$-ary queries we obtain complete dichotomies: if the query’s homomorphic core is q-hierarchical, then the size of the query result can be computed in linear time and maintained with constant update time. Otherwise, the size of the query result cannot be maintained with sublinear update time.
All our lower bounds rely on the OMv-conjecture, a conjecture on the hardness of online matrix-vector multiplication that has recently emerged in the field of fine-grained complexity to characterise the hardness of dynamic problems. The lower bound for the counting problem additionally relies on the orthogonal vectors conjecture, which in turn is implied by the strong exponential time hypothesis.
By sublinear we mean $O(n^{1-\varepsilon})$ for some $\varepsilon > 0$, where $n$ is the size of the active domain of the current database.
We study the algorithmic problem of answering a conjunctive query against a dynamically changing relational database . Depending on the problem setting, we want to answer a Boolean query, count the number of output tuples of a non-Boolean query, or enumerate the query result with constant delay. We consider finite relational databases over a possibly infinite domain in the fully dynamic setting where new tuples can be inserted or deleted.
At the beginning, a dynamic query evaluation algorithm gets a query together with an initial database . It starts with a preprocessing phase where a suitable data structure is built to represent the result of evaluating against . Afterwards, when the database is updated by inserting or deleting a tuple, the data structure is updated, too, and the result of evaluating on the updated database is reported.
The update time is the time needed to compute the representation of the new query result. For an algorithm to be efficient, we require that the update time is much smaller than the time needed to recompute the entire query result. In particular, we regard constant update time, which may depend on the query but not on the database, as feasible. One can even argue that update time that scales polylogarithmically with the size of the database is feasible. On the other hand, we regard update time that scales polynomially with the size of the database as infeasible.
This paper’s aim is to classify those conjunctive queries (CQs, for short) that can be efficiently maintained under updates, and to distinguish them from queries that are hard in this sense.
We identify a subclass of conjunctive queries that can be efficiently maintained by a dynamic evaluation algorithm. We call these queries q-hierarchical, and this notion is strongly related to the hierarchical property that was introduced by Dalvi and Suciu in [?] and has already played a central role for efficient query evaluation in various contexts (see Section 3 for a definition and discussion of related concepts). We show that after a linear time preprocessing phase the result of any q-hierarchical conjunctive query can be maintained with constant update time. This means that after every update we can answer a Boolean q-hierarchical query and compute the number of result tuples of a non-Boolean query in constant time. Moreover, we can enumerate the query result with constant delay.
We are also able to prove matching lower bounds. These bounds are conditioned on the OMv-conjecture, a conjecture on the hardness of online matrix-vector multiplication that was introduced by Henzinger, Krinninger, Nanongkai, and Saranurak in [?] to characterise the hardness of many dynamic problems. The lower bound for the counting problem additionally relies on the OV-conjecture, a conjecture on the hardness of the orthogonal vectors problem which in turn is implied by the well-known strong exponential time hypothesis [?]. We obtain the following dichotomies, which are stated from the perspective of data complexity (i.e., the query is regarded to be fixed) and hold for any fixed . By we always denote the size of the active domain of the current database . For the enumeration problem we restrict our attention to self-join free CQs, where every relation symbol occurs only once in the query.
For the databases we construct in our lower bound proofs it holds that . Therefore all our lower bounds of the form translate to in terms of the size of the database.
In more practically motivated papers the task of answering a fixed query against a dynamic database has been studied under the name incremental view maintenance (see e. g. [?]). Given the huge amount of theoretical results on the complexity of query evaluation in the static setting, surprisingly little is known about the computational complexity of query evaluation under updates.
The dynamic descriptive complexity framework introduced by Patnaik and Immerman [?] focuses on the expressive power of (fragments or extensions of) first-order logic on dynamic databases and has led to a rich body of literature (see [?] for a survey). This approach, however, is quite different from the algorithmic setting considered in the present paper, as in every update step the internal data structure is updated by evaluating a first-order query. As this may take polynomial time even for the special case of conjunctive queries (as considered e. g. by Zeume and Schwentick in [?]), this is too expensive in the area of dynamic algorithms.
We are aware of only a few papers dealing with the computational complexity of query evaluation under updates, namely the study of XPath evaluation of Björklund, Gelade, and Martens [?] and the studies of MSO queries on trees by Balmin, Papakonstantinou, and Vianu [?] and by Losemann and Martens [?], the latter of which is to the best of our knowledge the only work that deals with efficient query enumeration under updates.
In the static setting, a lot of research has been devoted to classifying those conjunctive queries that can be answered efficiently. Below we give an overview of known results.
Complexity of Boolean Queries. The complexity of answering Boolean conjunctive queries on a static database is fairly well understood. For every fixed database schema , extending a result of [?], Grohe [?] gave a tight characterisation of the tractable CQs under the complexity theoretic assumption FPT $\neq$ W[1]: If we are given a Boolean CQ of size and a -database of size , then can be answered against in time for some computable function if, and only if, the homomorphic core of has bounded treewidth. Marx [?] extended this classification to the case where the schema is part of the input.
Counting Complexity. For computing the number of output tuples of a given join query (i.e., a quantifier-free CQ) over a fixed schema , a characterisation was proven by Dalmau and Jonsson [?]: Assuming FPT $\neq$ #W[1], the output size of a join query evaluated on a -database of size can be computed in time if, and only if, has bounded treewidth. The result has recently been extended to all conjunctive queries over a fixed schema by Chen and Mengel [?]. Structural properties that make the counting problem for CQs tractable in the case where the schema is part of the input have been identified in [?].
Join Evaluation. When the entire result of a non-Boolean query has to be computed, the evaluation problem cannot be modelled as a decision or counting problem and one has to come up with different measures to characterise the hardness of query evaluation. One approach that has been fruitfully applied to join evaluation is to study the worst-case output size as a measure of the hardness of a query. Atserias, Grohe, and Marx [?] identified the fractional edge cover number of the join query as a crucial measure for lower bounding its worst-case output size. This bound was also shown to be optimal and is matched by so called “worst-case optimal” join evaluation algorithms, see [?].
Query Enumeration. Another way of studying non-Boolean queries that is independent of the actual or worst-case output size is query enumeration. A query enumeration algorithm evaluates a non-Boolean query by reporting, one by one without repetition, the tuples in the query result. The crucial measure to characterise queries that are tractable w.r.t. enumeration is the delay between two output tuples. In the context of constraint satisfaction, the combined complexity, where the query as well as the database are given as input, has been considered. As the size of the query result might be exponential in the input size in this setting, queries that can be enumerated with polynomial delay and polynomial preprocessing are regarded as “tractable.” Classes of conjunctive queries that can be enumerated with polynomial delay have been identified in [?]. However, a complete characterisation of conjunctive queries that are tractable in this sense is not in sight.
More relevant to the database setting, where one evaluates a small query against a large database, is the notion of constant delay enumeration introduced by Durand and Grandjean in [?]. The preprocessing time is supposed to be much smaller than the time needed to evaluate the query (usually, linear in the size of the database), and the delay between two output tuples may depend on the query, but not on the database. A lot of research has been devoted to this subject, where one usually tries to understand which structural restrictions on the query or on the database allow constant delay enumeration. For an introduction to this topic and an overview of the state-of-the-art we refer the reader to the surveys [?].
Bagan, Durand, and Grandjean [?] showed that acyclic conjunctive queries that are free-connex can be enumerated with constant delay after a linear time preprocessing phase (cf. [?] for a simplified proof of their result). They also showed that for self-join free acyclic conjunctive queries the free-connex property is essential by proving the following lower bound. Assume that multiplying two matrices cannot be done in time . Then the result of a self-join free acyclic conjunctive query that is not free-connex cannot be enumerated with constant delay after a linear time preprocessing phase.
It turns out that our notion of q-hierarchical conjunctive queries is a proper subclass of the free-connex conjunctive queries. Thus, there are queries that can be efficiently enumerated in the static setting but are hard to maintain under database updates.
Organisation. The remainder of the paper is structured as follows. In Section 2 we fix the basic notation along with the concept of dynamic algorithms for query evaluation. Section 3 introduces q-hierarchical queries and formally states our main theorems. We then present an alternative characterisation of q-hierarchical queries in Section ? and prove our lower and upper bound theorems in Sections 4 and 5, respectively. We conclude in Section 6.
Acknowledgement. We acknowledge the financial support by the German Research Foundation DFG under grant SCHW 837/5-1. The first author wants to thank Thatchaphol Saranurak and Ryan Williams for helpful discussions on algorithmic conjectures.
We write for the set of non-negative integers and let and for all . By we denote the power set of a set .
Databases. We fix a countably infinite set , the domain of potential database entries. Elements in are called constants. A schema is a finite set of relation symbols, where each is equipped with a fixed arity . Let us fix a schema , and let for . A database of schema (-db, for short), is of the form , where is a finite subset of . The active domain of is the smallest subset of such that for all .
Updates. A given database of schema may be updated by inserting or deleting tuples as follows. An insertion command is of the form for , , and . When applied to a -db , it results in the updated -db with and for all . A deletion command is of the form for , , and . When applied to a -db , it results in the updated -db with and for all . Note that both types of commands may change the database’s active domain.
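The update model can be illustrated by a small sketch. The class below is purely hypothetical (the name `Database`, the dictionary-based schema, and the method names are assumptions made for illustration); it stores each relation as a set of tuples and derives the active domain from the stored tuples.

```python
class Database:
    """Toy database: one set of tuples per relation symbol (illustrative only)."""

    def __init__(self, schema):
        # schema: dict mapping a relation name to its arity (assumed representation)
        self.schema = dict(schema)
        self.relations = {name: set() for name in self.schema}

    def insert(self, name, tup):
        # Insertion command: add the tuple to the named relation.
        assert len(tup) == self.schema[name], "arity mismatch"
        self.relations[name].add(tuple(tup))

    def delete(self, name, tup):
        # Deletion command: remove the tuple if present, otherwise do nothing.
        self.relations[name].discard(tuple(tup))

    def active_domain(self):
        # All constants that occur in some stored tuple.
        return {c for tuples in self.relations.values() for t in tuples for c in t}


db = Database({"E": 2, "T": 1})
db.insert("E", (1, 2))
db.insert("T", (3,))
db.delete("T", (3,))   # the constant 3 leaves the active domain again
```

As the last line indicates, both insertions and deletions can change the active domain.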
Queries. We fix a countably infinite set of variables. An atomic query (for short: atom) of schema is of the form with , , and . The set of variables occurring in is denoted by . A conjunctive query (CQ, for short) of schema is of the form
where , , is an atomic query of schema for every , and are pairwise distinct elements in . Join queries are quantifier-free CQs, i.e., CQs of the form . A CQ is called self-join free (or non-repeating or simple) if no relation symbol occurs more than once in the query. For a CQ of the form we let be the set of all variables occurring in , and we let be the set of free variables.
For , a -ary conjunctive query (-ary CQ, for short) is of the form , where is a CQ of schema , , and is a list of the free variables of . We will often assume that the tuple is clear from the context and simply write instead of . The semantics of CQs are defined as usual: A valuation is a mapping . For a -db and an atomic query we write to indicate that . For a -ary CQ where is of the form , and for a tuple , a valuation is said to be compatible with iff for all . We write to indicate that there is a valuation that is compatible with such that for all . The query result is defined as the set of all tuples with . Clearly, .
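The semantics can be made concrete by a brute-force evaluator that tries every valuation of the variables into the active domain. This is only a sketch for intuition: the function name `evaluate` and the encoding of atoms as pairs (relation name, tuple of variables) are assumptions, and the running time is exponential in the number of variables.

```python
from itertools import product


def evaluate(atoms, free_vars, db):
    """Return the query result as a set of tuples.

    atoms:     list of (relation name, tuple of variable names)
    free_vars: tuple listing the free variables in output order
    db:        dict mapping a relation name to a set of tuples
    """
    variables = sorted({v for _, args in atoms for v in args})
    domain = {c for tuples in db.values() for t in tuples for c in t}
    result = set()
    for values in product(domain, repeat=len(variables)):
        valuation = dict(zip(variables, values))
        # The valuation satisfies the query if every atom maps to a stored tuple.
        if all(tuple(valuation[v] for v in args) in db[rel] for rel, args in atoms):
            result.add(tuple(valuation[x] for x in free_vars))
    return result


db = {"E": {(1, 2), (2, 3)}, "T": {(3,)}}
atoms = [("E", ("x", "y")), ("T", ("y",))]
unary_result = evaluate(atoms, ("x",), db)    # y is existentially quantified
boolean_result = evaluate(atoms, (), db)      # contains the empty tuple iff the query holds
```

For a Boolean query the result is either empty or contains only the empty tuple, which matches the convention of identifying the two outcomes with "no" and "yes".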
A Boolean CQ is a CQ with . As usual, for Boolean CQs we will write instead of , and instead of .
Sizes and Cardinalities. The size of a CQ is defined as the length of when viewed as a word over the alphabet . For a -ary CQ and a -db , the cardinality of the query result is the number of tuples in .
The cardinality of a -db is defined as the number of tuples stored in , i.e., . The size of is defined as and corresponds to the size of a reasonable encoding of . We will often write to denote the cardinality of ’s active domain.
Dynamic Algorithms for Query Evaluation. We use Random Access Machines (RAMs) with word-size and a uniform cost measure as a model of computation. In particular, adding and multiplying integers that are polynomial in the input size can be done in constant time. For our purposes it will be convenient to assume that . We will assume that the RAM’s memory is initialised to . In particular, if an algorithm uses an array, we will assume that all array entries are initialised to , and this initialisation comes at no cost (in real-world computers this can be achieved by using the lazy array initialisation technique, cf. e.g. the textbook [?]). A further assumption that is unproblematic within the RAM-model, but unrealistic for real-world computers, is that for every fixed dimension we have available an unbounded number of -ary arrays such that for given the entry at position can be accessed in constant time.
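One folklore way to realise lazy array initialisation is sketched below; it is not necessarily the variant presented in the cited textbook. The class name `LazyArray` and its fields are hypothetical. The idea is to keep, next to the data array, a stack of the positions written so far and a back-pointer array, so that arbitrary initial memory content can never be mistaken for a written entry. Python itself clears lists on allocation, so the sketch only mimics the logic and not the cost model.

```python
class LazyArray:
    """Array whose entries read as `default` until written, without clearing memory."""

    def __init__(self, size, default=0):
        self.default = default
        # On a RAM these three arrays would be allocated WITHOUT initialisation;
        # Python forces us to fill them, but the logic never relies on the content.
        self.data = [None] * size
        self.pos = [0] * size     # pos[i]: claimed index of i in `stack`
        self.stack = [0] * size   # stack[k]: the k-th position that was written
        self.count = 0            # number of positions written so far

    def _initialised(self, i):
        k = self.pos[i]
        return 0 <= k < self.count and self.stack[k] == i

    def get(self, i):
        return self.data[i] if self._initialised(i) else self.default

    def set(self, i, value):
        if not self._initialised(i):
            # Register position i on the stack; this certifies later reads.
            self.pos[i] = self.count
            self.stack[self.count] = i
            self.count += 1
        self.data[i] = value
```

Each `get` and `set` performs a constant number of steps, and a garbage value in `pos[i]` is exposed because the corresponding stack entry does not point back to `i`.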
Our algorithms will take as input a -ary CQ and a -db . For all query evaluation problems considered in this paper, we aim at routines and which achieve the following:
upon input of and , builds a data structure which represents (and which is designed in such a way that it supports efficient evaluation of on )
upon input of a command (with ), calling modifies the data structure such that it represents the updated database .
The preprocessing time is the time used for performing ; the update time is the time used for performing an .
In the following, will always denote the database that is currently represented by the data structure . To solve the enumeration problem under updates, apart from the routines and we aim at a routine such that calling invokes an enumeration of all tuples (without repetition) that belong to the query result . The delay is the maximum time used during a call of
until the output of the first tuple (or the end-of-enumeration message , if ),
between the output of two consecutive tuples, and
between the output of the last tuple and the end-of-enumeration message .
To solve the counting problem under updates, instead of we aim at a routine which outputs the cardinality of the query result. The counting time is the time used for performing a . To answer a Boolean conjunctive query under updates, instead of or we aim at a routine that produces the answer or of on . The answer time is the time used for performing . Whenever speaking of a dynamic algorithm, we mean an algorithm that has routines and and, depending on the problem at hand, at least one of the routines , , and .
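To make the interplay of these routines concrete, the following sketch hand-crafts a dynamic algorithm for one specific join query, namely E(x, y) ∧ T(y). It is an illustration under several assumptions and not the general data structure developed in this paper: the class and method names are hypothetical, the query is hard-wired, and Python's hash-based sets and dictionaries give expected rather than worst-case constant time per operation.

```python
class DynamicJoinET:
    """Maintains the result of E(x, y) AND T(y) under single-tuple updates."""

    def __init__(self, E=(), T=()):
        # Preprocessing: start empty and replay the initial database as insertions.
        self.E_by_y = {}       # y -> set of all x with E(x, y)
        self.T = set()         # all y with T(y)
        self.active = set()    # all y in T that have at least one E-partner
        self.count_value = 0   # current number of result tuples
        for tup in E:
            self.update("insert", "E", tup)
        for tup in T:
            self.update("insert", "T", tup)

    def update(self, op, rel, tup):
        if rel == "E":
            x, y = tup
            if op == "insert":
                xs = self.E_by_y.setdefault(y, set())
                if x not in xs:
                    xs.add(x)
                    if y in self.T:
                        self.count_value += 1
                        self.active.add(y)
            else:
                xs = self.E_by_y.get(y)
                if xs is not None and x in xs:
                    xs.remove(x)
                    if y in self.T:
                        self.count_value -= 1
                    if not xs:
                        del self.E_by_y[y]
                        self.active.discard(y)
        else:  # rel == "T"
            (y,) = tup
            partners = self.E_by_y.get(y, ())
            if op == "insert" and y not in self.T:
                self.T.add(y)
                self.count_value += len(partners)
                if partners:
                    self.active.add(y)
            elif op == "delete" and y in self.T:
                self.T.remove(y)
                self.count_value -= len(partners)
                self.active.discard(y)

    def enumerate(self):
        # Every y in `active` contributes at least one tuple, so no step is wasted.
        for y in self.active:
            for x in self.E_by_y[y]:
                yield (x, y)

    def count(self):
        return self.count_value

    def answer(self):
        return self.count_value > 0
```

The design choice worth noting is the set `active`: because enumeration only visits values of y that are guaranteed to produce output, the time between two consecutive output tuples does not depend on the size of the database.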
Throughout the paper, we often adopt the view of data complexity and use the -notation to suppress factors that may depend on the query but not on the database. For example, “linear preprocessing time” means and “constant update time” means , for a function with codomain . When writing we mean .
Our notion of q-hierarchical conjunctive queries is related to the hierarchical property that has already played a central role for efficient query evaluation in various contexts. It has been introduced by Dalvi and Suciu in [?] to characterise the Boolean CQs that can be answered in polynomial time on probabilistic databases. They obtained a dichotomy stating for self-join free queries that the complexity of query evaluation on probabilistic databases is in PTIME for hierarchical queries and #P-complete for non-hierarchical queries. Fink and Olteanu [?] generalised the notion and the dichotomy result to non-Boolean queries and to queries using negation. In the different context of query evaluation on massively parallel architectures, Koutris and Suciu [?] considered hierarchical join queries and singled out a subclass of so-called tall-flat queries as exactly those queries that can be computed with only one broadcast step in their Massively Parallel model of query evaluation. For further information on the various uses of the hierarchical property we refer the reader to [?].
The definition of hierarchical queries relies on the following notion. Consider a CQ of the form . For every variable we let be the set of all atoms of such that . Dalvi and Suciu [?] call a Boolean CQ hierarchical iff the condition
is satisfied by all variables . An example of a hierarchical Boolean CQ is .
In [?], Koutris and Suciu transferred the notion to join queries , which they call hierarchical iff condition is satisfied by all variables . In [?], Fink and Olteanu introduced a slightly different notion for a more general class of queries. Translated into the setting of CQs, their notion (only) requires that condition is satisfied by all quantified variables, i.e., variables . Obviously, both notions coincide on Boolean CQs, but on join queries Koutris and Suciu’s notion is more restrictive than Fink and Olteanu’s notion (according to which all quantifier-free CQs are hierarchical). For example, the join query
is hierarchical w.r.t. Fink and Olteanu’s notion, and non-hierarchical w.r.t. Koutris and Suciu’s notion.
In the context of answering queries under updates, our lower bound results show that the join query , as well as its Boolean version
are intractable. A further query that is hierarchical, but intractable in our setting is
To ensure tractability of a conjunctive query in our setting, we will require that its quantifier-free part is hierarchical in Koutris and Suciu’s notion and, additionally, the quantifiers respect the query’s hierarchical form. We call such queries q-hierarchical.
Note that a Boolean CQ is q-hierarchical iff it is hierarchical, and a join query is q-hierarchical iff it is hierarchical w.r.t. Koutris and Suciu’s notion. The queries and are minimal examples of queries that are not q-hierarchical because they do not satisfy condition and , respectively. Regarding the query , note that all other versions such as the query , the join query , and the Boolean query , are q-hierarchical.
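The two conditions can be checked mechanically. The sketch below assumes that they read as follows: (i) for any two variables, the sets of atoms containing them are either comparable by inclusion or disjoint, and (ii) if the atom set of a free variable x is strictly contained in the atom set of a variable y, then y is free as well. The function name and the encoding of a query as a list of variable tuples plus a set of free variables are hypothetical, and the relation names used in the examples are chosen for illustration.

```python
from itertools import combinations


def is_q_hierarchical(atoms, free_vars):
    """atoms: list with one tuple of variable names per atom; free_vars: set of names."""
    variables = {v for args in atoms for v in args}
    # occ[v]: indices of the atoms in which variable v occurs
    occ = {v: {i for i, args in enumerate(atoms) if v in args} for v in variables}
    for x, y in combinations(sorted(variables), 2):
        ax, ay = occ[x], occ[y]
        # Assumed condition (i): atom sets are nested or disjoint.
        if not (ax <= ay or ay <= ax or not (ax & ay)):
            return False
        # Assumed condition (ii): a free variable below forces a free variable above.
        if ax < ay and x in free_vars and y not in free_vars:
            return False
        if ay < ax and y in free_vars and x not in free_vars:
            return False
    return True


# Boolean query with atoms S(x), E(x, y), T(y): violates assumed condition (i).
s_e_t = [("x",), ("x", "y"), ("y",)]
# Atoms E(x, y), T(y): the outcome depends on which variables are free.
e_t = [("x", "y"), ("y",)]
checks = {
    "S-E-T, Boolean": is_q_hierarchical(s_e_t, set()),
    "E-T, only x free": is_q_hierarchical(e_t, {"x"}),
    "E-T, only y free": is_q_hierarchical(e_t, {"y"}),
    "E-T, join query": is_q_hierarchical(e_t, {"x", "y"}),
    "E-T, Boolean": is_q_hierarchical(e_t, set()),
}
```

Atoms are identified by their position in the list, so a relation symbol that occurs several times is treated as several distinct atoms.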
It is not hard to see that we can decide in polynomial time whether a given CQ is q-hierarchical (see Lemma ?). Our first main result shows that all q-hierarchical CQs can be efficiently maintained under database updates:
Note that this implies that q-hierarchical Boolean conjunctive queries can be answered in constant time. Our algorithm crucially relies on the tree-like structure of hierarchical queries, which has already been used for efficient query evaluation in [?]. In Section ? we present the notion of a q-tree and show that it precisely characterises the q-hierarchical conjunctive queries. These q-trees serve as a basis for the data structure used in our dynamic algorithm for query answering. Details on this algorithm along with a proof of Theorem ? can be found in Section 5. Let us mention that every q-tree is an f-tree in the sense of [?], but there exist f-trees that are not q-trees. The dynamic data structure that is computed by our algorithm can be viewed as an f-representation of the query result [?], but not every f-representation can be efficiently maintained under database updates.
We now discuss our further main results, which show that the q-hierarchical property is necessary for designing efficient dynamic algorithms, and that the results from Theorem ? cannot be extended to queries that are not q-hierarchical. As discussed in the introduction, our lower bounds rely on the OMv-conjecture and the OV-conjecture. For more details on these conjectures, as well as proofs of our lower bound theorems, we refer the reader to Section 4. In the following, denotes the initial database that serves as input for the routine, and denotes the size of the active domain of a dynamically changing database . Our first lower bound theorem states that non-q-hierarchical self-join free conjunctive queries cannot be enumerated efficiently under updates.
For Boolean CQs we obtain a lower bound for all queries, i.e., also for queries that are not self-join free. To state the result, we need the standard notion of a homomorphic core. A homomorphism from a conjunctive query to a conjunctive query is a mapping from to such that for all , and if is an atom of , then is an atom of . The homomorphic core (for short, core) of a conjunctive query is a minimal subquery of such that there is a homomorphism from to , but no homomorphism from to a proper subquery of . By Chandra and Merlin’s homomorphism theorem, every CQ has a unique (up to isomorphism) core , and for all databases (cf., e. g., [?]). While self-join free queries are their own cores, the situation is different for general CQs. Consider, for example, the queries
Here, is a core of and thus for every database . However, is q-hierarchical, whereas is not. The next lower bound theorem states that the result of a Boolean conjunctive query cannot be maintained efficiently if the query’s core is not q-hierarchical.
Let us now turn to the problem of computing the cardinality of the result of a query . From the Theorems ? and ? we know that we can efficiently decide whether if, and only if, the homomorphic core of is q-hierarchical. The complexity of actually counting the number of tuples in , however, depends on whether the core of the query itself (rather than the core of its Boolean version ) is q-hierarchical. As in the Boolean case, the next theorem (together with Theorem ?) implies a dichotomy for all conjunctive queries. One difference is that we have to additionally rely on the OV-conjecture.
Combining Theorem ? with the Theorems ?, ?, and ? immediately leads to the dichotomies (Theorems ?, ?, and ?) stated in the introduction.
We now give an alternative characterisation of q-hierarchical queries that sheds more light on their “tree-like” structure and will be useful for designing efficient query evaluation algorithms. We say that a CQ is connected if for any two variables there is a path such that for each there is an atom of such that . Note that every conjunctive query can be written as a conjunction of connected conjunctive queries over pairwise disjoint variable sets. We call these the connected components of the query. Note that a query is q-hierarchical if, and only if, all its connected components are q-hierarchical. Next, we define the notion of a q-tree for a connected query and show that is q-hierarchical iff it has a q-tree.
See Figure 1 for examples of q-trees. The following lemma gives a characterisation of the q-hierarchical conjunctive queries via q-trees.
To prove the lemma we inductively apply the following claim.
For simplicity, we associate with every conjunctive query the hypergraph with vertex set and hyperedges for every atom in . For a variable we let be the set of hyperedges that contain . Let us recall some basic notation concerning hypergraphs. A path of length in is a sequence of variables such that for every there is a hyperedge containing and . Two variables have distance if they are connected by a path of length , but not by a path of length . We first show that () every pair of hyperedges in has a non-empty intersection. Suppose for contradiction that there are two hyperedges and with , let and be two variables of distance , and be a shortest path connecting both variables. Hence, for every there is a hyperedge containing and but no other variable from the path. Therefore it holds that . Furthermore, we have and hence . On the other hand, the hyperedge containing and does not contain and therefore , which contradicts the assumption that is q-hierarchical.
We now prove that there is a variable that is contained in every hyperedge (and hence in every atom of ). We consider two cases. First suppose that for every pair of hyperedges , it holds that either or . Then there is a minimal hyperedge that is contained in every other hyperedge, and thus all of the variables in this hyperedge are contained in all other hyperedges as well.
Now suppose that there are two hyperedges , such that and . By (), both hyperedges have a non-empty intersection. Thus, we can choose some . We want to argue that is contained in every hyperedge of and assume for contradiction that there is a hyperedge that does not contain . By () we can choose some that is contained in the non-empty intersection of and . But now we have , , and , contradicting that is q-hierarchical.
Let be the set of all variables that are contained in every hyperedge. We have already shown that . To ensure that there is a free variable in , note that (by the definition of q-hierarchical CQs) if and , then . Hence, implies that we can choose a variable from that satisfies the conditions of the claim.
The proof of the “if” direction is easy, as every connected component that has a q-tree must be q-hierarchical, because if is a descendant of in , then .
For proving the “only if” direction of Lemma ? we inductively apply Claim ? to construct a q-tree for all connected conjunctive queries with at most variables. The induction start for empty queries is trivial. For the induction step, assume that there is a q-tree for every connected q-hierarchical query with at most variables, and let be a connected q-hierarchical query with variables. By Claim ? there is at least one variable that is contained in every atom, and if there is a free variable with this property. We choose such a variable (preferring free over quantified variables) and let be the root of .
Now we consider the query that is obtained from by “removing” from every atom. As this query is still q-hierarchical, we can find by induction a tree for every connected component of . We let be the disjoint union of the together with the root and conclude the construction by adding an edge from to the root of each . It is easy to see that this construction can be computed in polynomial time.
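The inductive construction can be sketched as a recursive procedure. The code below is an illustration under assumptions: a connected query is encoded as a non-empty list of variable sets (one per atom) together with the set of free variables, a tree is returned as a nested pair (root, list of subtrees), and `None` signals that no q-tree exists. The function names are hypothetical.

```python
def components(atoms):
    """Group atoms (frozensets of variables) into connected components."""
    comps = []  # pairs (variables of the component, atoms of the component)
    for atom in atoms:
        merged_vars, merged_atoms, untouched = set(atom), [atom], []
        for comp_vars, comp_atoms in comps:
            if comp_vars & atom:
                merged_vars |= comp_vars
                merged_atoms += comp_atoms
            else:
                untouched.append((comp_vars, comp_atoms))
        comps = untouched + [(merged_vars, merged_atoms)]
    return [comp_atoms for _, comp_atoms in comps]


def build_q_tree(atoms, free_vars):
    """Return (root, subtrees) for a connected query, or None if there is no q-tree."""
    atoms = [frozenset(a) for a in atoms]
    common = frozenset.intersection(*atoms)   # variables contained in every atom
    if not common:
        return None
    # Prefer a free variable as root; break ties alphabetically.
    root = min(common, key=lambda v: (v not in free_vars, v))
    # A quantified root is only allowed if no free variable remains below it.
    if root not in free_vars and any(v in free_vars for a in atoms for v in a):
        return None
    rest = [a - {root} for a in atoms if a - {root}]   # "remove" the root from every atom
    subtrees = []
    for comp in components(rest):
        sub = build_q_tree(comp, free_vars)
        if sub is None:
            return None
        subtrees.append(sub)
    return (root, subtrees)


tree = build_q_tree([{"x", "y", "z"}, {"x", "y"}, {"x", "w"}], {"x", "y"})
```

For the example query, the variable x occurs in every atom and becomes the root; removing it leaves two connected components, one rooted at y with child z and one consisting of w alone.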
We write to denote the -th component of an -dimensional vector , and we write for the entry in row and column of an matrix .
We consider matrices and vectors over . All the arithmetic is done over the Boolean semiring, where multiplication means conjunction and addition means disjunction. For example, for -dimensional vectors and we have if and only if there is an such that . Let be an matrix and let be a sequence of vectors, each of which has dimension . The online matrix-vector multiplication problem is the following algorithmic task. At first, the algorithm gets an matrix and is allowed to do some preprocessing. Afterwards, the algorithm receives the vectors one by one and has to output before it has access to (for each ). The running time is the overall time the algorithm needs to produce the output .
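As a baseline, the online problem can be solved naively by computing every product from scratch over the Boolean semiring. The sketch below uses hypothetical function names and represents matrices and vectors as lists of 0/1 entries; the generator makes the online requirement explicit, since each product is produced before the next vector is requested.

```python
def bool_mat_vec(M, v):
    """Boolean matrix-vector product: entry i is 1 iff M[i][j] = v[j] = 1 for some j."""
    return [int(any(M[i][j] and v[j] for j in range(len(v)))) for i in range(len(M))]


def online_mv(M, vectors):
    """Yield the product for each vector before the next vector is read."""
    for v in vectors:
        yield bool_mat_vec(M, v)


M = [[1, 0],
     [0, 0]]
outputs = list(online_mv(M, iter([[1, 0], [0, 1]])))
```

Each product takes quadratic time in the dimension, so processing as many vectors as the matrix has rows takes cubic time overall.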
It is easy to see that this problem can be solved in time; the best known algorithm runs in time [?]. The OMv-conjecture was introduced by Henzinger, Krinninger, Nanongkai, and Saranurak in [?] and states that the online matrix-vector multiplication problem cannot be solved in “truly subcubic” time for any . Note that the hardness of online matrix-vector multiplication crucially depends on the requirement that the algorithm does not receive all vectors at once. In fact, without this requirement the output could be computed in time by using any fast matrix multiplication algorithm. The OMv-conjecture has been used to prove conditional lower bounds for various dynamic problems and is a common barrier for improving these algorithms, see [?]. Contrary to classical complexity theoretic assumptions such as , this conjecture shares with other recently proposed algorithmic conjectures the less desirable fact that it can hardly be called “well-established”. However, at least we know that improving dynamic query evaluation algorithms for queries that are hard under the OMv-conjecture is a very difficult task and (even if not completely inconceivable) would lead to major breakthroughs in algorithms for e.g. matrix multiplication (see [?] for a discussion).
A variant of OMv that is useful as an intermediate step in our reductions is the following OuMv problem. Again, we are given an matrix and are allowed to do some preprocessing. Afterwards, a sequence of pairs of vectors arrives for each , and the task is to compute . As before, the algorithm has to output before it gets as input. It is known that OuMv is at least as difficult as OMv.
While OuMv and OMv turn out to be suitable for Boolean CQs and the enumeration of $k$-ary CQs, our lower bound for the counting complexity additionally relies on the orthogonal vectors conjecture (the underlying problem is also known as the Boolean orthogonal detection problem, see [?]). It is not known whether this conjecture implies or is implied by the OMv-conjecture. However, it is implied by the strong exponential time hypothesis (SETH) [?] and typically serves as a basis for SETH-based lower bounds of polynomial time algorithms.
The orthogonal vectors problem (OV) is the following static decision problem. Given two sets and of Boolean vectors of dimension , decide whether there are and such that . This problem can clearly be solved in time by checking all pairs of vectors, and also slightly better algorithms are known [?]. The OV-conjecture states that this problem cannot be solved in truly subquadratic time if . The exact formulation of this conjecture in terms of the parameters varies in the literature, but all of them imply the following simple variant which is sufficient for our purposes.
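The pairwise check mentioned above can be written down directly. The function name is hypothetical, and vectors are represented as lists of 0/1 entries.

```python
def has_orthogonal_pair(U, V):
    """Return True iff some u in U and v in V share no position where both are 1."""
    return any(
        all(not (a and b) for a, b in zip(u, v))
        for u in U
        for v in V
    )


U = [[1, 1, 0], [0, 1, 1]]
V = [[0, 1, 0], [1, 0, 0]]
found = has_orthogonal_pair(U, V)   # [0, 1, 1] and [1, 0, 0] are orthogonal
```

The running time is proportional to the number of pairs times the dimension, which is the quadratic baseline that the OV-conjecture claims cannot be improved substantially.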
Before we establish the lower bounds in full generality, we illustrate the main ideas along the two representative examples and defined in and . Note that if a conjunctive query is not q-hierarchical, then according to Definition ? there are two distinct variables and that do not satisfy one of the two conditions. The Boolean query is an example of a query where and do not satisfy the first condition (i.e., the condition of being hierarchical), and is a query where the quantifier-free part is hierarchical, but where and do not satisfy the second condition on the free variables. Intuitively, every non-q-hierarchical query has a subquery whose shape is similar to either or (we will make this precise in Section 4.4).
Let us show how the OMv-conjecture can be applied to obtain a lower bound for answering the Boolean query under updates.
We show how a query evaluation algorithm for can be used to solve OuMv. We get the matrix and start the preprocessing phase of our evaluation algorithm for with the empty database where . As this database has constant size, the preprocessing phase finishes in constant time. We apply at most update steps to ensure that is the relation corresponding to the adjacency matrix . This preprocessing takes time . If we get two vectors and in the dynamic phase of the OuMv problem, we update and so that their characteristic vectors agree with and , respectively. Now we answer on within time and output if and otherwise. Note that by construction this answer agrees with . The time of each step of the dynamic phase of OuMv is bounded by , and the overall running time for OuMv accumulates to .
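The reduction in this proof can be sketched as follows. The sketch assumes that the Boolean query asks for an edge of a binary relation whose first endpoint lies in one unary relation and whose second endpoint lies in another. The relation names `S`, `E`, `T`, the class `NaiveDynamicQuery`, and the function `solve_oumv` are hypothetical names introduced here. The class is a naive stand-in for the assumed dynamic algorithm and gives no time guarantees; it only makes the sequence of updates and queries concrete.

```python
from typing import List, Set, Tuple


class NaiveDynamicQuery:
    """Stand-in for a dynamic algorithm answering: exists x, y with S(x), E(x, y), T(y)."""

    def __init__(self) -> None:
        self.S: Set[int] = set()
        self.E: Set[Tuple[int, int]] = set()
        self.T: Set[int] = set()

    def update(self, relation: str, tup, present: bool) -> None:
        # A single-tuple insertion (present=True) or deletion (present=False).
        target = getattr(self, relation)
        if present:
            target.add(tup)
        else:
            target.discard(tup)

    def answer(self) -> bool:
        return any(x in self.S and y in self.T for (x, y) in self.E)


def solve_oumv(M: List[List[int]], pairs) -> List[int]:
    n = len(M)
    db = NaiveDynamicQuery()
    # Preprocessing: at most n^2 updates make E the relation encoded by M.
    for i in range(n):
        for j in range(n):
            db.update("E", (i, j), bool(M[i][j]))
    results = []
    for u, v in pairs:
        # Dynamic phase: at most 2n updates align S with u and T with v.
        for i in range(n):
            db.update("S", i, bool(u[i]))
            db.update("T", i, bool(v[i]))
        # One query call yields the Boolean value of u^T M v.
        results.append(int(db.answer()))
    return results
```

Replacing `NaiveDynamicQuery` with an algorithm that meets the assumed update and answer bounds would give the running time claimed in the proof.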
Note that a lower bound on the answer time of a Boolean query directly implies the same lower bounds for the time needed to count the number of tuples and for the delay of an enumeration algorithm. Furthermore, this also holds true for any query that is obtained from the Boolean query by removing quantifiers.
Now we turn to our second example . Note that the Boolean version is q-hierarchical and hence can be answered in constant time under updates by Theorem ?. Thus, a lower bound on the delay does not follow from a corresponding lower bound on the Boolean version. Instead we obtain the lower bound by a direct reduction from OMv.
We show that an enumeration algorithm with update time and delay helps to solve OMv in time . As in the proof of Lemma ?, we are given an matrix , start with the empty database where and perform at most update steps to ensure that . In the dynamic phase of OMv, when a vector arrives, we perform at most insertions or deletions to the relation such that is the characteristic vector of . Afterwards, we wait until the enumeration algorithm outputs the set and output the characteristic vector of this set. By construction we have . If the enumeration algorithm has update time and delay , then the overall running time of this step is bounded by which by the assumptions of our lemma is bounded by . Hence, the overall running time for solving OMv is bounded by .
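A minimal sketch of this reduction, assuming the query returns every element that has an outgoing edge into a unary relation. The relation names `E` and `T` and the function `solve_omv` are hypothetical, and the set comprehension is a naive stand-in for the assumed constant-delay enumeration algorithm.

```python
from typing import List, Set, Tuple


def solve_omv(M: List[List[int]], vectors: List[List[int]]) -> List[List[int]]:
    """Compute the Boolean products M v for vectors arriving one at a time."""
    n = len(M)
    # Preprocessing: at most n^2 insertions make E the relation encoded by M.
    E: Set[Tuple[int, int]] = {(i, j) for i in range(n) for j in range(n) if M[i][j]}
    T: Set[int] = set()
    results = []
    for v in vectors:
        # At most n insertions or deletions make v the characteristic vector of T.
        for j in range(n):
            if v[j]:
                T.add(j)
            else:
                T.discard(j)
        # Stand-in for the enumeration phase: all x with E(x, y) and T(y) for some y.
        answers = {x for (x, y) in E if y in T}
        # The characteristic vector of the query result equals M v.
        results.append([1 if i in answers else 0 for i in range(n)])
    return results
```

Since the query result contains at most n elements, waiting for the full enumeration costs at most n times the delay, which is where the delay bound enters the running time.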
Finally, we consider the counting problem for . Again, we cannot reduce from its tractable Boolean version. Moreover, we were not able to use OMv directly, in a similar way as in the proof of the previous lemma. Instead, we reduce from the orthogonal vectors problem.
As in the previous proof we assume that there is a dynamic counting algorithm for and start its preprocessing with the empty database over the schema . Afterwards, we use at most updates (where ) to encode all -dimensional vectors , …, in into the binary relation such that if and only if the -th component of is . Then we make at most updates to to ensure that the first vector is the characteristic vector of . Now we compute . Note that if and only if for some . If this is the case, we output that there is a pair of orthogonal vectors. Otherwise, we know that is not orthogonal to any and apply the same procedure for , which requires again at most updates to and one call of the routine. Repeating this procedure for all vectors in takes time and solves OV in subquadratic time.
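The counting-based procedure can be sketched as follows, under the same assumption on the shape of the query as in the enumeration case. The names `U`, `V`, `E`, `T`, and both function names are chosen here for illustration, and the set comprehension is a naive stand-in for the assumed dynamic counting algorithm. The count is the number of encoded vectors that share a 1-position with the current vector, so a count below the number of encoded vectors reveals an orthogonal pair.

```python
from typing import List, Set, Tuple


def has_orthogonal_pair(U: List[List[int]], V: List[List[int]]) -> bool:
    n = len(V)
    # Encode V: (i, j) is in E iff the j-th component of the i-th vector of V is 1.
    E: Set[Tuple[int, int]] = {
        (i, j) for i, vec in enumerate(V) for j, bit in enumerate(vec) if bit
    }
    for u in U:
        # Updates to T make u the characteristic vector of T.
        T = {j for j, bit in enumerate(u) if bit}
        # Stand-in for the counting routine: number of i with E(i, j) and T(j) for some j.
        count = len({i for (i, j) in E if j in T})
        if count < n:
            return True  # some vector of V shares no 1-position with u
    return False


def has_orthogonal_pair_bruteforce(U: List[List[int]], V: List[List[int]]) -> bool:
    # Reference check over all pairs, used only for comparison.
    return any(all(a * b == 0 for a, b in zip(u, v)) for u in U for v in V)
```

On small inputs the two functions can be compared directly; the brute-force variant corresponds to checking all pairs of vectors.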
4.4 Proofs of the Main Theorems
In this section we prove our lower bound Theorems ?, ?, and ?.
We will use standard notation concerning homomorphisms (cf., e.g. [?]). In particular, for CQs and we will write to indicate that is a homomorphism from to (as defined in Section 3). A homomorphism from a database to a CQ is a mapping from to such that whenever is a tuple in some relation of , then is an atom of . A homomorphism from a CQ to a database is a mapping from to such that whenever is an atom of , then . Obviously, for a -ary CQ and a database we have .
We first generalise the proof idea of Lemma ? to all Boolean conjunctive queries that do not satisfy the requirement of Definition ?. Thus assume that there are two variables and three atoms of with , , and . Without loss of generality we assume that . For a given matrix we fix a domain that consists of elements . For we let be the injective mapping from to with , , and for all .
For the matrix and for -dimensional vectors and , we define a -db with as follows (recall our notational convention that denotes the -th entry of a vector ). For every atom in we include in the tuple
for all such that , if ,
for all such that , if ,
for all such that , if , and
for all , if .
Note that the relations in the atoms , , and are used to encode , , and , respectively. Moreover, since () does not contain the variable (), two databases that are built from the same matrix but from different pairs of vectors differ in at most tuples. Therefore, can be obtained from by update steps. It follows from the definitions that is a homomorphism from to if and only if , , and . Therefore, if and only if there are such that is a homomorphism from to .
We let be the (surjective) mapping from to defined by , , and for all and . Clearly, is a homomorphism from to . Obviously, the following is true for every mapping from to and for all : if for some , then ; if for some , then ; if for some , then .
We define the partition of and say that a mapping from to respects , if for each set from the partition there is exactly one element in the image of .
For one direction assume that . Then there are such that is a homomorphism from to that respects . For the other direction assume that is a homomorphism that respects . It follows that is a bijective homomorphism from to . Therefore, it can easily be verified that is a homomorphism from to which equals for some . This implies that .
Assume for contradiction that is a homomorphism that does not respect . Then is a homomorphism from into a proper subquery of , contradicting that is a core.
Assume for contradiction that the query answering problem for and hence for its non-q-hierarchical core can be solved with update time and answer time . We can use this algorithm to solve OuMv in time as follows.
In the preprocessing phase, we are given the matrix and let , be the all-zero vectors of dimension . We start the preprocessing phase of our evaluation algorithm for with the empty database. As this database has constant size, the preprocessing phase finishes in constant time. Afterwards, we use at most insert operations to build the database . All this is done within time .
When a pair of vectors , (for ) arrives, we change the current database into by using at most update steps. By Claims ? and ? we know that if, and only if, there is a homomorphism from to . Hence, after answering the Boolean query on in time we can output the value of . The time of each step of the dynamic phase of OuMv is bounded by . Thus, the overall running time sums up to , contradicting the OMv-conjecture by Theorem ?.
The same reduction from OuMv to the query evaluation problem for conjunctive queries is also useful for the lower bound on the counting problem, provided that the query is not hierarchical. If the query is hierarchical, but the quantifiers are not in the correct form (such that the query is not q-hierarchical), then OuMv does not provide us with the desired lower bound proof and we have to resort to the OV-conjecture instead. Another crucial difference between the Boolean and the non-Boolean case is the following: the dynamic counting problem for the Boolean query is easy (because its core is ), whereas the dynamic counting problem for its non-Boolean version is hard (because the query is a non-q-hierarchical core). To take care of this phenomenon we utilise the following lemma.
In the static setting, a similar result was shown by Chen and Mengel (see Section 7.1 in [?]) and it turns out that our dynamic version can be proven using the same techniques. We remark that the lemma holds even if we drop the additional requirement on the existence of the homomorphism . However, as the databases we construct in our lower bound proof have the desired structure, this additional requirement helps to simplify the proof.
We first reduce the given task to counting tuples up to permutations, that is, computing the size of the set
Let be the set of all permutations such that the mapping extends to an endomorphism on (i.e., a homomorphism from to ). We now show that
First note that if , then for all . Thus, the -direction of follows because all are pairwise disjoint. For the other direction, consider an arbitrary tuple . In particular, there is a permutation such that for all . Furthermore, since , there is a homomorphism with for all . When combining with the homomorphism given by the lemma’s assumption, we obtain the endomorphism on that satisfies for all . Thus, . To conclude the proof of the -direction of , it remains to show that , i.e., it remains to show that there is a homomorphism with for all . Since is a permutation, for some . Thus, iterating for times yields the endomorphism on with for all . Therefore, choosing completes the proof of .
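The iteration step in this argument appears to use the standard fact that the inverse of a permutation of a finite set is one of its own powers. The following sketch illustrates that fact only; the tuple representation of permutations and the names `compose` and `inverse_as_power` are assumptions made here and do not appear in the text.

```python
from typing import Tuple

Perm = Tuple[int, ...]


def compose(p: Perm, q: Perm) -> Perm:
    """Return the permutation 'p after q', i.e. the map i -> p[q[i]]."""
    return tuple(p[i] for i in q)


def inverse_as_power(p: Perm) -> Tuple[int, Perm]:
    """Return (l, p^l) for the least l >= 0 such that p^l is the inverse of p."""
    identity = tuple(range(len(p)))
    power, l = identity, 0
    # Invariant: power == p^l. Stop once p^(l+1) is the identity.
    while compose(p, power) != identity:
        power = compose(p, power)
        l += 1
    return l, power
```

The loop terminates because some positive power of a permutation of a finite set is the identity, so applying the corresponding endomorphism repeatedly undoes the permutation on the relevant positions.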
As the set depends only on the query and can be computed in time in the preprocessing phase, it suffices to store and update information on the number , whenever the database is updated. To do this efficiently, we store for every and every the sizes of the auxiliary sets