Obtaining Information about Queries Behind Views and Dependencies

Rada Chirkova
Department of Computer Science
NC State University, Raleigh, NC 27695, USA
Ting Yu
Department of Computer Science
NC State University, Raleigh, NC 27695, USA
chirkova@csc.ncsu.edu yu@csc.ncsu.edu
Abstract

We consider the problems of finding and determining certain query answers and of determining containment between queries; each problem is formulated in the presence of materialized views and dependencies under the closed-world assumption. We show a tight relationship between the problems in this setting. Further, we introduce algorithms for solving each problem for those inputs where all the queries and views are conjunctive and the dependencies are weakly acyclic embedded dependencies [13]. We also determine the complexity of each problem under the security-relevant complexity measure introduced in [31]. The problems studied in this paper are fundamental to ensuring the correct specification of database access-control policies, particularly in the case of fine-grained access control. Our approaches can also be applied in the areas of inference control, secure data publishing, and database auditing.

1 Introduction

In this paper, we consider the problems of finding and determining certain answers to relational queries, and of containment between relational queries. For the former two problems, we build on the setting of [1], and for the latter, on the setting of [31]; we point out and exploit a tight relationship between the settings. To begin with, in all these settings the set of databases (a.k.a. instances) of interest is not given directly, and is instead specified via a set of “materialized views.” That is, we are given definitions of one or more named queries (definitions of views). We are also given a set of answer tuples for each view, that is, each view is materialized into a relation. Intuitively, each set $MV$ of materialized views specifies a set of “base instances” $I$, such that each relation in $MV$ can be obtained as an answer, on the instance $I$, to the respective view definition. In addition, for a given set of integrity constraints (dependencies) on the instances of interest, we deem relevant only those instances that satisfy all the dependencies. In summary, we consider the problems of finding and determining certain query answers and the problem of query containment, each with respect to the sets of base instances specified by a given set of materialized views and a given set of dependencies.

The following motivating example draws on the area of database security called “database-access control” [7].

Example 1.1

Suppose a relation Emp stores information about employees of a company, using attributes Name, Dept (department), and Salary. Two other relations of interest are HQDept(Dept) and OfficeInHQ(Name, Office). The relation HQDept stores the names of the departments that are located in the company headquarters; OfficeInHQ associates employees working in the headquarters with their office addresses.

We now describe the integrity constraint (dependency) that holds on the database schema P = {Emp, HQDept, OfficeInHQ}. Suppose that for all the departments located in the company headquarters, all their employees have their offices in the headquarters. This can be expressed as a “tuple-generating dependency” [2], which we call $\sigma$. (Please see Example 4.1 for a formalization.)

Let a “secret query” [20] Q ask for the names and salaries of all the employees who work in the company headquarters. We can formulate Q in the standard relational query language SQL, as follows:

(Q): SELECT DISTINCT Emp.Name, Salary FROM Emp, OfficeInHQ
     WHERE Emp.Name = OfficeInHQ.Name;

Consider three views, U, V, and W, that are defined for some class(es) of users, in SQL on the schema P. The view U returns the relation HQDept, the view V returns the department names for each employee, and W returns the salaries in each department:

(U): DEFINE VIEW U(Dept) AS SELECT * FROM HQDept;
(V): DEFINE VIEW V(Name,Dept) AS
     SELECT DISTINCT Name, Dept FROM Emp;
(W): DEFINE VIEW W(Dept,Salary) AS
     SELECT DISTINCT Dept, Salary FROM Emp;

Consider database users who are authorized to see only the answers to the views U, V and W. (In particular, these users are not authorized to see any answers to the query Q.) That is, U, V, and W are access-control views for these users on the database with the schema P. Suppose that at some point in time, these users can see the following set of answers to the views:

$MV$ = {U(sales), V(johnDoe, sales), W(sales, 50000)}

A basic security question in this setting is as follows: Can these users find out which tuples must be in the answer to the query Q on all the relevant “back-end” databases? If the answer to this question is positive, then, intuitively, there is a security breach, in the form of leakage of some answers to Q to unauthorized users.

Let the back-end databases of interest be those instances of schema P (i.e., “base instances”) that satisfy the dependency $\sigma$ and that “generate exactly $MV$ as answers to U, V, and W,” where $MV$ is the set of view answers shown above. (The latter requirement is the “closed-world assumption,” to be discussed in detail shortly.) Using an algorithm introduced in this paper, we can show that the tuple (johnDoe, 50000) is in the answer to the secret query Q on all such base instances. Thus, we can answer the above security question in the positive for the secret query Q in the “materialized-view setting” (P, {$\sigma$}, {U, V, W}, $MV$).

A tuple that is in the answer to the query of interest on all the relevant base instances is called a certain answer to the query. “Determining a certain query answer” is the problem of determining if a given tuple is a certain answer to a given query w.r.t. the given materialized views and, possibly, dependencies. (E.g., the tuple (johnDoe, 50000) in Example 1.1 is a certain answer to the query Q in the setting (P, {$\sigma$}, {U, V, W}, $MV$).) The problems of finding and determining certain query answers on the instances defined by the given materialized views have been considered both under the “open-world assumption” (OWA) and under the “closed-world assumption” (CWA). That is, for a base instance $I$, consider the answer tuples generated by the given view definitions on $I$. Then, informally, $I$ is relevant to the given instance $MV$ of materialized views under CWA iff these answer tuples together comprise exactly $MV$. In contrast, OWA permits $MV$ to not contain all such tuples.
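The difference between the two assumptions can be made concrete with a small brute-force sketch over the Emp relation of Example 1.1. This is only an illustration, not one of the algorithms of this paper: the extra values janeRoe, hr, and 70000 are hypothetical and serve only to give the enumeration something to rule out, and the helper names are our own.

```python
from itertools import chain, combinations, product

# Hypothetical universe: janeRoe, hr and 70000 are NOT in Example 1.1;
# they are added only so that the enumeration has something to rule out.
NAMES = ["johnDoe", "janeRoe"]
DEPTS = ["sales", "hr"]
SALARIES = [50000, 70000]

# The view answers of Example 1.1 that are computed from Emp.
MV_V = {("johnDoe", "sales")}   # V(Name, Dept)
MV_W = {("sales", 50000)}       # W(Dept, Salary)


def view_v(emp):
    """SELECT DISTINCT Name, Dept FROM Emp."""
    return {(name, dept) for (name, dept, _salary) in emp}


def view_w(emp):
    """SELECT DISTINCT Dept, Salary FROM Emp."""
    return {(dept, salary) for (_name, dept, salary) in emp}


def all_subsets(items):
    """Every subset of items, as tuples."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


# All 2^8 candidate Emp relations over the small universe.
candidates = [frozenset(s) for s in all_subsets(list(product(NAMES, DEPTS, SALARIES)))]

# CWA: the views must return exactly the given answers.
cwa_emps = [e for e in candidates if view_v(e) == MV_V and view_w(e) == MV_W]
# OWA: the views must return at least the given answers.
owa_emps = [e for e in candidates if MV_V <= view_v(e) and MV_W <= view_w(e)]

print(len(cwa_emps), len(owa_emps))
```

Under CWA the only surviving Emp relation is {(johnDoe, sales, 50000)}, whereas under OWA some surviving relations do not contain that tuple at all (for instance, one that pairs johnDoe with salary 70000 and janeRoe with salary 50000). This is the intuition behind the tuple (johnDoe, 50000) being a certain answer to Q under CWA but not under OWA.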

The classic paper [1] by Abiteboul and Duschka addressed the complexity of determining certain query answers under both OWA and CWA, for a range of query and view languages and in the absence of dependencies. The paper [1] also provided algorithms for finding certain query answers under both OWA and CWA, for queries defined in datalog and for views in nonrecursive datalog, with disequalities ($\neq$) permitted in both, again in the absence of dependencies. The algorithms of [1] are based on the “conditional tables” of [17]. The formulation of the latter problem was extended, in the context of database security, to account for database dependencies; the extended problem was solved in [8, 27], under OWA for more restricted (than in [1]) languages of queries and views and for “embedded dependencies” [2].

Our original motivation for this work comes from the fact that finding certain query answers is a basic problem in database security, as illustrated by Example 1.1. Moreover, in its security form, this problem makes the most sense under CWA, rather than under OWA (see, e.g., [24]). Intuitively, for those database attackers who are seeking unauthorized answers to a secret query, Q, in presence of a set of view answers $MV$, the only relevant base instances are those that “generate exactly $MV$,” that is, only the CWA-relevant instances. Suppose the owners of the back-end database run an OWA-based algorithm for finding certain answers to Q w.r.t. $MV$. They could then arrive at the empty set of answers (and thus conclude that their database is secure), even though under CWA, the set of certain answers to Q would not be empty for the same $MV$ and dependencies. Indeed, we can use the results of [8, 27] to show that in Example 1.1, the set of certain answers to the query Q is the empty set under OWA. (Footnote 1: The algorithms of [1] could not be applied to the problem instance of Example 1.1, as that instance contains a dependency that is not a “full” [2] (a.k.a. “total” [6]) dependency.)

To address this challenge, we have developed a CWA-based approach for finding certain answers to conjunctive queries (CQ queries), in presence of CQ views and of weakly acyclic embedded dependencies [13]. (Similarly to the dependency-free case [1], the CWA version of this problem is harder than the OWA version considered in [8, 27].) We then realized that our techniques can be connected to the solution of [31], by Zhang and Mendelzon, to the (CWA-based) problem of determining containment between queries in presence of materialized views. The latter problem arises, for instance, in determining whether a user query formulated on the base database relations has an equivalent rewriting in terms of the access-control views for this user. (If the answer to the question is positive, then the user query can be answered safely; see [24, 31] for the details.) A natural and practically important generalization of this problem is its extension to the consideration of dependencies holding on the base instances. We have been able to extend to the case of dependencies the algorithm of [31] for their query-containment problem w.r.t. materialized views $MV$, by building on our approach to the problem of finding certain query answers w.r.t. $MV$.

Our contributions

Our specific contributions are as follows:

  • We formalize the problem of determining containment between two queries under CWA and in presence of materialized views and dependencies, by building on the formalization of [31] that does not consider dependencies.

  • We develop an algorithm for solving this problem, in the case where the input queries and all the view definitions are CQ queries, and the input dependencies are embedded weakly acyclic.

  • We show that the problem of determining certain answers to a query, under CWA and in presence of materialized CQ views and dependencies, is a special case of the above containment problem. It follows that the algorithm that we introduce for the containment problem also solves correctly this “certain-query-answer” problem, for all inputs where the queries and views are CQ queries and the dependencies are embedded weakly acyclic.

  • For the problem of finding all certain query answers under CWA and in presence of materialized views and dependencies, we develop two algorithms that are sound and complete for all inputs where the queries and views are CQ queries and the dependencies are embedded weakly acyclic. The first algorithm uses as a subroutine our algorithm for the “certain-query-answer” problem. The second algorithm both builds on the standard approach to answering queries in relational data exchange, and uses a simpler version of the technique that we use to solve the above containment problem (w.r.t. materialized views and dependencies).

  • We determine the complexity of each of the three problems under the security-relevant complexity measure of [31]. This measure treats every part of the input as fixed except for the view answers (the materialized views) and the queries; in particular, the view definitions are fixed.

The problems that we study in the paper can be used to model and analyze a wide range of database-security problems, including database-policy analysis, secure data publishing, inference control, and auditing. For instance, database-security policies are often implemented through views. It is important to ensure that security views are defined correctly, so that no sensitive information can be learned by unauthorized parties from granted view access [8, 29]. Clearly, an information disclosure happens if an attacker can learn certain answers to a secret query. The same modeling can be applied to capture the secure data-publishing problem [20]. Similarly, in database query auditing, answers to user-issued queries can be modeled as materialized views [22, 23]. A potential inference attack happens if those answers combined together can be used to derive secret information as defined by a query.

The remainder of the paper is organized as follows. After discussing related work in Section 2, we review the background definitions and results in Section 3, and then define our three problems of interest in Section 4. In Sections 5 and 6 we introduce our approaches to solving the three problems in the CQ weakly acyclic setting. Finally, in Section 7 we address the complexity of the three problems in the CQ weakly acyclic setting.

2 Related Work

The seminal paper [1] by Abiteboul and Duschka addressed the complexity of the problem of determining whether a tuple is a certain query answer in presence of materialized views (the “certain-query-answer problem”) under both OWA and CWA, for a range of query and view languages and in the absence of dependencies. The paper [1] also solved the problem of finding all certain query answers under both OWA and CWA, for datalog queries and for views in nonrecursive datalog, with disequalities ($\neq$) permitted in both, in the absence of dependencies. The algorithms of [1] are based on the “conditional tables” of [17]. (See [2, 16] for detailed overviews of incomplete databases and of their representations, including conditional tables [17].) It is remarked in [1] that their algorithms for finding certain answers could be extended to the case of “full” (or “total”) dependencies [6, 2]. In this paper, we provide sound and complete algorithms for finding certain answers and for the certain-query-answer problem, under CWA for CQ queries and views and for “weakly acyclic” [13] embedded dependencies, of which the class of full dependencies is a proper subclass. We also address the complexity of both problems in this CQ weakly acyclic setting.

The paper [8] by Brodsky and colleagues introduced in the security context the problem of finding certain query answers under OWA and in presence of embedded dependencies, and proposed a sound and complete algorithm for the case where the queries and views are CQ queries expressible without joins. Then, Stoffel and colleagues in [27] made a connection between this problem and the techniques introduced in data exchange [13, 5, 4], by developing ([27]) a data-exchange-based approach for finding certain query answers, under OWA for CQ views, UCQ queries (i.e., unions of CQ queries), and embedded dependencies.

In this paper we extend the data-exchange approach of [13], also used in [27], to solve the problem of finding certain query answers under CWA, for CQ queries and views in presence of weakly acyclic embedded dependencies. The approaches of this current paper do not use “target-to-source dependencies” introduced in the data-exchange context in [15]. The dependencies in our approaches do use constants (as was suggested back in [13]), and thus are related to “conditional dependencies” [14]. Conditional dependencies are intuitively understood as enforcing a (perhaps constant-involving) pattern onto (typically constant-determined) subsets of the given relations. As the dependencies that we use do not have constants in their antecedents, they intuitively behave in the ways expected of standard (constant-free) embedded dependencies.

While the term “data exchange” is mentioned in the paper [20] by Miklau and Suciu, data-exchange methods are not used in the technical development in [20]. Rather, the term is used in [20] informally as a reference to today’s universal sharing of data (e.g., on the Web). The paper [20] addresses the problem of “data publishing,” in which the goal is to determine, for a given set $\mathcal{V}$ of view definitions and for a “secret query” $Q$, whether any materializations of the given views would disclose information about any answers to $Q$. (In contrast, in the three problems considered in this paper, we assume that a specific set of view materializations is provided in the problem input.) Further, the notion of disclosure in [20], inspired by Shannon’s notion of perfect secrecy [25], is as follows: There is no disclosure of query $Q$ via views $\mathcal{V}$ if and only if the probability of an adversary guessing the answer to $Q$ is the same (or, in another scenario, is almost the same) whether the adversary knows the answers to $\mathcal{V}$ or not. In this paper, we use a deterministic, rather than probabilistic, notion of disclosure of a query answer, in presence of a specific set of view materializations; this leads to different security decisions than those following from [20].

The work [31] by Zhang and Mendelzon introduced and solved the problem of “conditional containment” between two CQ queries in presence of materialized CQ views, under CWA and in the absence of dependencies. [31] also introduced a security-relevant complexity metric, under which their problem is complete. ([31] also provides an excellent overview of the connections of the query-containment problem of [31] to database-theory literature.) In our work, we add dependencies to the formulation of the problem of [31], and extend the approach of [31], both to solve the resulting problem in the CQ weakly acyclic setting and to analyze the complexity of the problem. We also uncover a tight relationship of the problem with the problems of finding and determining certain query answers, under CWA in presence of view materializations and of dependencies.

3 Preliminaries

3.1 Instances and Queries

Schemas and instances. A schema P is a finite sequence $\langle P_1, \ldots, P_m \rangle$ of relation symbols, with each $P_i$ having a fixed arity $k_i \geq 0$. An instance $I$ of P assigns to each $P_i \in$ P a finite $k_i$-ary relation $I(P_i)$, which is a set of tuples. For tuple membership in relation $I(P_i)$, we use the notation $\bar{t} \in I(P_i)$. Each element of each tuple in an instance belongs to one of two disjoint infinite sets of values, Const and Var. We call elements of Const constants, and denote them by lowercase letters $a$, $b$, $c$, $\ldots$; the elements of Var are called (labeled) nulls, denoted by symbols $\bot$, $\bot_1$, $\bot_2$, $\ldots$.

Sometimes we use the notation $P_i(\bar{t}) \in I$ instead of $\bar{t} \in I(P_i)$, and call $P_i(\bar{t})$ a fact of $I$. When all the values in $\bar{t}$ are constants, we say that $P_i(\bar{t})$ is a ground fact, and $\bar{t}$ is a ground tuple. The active domain of instance $I$, denoted $adom(I)$, is the set of all the elements of Const $\cup$ Var that occur in any facts in $I$. When each fact in $I$ is a ground fact, we call $I$ a ground instance.

Queries. We consider the class of queries called “unions of conjunctive queries with disequalities,” UCQ$^{\neq}$ queries. In the definitions for UCQ$^{\neq}$ queries, we will use the following notions of relational atom and of (dis)equality atom. Let Qvar be an infinite set of values disjoint from Const $\cup$ Var; we call Qvar the set of (query) variables. We will denote variables by uppercase letters $X$, $Y$, $\ldots$. Then $P(\bar{t})$, with $P$ a $k$-ary relation symbol and $\bar{t}$ a $k$-vector of values, is a relational atom whenever each value in $\bar{t}$ is an element of Const $\cup$ Qvar. Further, an equality (resp. disequality) atom is a built-in predicate of the form $S\ \theta\ T$, where $\theta$ is $=$ (resp. $\neq$), and each of $S$ and $T$ is an element of Const $\cup$ Qvar.

A CQ$^{\neq}$-rule over schema P, with $k$-ary ($k \geq 0$) output relation symbol $Q \notin$ P, is an expression of the form

$$Q(\bar{X}) \leftarrow P_{i_1}(\bar{U}_1) \wedge \ldots \wedge P_{i_n}(\bar{U}_n) \wedge C.$$

Here, $n \geq 1$; the vector $\bar{X}$ has $k$ elements; for each $j \in \{1, \ldots, n\}$, $P_{i_j} \in$ P; each of $Q(\bar{X})$ and $P_{i_j}(\bar{U}_j)$ is a relational atom; and $C$ is a (possibly empty) finite conjunction of disequality atoms. We consider only safe rules: That is, each variable in $\bar{X}$, as well as each variable occurring in $C$, also occurs in at least one of $\bar{U}_1$, $\ldots$, $\bar{U}_n$. All the variables of the rule that do not appear in $\bar{X}$ (i.e., the nonhead variables of the rule) are assumed to be existentially quantified. We call the atom $Q(\bar{X})$ the head of the rule, call $\bar{X}$ the head vector of the rule, and call the conjunction of its remaining atoms the body of the rule. Each atom in the body of a rule is called a subgoal of the rule. The conjunction in the body is usually written using commas, as

$$Q(\bar{X}) \leftarrow P_{i_1}(\bar{U}_1), \ldots, P_{i_n}(\bar{U}_n), C.$$

A conjunctive query with disequalities (a CQ$^{\neq}$ query) is a query defined by a single CQ$^{\neq}$-rule; a conjunctive query (a CQ query) is a CQ$^{\neq}$ query with an empty $C$. We will be referring to a CQ$^{\neq}$ query with head $Q(\bar{X})$ as just $Q(\bar{X})$, or even $Q$, whenever clear from the context. We will be using $body(Q)$ as a concise name for the body of the (rule for) $Q$.

Finally, for a $k$-ary relation symbol $Q$, with $k \geq 0$, let $\{Q^{(1)}, \ldots, Q^{(l)}\}$, $l \geq 0$, be a finite set of CQ$^{\neq}$-rules over schema P, such that $Q$ is the output relation symbol in each rule. Then we say that the set $\{Q^{(1)}, \ldots, Q^{(l)}\}$ defines a UCQ$^{\neq}$ query $Q$ over P, and that each element of the set defines a CQ$^{\neq}$ component of $Q$. In the special case where $l = 0$, we say that the corresponding query is a trivial query.

Semantics of queries. We now define the semantics of a CQ$^{\neq}$ query $Q$. In the definition, we will need the notions of homomorphism and of valuation. Consider two conjunctions, $\phi(\bar{U})$ and $\psi(\bar{V})$, of relational atoms. Then a mapping $h$ from the set of elements of $\bar{U}$ to the set of elements of $\bar{V}$ is called a homomorphism from $\phi$ to $\psi$ whenever (i) $h(c) = c$ for each constant $c$ in $\bar{U}$, and (ii) for each conjunct of the form $P(\bar{W})$ in $\phi$, the relational atom $P(h(\bar{W}))$ is a conjunct in $\psi$. (For a vector $\bar{W} = [W_1, \ldots, W_m]$, for some $m \geq 1$, we define $h(\bar{W})$ as the vector $[h(W_1), \ldots, h(W_m)]$. By convention, a homomorphism is an identity mapping when applied to empty vectors and to empty tuples.)

We define homomorphisms in the same way for the case where either one of $\phi$ and $\psi$ (or both) is a conjunction of facts. Further, for a conjunction $C$ of disequality atoms and two conjunctions $\phi$ and $\psi$ of relational atoms or of facts, we say that every homomorphism, $h$, from $\phi$ to $\psi$ is also a homomorphism from $\phi \wedge C$ to $\psi$. We will denote homomorphisms by lowercase letters $h$, $g$, $\ldots$, possibly with subscripts.

Now suppose we are given a conjunction $\phi(\bar{U})$ of relational atoms, a conjunction $\psi$ of facts, and a conjunction $C$ of disequalities over variables in $\bar{U}$ and constants in Const. Suppose there is a homomorphism, $h$, from $\phi \wedge C$ to $\psi$, such that for each atom of the form $S \neq T$ in $C$, the values $h(S)$ and $h(T)$ are distinct elements of Const $\cup$ Var. Then we say that $h$ is a valuation from $\phi \wedge C$ to $\psi$. We will use Greek letters $\nu$, $\mu$, $\ldots$, possibly with subscripts, for valuations.

Suppose we are given a $k$-ary CQ$^{\neq}$ query $Q(\bar{X})$ and an instance $I$, which we interpret as a conjunction of all the facts in $I$. Then the answer to $Q$ on $I$, denoted $Q(I)$, is

$$Q(I) = \{\, \nu(\bar{X}) \mid \nu \text{ is a valuation from } body(Q) \text{ to } I \,\}.$$

(When $k = 0$, i.e., $Q$ is a Boolean query, $\nu(\bar{X})$ is the empty tuple.) Further, for a UCQ$^{\neq}$ query $Q$ defined by rules $Q^{(1)}, \ldots, Q^{(l)}$, $l \geq 1$, and for an instance $I$, the answer to $Q$ on $I$ is the union $\bigcup_{i=1}^{l} Q^{(i)}(I)$. By convention, for every trivial query $Q$ and for every instance $I$, we have $Q(I) = \emptyset$.
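The semantics above can be sketched in a few lines of Python. This is a minimal illustration under our own encoding assumptions, not code from the paper: an instance is a dictionary from relation names to sets of tuples, an atom is a pair (relation name, argument tuple), query variables are wrapped in a hypothetical `Var` class, and everything else is treated as a constant. The sample office value and the employee janeRoe are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A query variable (an element of Qvar); any other value is a constant."""
    name: str


def valuations(body, instance, partial=None):
    """Yield each mapping of variables to values that sends every body atom to a fact."""
    partial = {} if partial is None else partial
    if not body:
        yield dict(partial)
        return
    (rel, args), rest = body[0], body[1:]
    for fact in instance.get(rel, set()):
        if len(fact) != len(args):
            continue
        nu = dict(partial)
        ok = True
        for arg, value in zip(args, fact):
            if isinstance(arg, Var):
                # Bind the variable, or check consistency with an earlier binding.
                if nu.setdefault(arg, value) != value:
                    ok = False
                    break
            elif arg != value:      # constants must be mapped to themselves
                ok = False
                break
        if ok:
            yield from valuations(rest, instance, nu)


def answer(head, body, instance, diseqs=()):
    """Q(I): images of the head vector under all valuations that respect the disequalities."""
    def val(term, nu):
        return nu[term] if isinstance(term, Var) else term

    result = set()
    for nu in valuations(body, instance):
        if all(val(s, nu) != val(t, nu) for s, t in diseqs):
            result.add(tuple(val(t, nu) for t in head))
    return result


# The secret query Q of Example 1.1 on a hypothetical instance.
N, D, S, O = Var("N"), Var("D"), Var("S"), Var("O")
q_head = (N, S)
q_body = [("Emp", (N, D, S)), ("OfficeInHQ", (N, O))]
instance = {
    "Emp": {("johnDoe", "sales", 50000), ("janeRoe", "hr", 70000)},
    "OfficeInHQ": {("johnDoe", "hq-101")},
}
print(answer(q_head, q_body, instance))
```

A disequality is passed as a pair of terms; for example, `answer((N,), [("Emp", (N, D, S))], instance, diseqs=[(D, "sales")])` keeps only the employees whose department differs from sales.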

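Query containment. A query $Q_1$ is contained in query $Q_2$, denoted $Q_1 \sqsubseteq Q_2$, if $Q_1(I) \subseteq Q_2(I)$ for every instance $I$. A classic result in [9] by Chandra and Merlin states that a necessary and sufficient condition for the containment $Q_1 \sqsubseteq Q_2$, for CQ queries $Q_1$ and $Q_2$ of the same arity, is the existence of a containment mapping from $Q_2$ to $Q_1$. Here, a containment mapping [9] from CQ query $Q_2(\bar{X}_2)$ to CQ query $Q_1(\bar{X}_1)$ is a homomorphism $h$ from $body(Q_2)$ to $body(Q_1)$ such that $h(\bar{X}_2) = \bar{X}_1$. By the results in [19], this containment test of [9] remains true when $Q_1$ has built-in predicates. Thus, the same test holds in particular when $Q_1$ is a CQ$^{\neq}$ query. It follows that, for a UCQ$^{\neq}$ query $Q_1$ and for a CQ query $Q_2$, determining whether $Q_1 \sqsubseteq Q_2$ is decidable. Indeed, the containment holds iff for each rule $Q_1^{(i)}$ of $Q_1$, $i \in \{1, \ldots, l\}$, we have $Q_1^{(i)} \sqsubseteq Q_2$.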

Canonical database. Every CQ$^{\neq}$ query $Q$ can be regarded as a symbolic ground instance $D^{(Q)}$. $D^{(Q)}$ is defined as the result of turning each relational atom $P_i(\ldots)$ in $body(Q)$ into a tuple in the relation $D^{(Q)}(P_i)$. The procedure is to keep each constant in the body of $Q$, and to replace consistently each variable in the body of $Q$ by a distinct constant different from all the constants in $Q$. The tuples that correspond to the resulting ground facts are the only tuples in the canonical database $D^{(Q)}$ for $Q$, which is unique up to isomorphism.
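As a small worked illustration of the containment test and of canonical databases, consider two CQ queries over a single binary relation symbol $P$. The queries are hypothetical and chosen only for this illustration; $D^{(Q)}$ below denotes the canonical database of a query $Q$.

```latex
\begin{align*}
Q_1(X) &\leftarrow P(X,Y),\ P(Y,Z) \\
Q_2(X) &\leftarrow P(X,Y) \\[4pt]
h &: X \mapsto X,\ Y \mapsto Y
  && \text{(containment mapping from $Q_2$ to $Q_1$, hence $Q_1 \sqsubseteq Q_2$)} \\
D^{(Q_1)} &= \{P(a,b),\ P(b,c)\}
  && \text{(freeze $X \mapsto a$, $Y \mapsto b$, $Z \mapsto c$)} \\
Q_2\bigl(D^{(Q_1)}\bigr) &= \{(a),\ (b)\} \ni (a)
  && \text{(the frozen head of $Q_1$ is returned)} \\
Q_1\bigl(D^{(Q_2)}\bigr) &= Q_1(\{P(a,b)\}) = \emptyset
  && \text{(so $Q_2 \not\sqsubseteq Q_1$)}
\end{align*}
```

The third and fifth lines express the same fact in two ways: a containment mapping from $Q_2$ to $Q_1$ exists exactly when evaluating $Q_2$ on the canonical database of $Q_1$ returns the frozen head tuple of $Q_1$.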

Remark. We have defined CQ$^{\neq}$-rules as not having explicit equality atoms in their bodies. As a result and by definition of canonical database, we are restricting our consideration to the set of all and only satisfiable CQ$^{\neq}$-rules/queries. (A CQ$^{\neq}$-rule/query $Q$ is satisfiable iff there exists an instance $I$ such that $Q(I) \neq \emptyset$.)

3.2 Dependencies and Chase

Embedded dependencies. We consider dependencies $\sigma$ of the form

$$\sigma:\ \phi(\bar{U}, \bar{W}) \rightarrow \exists \bar{V}\ \psi(\bar{U}, \bar{V}) \qquad (1)$$

with $\phi$ and $\psi$ conjunctions of relational atoms, possibly with equations added. (All the variables in $\bar{U}$, $\bar{W}$ are understood to be universally quantified.) Such dependencies, called embedded dependencies, are expressive enough to specify all usual integrity constraints, such as keys, foreign keys, and inclusion dependencies [2]. If $\psi$ is a single equation, then $\sigma$ is an equality-generating dependency (egd). If $\psi$ consists only of relational atoms, then $\sigma$ is a tuple-generating dependency (tgd). We follow [13] in allowing constants in egds and tgds. Each set of embedded dependencies without constants is equivalent to a set of tgds and egds [2]. We write $I \models \Sigma$ if instance $I$ satisfies all elements of set $\Sigma$ of dependencies. All the sets $\Sigma$ that we refer to are finite.

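Query containment under dependencies. We say that query $Q_1$ is contained in query $Q_2$ under set of dependencies $\Sigma$, denoted $Q_1 \sqsubseteq_{\Sigma} Q_2$, if for every instance $I \models \Sigma$ we have $Q_1(I) \subseteq Q_2(I)$. Queries $Q_1$ and $Q_2$ are equivalent under $\Sigma$, denoted $Q_1 \equiv_{\Sigma} Q_2$, if both $Q_1 \sqsubseteq_{\Sigma} Q_2$ and $Q_2 \sqsubseteq_{\Sigma} Q_1$ hold. $Q_1$ and $Q_2$ are equivalent (in the absence of dependencies), denoted $Q_1 \equiv Q_2$, if $Q_1 \equiv_{\emptyset} Q_2$.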

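Chase for CQ queries. Suppose we are given a CQ query $Q(\bar{X}) \leftarrow \xi(\bar{X}, \bar{Y})$ and a tgd $\sigma: \phi(\bar{U}, \bar{W}) \rightarrow \exists \bar{V}\ \psi(\bar{U}, \bar{V})$ as in Eq. (1); assume w.l.o.g. that $Q$ has none of the variables in $\bar{V}$. The (standard [16]) chase of $Q$ with $\sigma$ is applicable if there is a homomorphism $h$ from $\phi$ to $\xi$, such that $h$ cannot be extended to a homomorphism from $\phi \wedge \psi$ to $\xi$. Then, a (standard) chase step on $Q$ with $\sigma$ and $h$ is a rewrite of $Q$ into a CQ query $Q'(\bar{X}) \leftarrow \xi(\bar{X}, \bar{Y}) \wedge \psi(h(\bar{U}), \bar{V})$. It can be shown that $Q' \sqsubseteq Q$ and that $Q' \equiv_{\{\sigma\}} Q$.

We now define a (standard [16]) chase step with an egd. Assume a CQ query $Q$, as before, and an egd $\sigma$ of the form $\phi(\bar{U}) \rightarrow U_1 = U_2$. The chase of $Q$ with $\sigma$ is applicable if there is a homomorphism $h$ from $\phi$ to $\xi$ such that $h(U_1) \neq h(U_2)$. Suppose at least one of $h(U_1)$ and $h(U_2)$ is a variable; let w.l.o.g. $h(U_1)$ be a variable. Then a chase step on $Q$ with $\sigma$ and $h$ is a rewrite of $Q$ into a CQ query, $Q'$, that results from replacing all occurrences of $h(U_1)$ in $Q$ by $h(U_2)$. Again, we have $Q' \sqsubseteq Q$ and $Q' \equiv_{\{\sigma\}} Q$. If, for an $h$ as above, $h(U_1)$ and $h(U_2)$ are distinct constants, then we say that chase with $\sigma$ fails on $Q$. In this case, $Q(I) = \emptyset$ on all $I \models \{\sigma\}$.

A $\Sigma$-chase sequence (or just chase sequence, if $\Sigma$ is clear from the context) for CQ query $Q_0$ is a sequence of CQ queries $Q_0, Q_1, \ldots$ such that each query $Q_{i+1}$ ($i \geq 0$) in the sequence is obtained from $Q_i$ by a chase step using a dependency $\sigma \in \Sigma$. A chase sequence $Q = Q_0, Q_1, \ldots, Q_n$ is terminating if $D^{(Q_n)} \models \Sigma$, where $D^{(Q_n)}$ is the canonical database for $Q_n$. In this case we denote $Q_n$ by $(Q)_{\Sigma}$ and say that $(Q)_{\Sigma}$ is the (terminal) result of the chase. All chase results for a given CQ query are equivalent in the absence of dependencies [11].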

Weakly acyclic dependencies [13]. Let $\Sigma$ be a set of tgds over schema T. We construct the dependency graph of $\Sigma$ as follows. The nodes (positions) of the graph are all pairs $(R, A)$, for $R \in$ T and $A$ an attribute of $R$. We now add edges: For each tgd $\phi(\bar{U}, \bar{W}) \rightarrow \exists \bar{V}\ \psi(\bar{U}, \bar{V})$ in $\Sigma$, and for each variable $U$ in $\bar{U}$ that occurs in $\phi$ in position $(R, A)$, and that occurs in $\psi$, do the following.

  • For each occurrence of $U$ in $\psi$ in position $(S, B)$, add a regular edge from $(R, A)$ to $(S, B)$; and

  • For each existentially quantified variable $V \in \bar{V}$ and for each occurrence of $V$ in $\psi$ in position $(T, C)$, add a special edge from $(R, A)$ to $(T, C)$.

For a set $\Sigma$ of tgds and egds, with $\Sigma^{t}$ the set of all tgds in $\Sigma$, we say that $\Sigma$ is weakly acyclic if the dependency graph of $\Sigma^{t}$ does not have a cycle going through a special edge. Chase of CQ queries terminates in finite time under sets of weakly acyclic dependencies [13].
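The construction of the dependency graph and the weak-acyclicity test can be sketched as follows. This is a minimal illustration under our own encoding assumptions: a tgd is a pair (body atoms, head atoms), an atom is a pair (relation name, tuple of variable names), all terms are variables (constants are not handled), and positions are written as (relation name, 0-based argument index). The tgd `sigma` is our reading of the prose constraint of Example 1.1.

```python
def dependency_graph(tgds):
    """Return the sets of regular and special edges between positions."""
    regular, special = set(), set()
    for body, head in tgds:
        body_vars = {x for _, args in body for x in args}
        head_vars = {x for _, args in head for x in args}
        existential = head_vars - body_vars
        for rel, args in body:
            for i, x in enumerate(args):
                if x not in head_vars:
                    continue            # only variables that also occur in the head
                for hrel, hargs in head:
                    for j, y in enumerate(hargs):
                        if y == x:
                            regular.add(((rel, i), (hrel, j)))
                        elif y in existential:
                            special.add(((rel, i), (hrel, j)))
    return regular, special


def is_weakly_acyclic(tgds):
    """True iff no cycle of the dependency graph goes through a special edge."""
    regular, special = dependency_graph(tgds)
    successors = {}
    for u, v in regular | special:
        successors.setdefault(u, set()).add(v)

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(successors.get(node, ()))
        return False

    # A special edge (u, v) lies on a cycle iff u is reachable from v.
    return not any(reachable(v, u) for u, v in special)


# Emp(N, D, S) and HQDept(D) imply that OfficeInHQ(N, O) holds for some O.
sigma = ([("Emp", ("N", "D", "S")), ("HQDept", ("D",))], [("OfficeInHQ", ("N", "O"))])
# A standard non-example: R(X, Y) implies that R(Y, Z) holds for some Z.
looping = ([("R", ("X", "Y"))], [("R", ("Y", "Z"))])

print(is_weakly_acyclic([sigma]), is_weakly_acyclic([looping]))
```

For `sigma`, the only special edge leads into the second position of OfficeInHQ, which has no outgoing edges, so no cycle exists. For `looping`, the special edge from the second position of R back to itself is a cycle, and the chase with this tgd need not terminate.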

The following result is immediate from [2, 10, 11, 18].

Theorem 3.1

Given CQ queries $Q_1$, $Q_2$ and a set $\Sigma$ of embedded dependencies. Then $Q_1 \sqsubseteq_{\Sigma} Q_2$ iff $(Q_1)_{\Sigma} \sqsubseteq Q_2$ in the absence of dependencies.

Chase of instance. Let $I$ be an instance of schema P, and $\Sigma$ a set of egds and tgds; we interpret $I$ as a conjunction of its facts. We follow [11] in defining chase of $I$ with $\Sigma$ in the same way as chase of a CQ query with $\Sigma$. That is, in the chase steps we treat each distinct null in $I$ as a distinct variable (in the chase for CQ queries). Further, each chase step with a tgd that has existential variables introduces, in the result of the chase step, a distinct new null for each existential variable of the tgd. Chase sequences and chase termination are also defined in the same way as for CQ queries; the result $(I)_{\Sigma}$ of the chase of $I$ with $\Sigma$ always satisfies $\Sigma$, that is, $(I)_{\Sigma} \models \Sigma$.
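The chase of an instance with tgds can be sketched as follows. This is a simplified illustration rather than the procedure used later in the paper: it handles tgds only (no egds), assumes that all atom arguments are variables, and writes fresh labeled nulls as strings `_N1`, `_N2`, and so on. The encoding (dictionaries of sets of tuples, atoms as pairs) and the `max_steps` guard are our own assumptions.

```python
from itertools import count


def homomorphisms(atoms, instance, h=None):
    """Yield every extension of h that maps all atoms to facts of the instance."""
    h = {} if h is None else h
    if not atoms:
        yield dict(h)
        return
    (rel, args), rest = atoms[0], atoms[1:]
    for fact in instance.get(rel, set()):
        g = dict(h)
        # setdefault binds an unbound variable, or returns its earlier image.
        if len(fact) == len(args) and all(g.setdefault(a, v) == v for a, v in zip(args, fact)):
            yield from homomorphisms(rest, instance, g)


def chase(instance, tgds, max_steps=1000):
    """Standard chase with tgds given as (body atoms, head atoms) pairs."""
    inst = {rel: set(tuples) for rel, tuples in instance.items()}
    fresh = count(1)
    for _ in range(max_steps):
        step = None
        for body, head in tgds:
            for h in homomorphisms(body, inst):
                # Applicable only if h cannot be extended to the head atoms.
                if next(homomorphisms(head, inst, h), None) is None:
                    step = (head, h)
                    break
            if step is not None:
                break
        if step is None:
            return inst                 # every tgd is satisfied
        head, h = step
        for rel, args in head:
            for a in args:
                if a not in h:          # existential variable: new labeled null
                    h[a] = f"_N{next(fresh)}"
            inst.setdefault(rel, set()).add(tuple(h[a] for a in args))
    raise RuntimeError("chase did not terminate within max_steps")


sigma = ([("Emp", ("N", "D", "S")), ("HQDept", ("D",))], [("OfficeInHQ", ("N", "O"))])
base = {"Emp": {("johnDoe", "sales", 50000)}, "HQDept": {("sales",)}}
print(chase(base, [sigma])["OfficeInHQ"])
```

On the facts of Example 1.1, a single chase step adds an OfficeInHQ fact for johnDoe whose office is a fresh null, after which the tgd is satisfied and the chase stops.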

4 The Problem Statements

In this section we formalize the problems of finding and determining certain query answers and of query containment, under CWA and in presence of dependencies. We then establish a direct relationship between the latter two problems in the case of CQ view definitions.

4.1 Certain Query Answers and Query Containment w.r.t. Views and Dependencies

We begin by introducing the notion of “materialized-view setting” (“setting” for short). Suppose that we are given a schema P and a set $\Sigma$ of dependencies on P. Let $\mathcal{V}$ be a finite set of relation symbols not in P, with each symbol (view name) $V \in \mathcal{V}$ of some arity $k_V \geq 0$. Each $V \in \mathcal{V}$ is associated with a $k_V$-ary query on the schema P. We call $\mathcal{V}$ a set of views on P, and call the query for each $V \in \mathcal{V}$ the definition of the view $V$, or the query for $V$. We assume that the query for each $V \in \mathcal{V}$ is associated with ($V$ in) the set $\mathcal{V}$. We call a ground instance $MV$ of schema $\mathcal{V}$ a set of view answers for $\mathcal{V}$.

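Let $I$ be a ground instance of schema P. We say that $I$ is a $\Sigma$-valid base instance for $\mathcal{V}$ and $MV$, denoted by $\mathcal{V} \Rightarrow_{I,\Sigma} MV$, whenever (a) $I \models \Sigma$, and (b) the answer $V(I)$ to the query for $V$ on the instance $I$ is identical to the relation $MV[V]$ in $MV$, for each $V \in \mathcal{V}$. (This is the closed-world assumption (CWA), as defined in, e.g., [1], with an added requirement that $I \models \Sigma$.) Further, we say that $MV$ is a $\Sigma$-valid set of view answers for $\mathcal{V}$, denoted by $\mathcal{V} \Rightarrow_{*,\Sigma} MV$, whenever there exists a $\Sigma$-valid base instance for $\mathcal{V}$ and $MV$.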

Definition 4.1

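(Materialized-view setting $\mathcal{M}\Sigma$) Given a schema P, a set $\Sigma$ of dependencies without constants on P, a set $\mathcal{V}$ of views on P, and a ($\Sigma$-valid) set $MV$ of view answers for $\mathcal{V}$: We call $\mathcal{M}\Sigma$ = (P, $\Sigma$, $\mathcal{V}$, $MV$) the (valid) materialized-view setting for P, $\Sigma$, $\mathcal{V}$, and $MV$.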

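Let $\mathcal{M}\Sigma$ be a materialized-view setting (P, $\Sigma$, $\mathcal{V}$, $MV$), and let $Q$ be a query over P. We define the set of certain answers of $Q$ w.r.t. the setting $\mathcal{M}\Sigma$ as

$$certain_{\mathcal{M}\Sigma}(Q) = \bigcap \{\, Q(I) \mid I \text{ s.t. } \mathcal{V} \Rightarrow_{I,\Sigma} MV \text{ in } \mathcal{M}\Sigma \,\},$$

where $\mathcal{V} \Rightarrow_{I,\Sigma} MV$ states that $I$ is a $\Sigma$-valid base instance for $\mathcal{V}$ and $MV$.

That is, the set of certain answers of a query w.r.t. a setting is understood, as usual, as the set of all tuples that are in the answer to the query on all the instances relevant to the setting. (Cf. [1] for the case $\Sigma = \emptyset$.)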
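The definition can be illustrated by intersecting query answers over an explicitly enumerated family of valid base instances for Example 1.1. This is a bounded sketch and not one of the algorithms of this paper: in general the set of valid base instances is infinite, and enumerating a finite sub-family can only over-approximate the set of certain answers. We assume that Emp and HQDept are already fixed by the view answers under CWA, and we use a hypothetical finite universe for OfficeInHQ; the function names are our own.

```python
from itertools import chain, combinations, product

# Under CWA, the view answers of Example 1.1 force these two relations.
EMP = {("johnDoe", "sales", 50000)}
HQDEPT = {("sales",)}

# Hypothetical finite universe for OfficeInHQ (not from the paper).
NAMES = ["johnDoe", "janeRoe"]
OFFICES = ["o1", "o2"]


def satisfies_sigma(emp, hqdept, office):
    """Every employee of a headquarters department has some office in OfficeInHQ."""
    housed = {name for (name, _office) in office}
    return all(name in housed for (name, dept, _s) in emp if (dept,) in hqdept)


def secret_query(emp, office):
    """Q: names and salaries of the employees that appear in OfficeInHQ."""
    housed = {name for (name, _office) in office}
    return {(name, salary) for (name, _d, salary) in emp if name in housed}


def all_subsets(items):
    """Every subset of items, as tuples."""
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


office_candidates = [set(s) for s in all_subsets(list(product(NAMES, OFFICES)))]
valid_offices = [o for o in office_candidates if satisfies_sigma(EMP, HQDEPT, o)]

# Certain answers: the tuples returned on every enumerated valid base instance.
certain = set.intersection(*(secret_query(EMP, o) for o in valid_offices))
# Dropping the dependency admits the empty OfficeInHQ relation as well.
certain_without_sigma = set.intersection(*(secret_query(EMP, o) for o in office_candidates))

print(certain, certain_without_sigma)
```

With the dependency in place, every enumerated instance returns (johnDoe, 50000); without it, the instance with an empty OfficeInHQ relation returns nothing, so no tuple is certain. In this particular example the bounded enumeration agrees with the unbounded definition, because the argument depends only on whether johnDoe has some office.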

Definition 4.2

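(Certain-query-answer problem in a materialized-view setting) Given a setting $\mathcal{M}\Sigma$ = (P, $\Sigma$, $\mathcal{V}$, $MV$), a $k$-ary ($k \geq 0$) query $Q$ over the schema P in $\mathcal{M}\Sigma$, and a ground $k$-tuple $\bar{t}$. Then the certain-query-answer problem for $Q$ and $\bar{t}$ in $\mathcal{M}\Sigma$ is to determine whether $\bar{t} \in certain_{\mathcal{M}\Sigma}(Q)$.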

It is easy to show that a tuple $\bar{t}$ can be a certain answer to a query in a setting $\mathcal{M}\Sigma$ only if all the values in $\bar{t}$ are in $consts(\mathcal{M}\Sigma)$, which denotes the set of constants occurring in $\mathcal{M}\Sigma$. (For a given materialized-view setting $\mathcal{M}\Sigma$ = (P, $\Sigma$, $\mathcal{V}$, $MV$), we define $consts(\mathcal{M}\Sigma)$ as the union of $adom(MV)$ with the set of all the constants used in the definitions of the views $\mathcal{V}$.) By this observation, in Definition 4.2 we can restrict our consideration to the tuples $\bar{t}$ with this property.

The problem as in Definition 4.2, the problem of determining certain query answers, will be featured in our characterization of the relationship between the extensions of the problems of [1] and of [31] to the case of dependencies under CWA. We will also consider the problem of finding the set of certain query answers w.r.t. a setting: Given a setting $\mathcal{M}\Sigma$ and a query $Q$, find the set of certain answers of $Q$ w.r.t. $\mathcal{M}\Sigma$. In Sections 5 and 6, we will introduce algorithms for solving the “CQ weakly acyclic case” of this problem and of the problem of Definition 4.2. The CQ weakly acyclic case of each problem is the case where: (i) the setting $\mathcal{M}\Sigma$ is conjunctive (i.e., all the views in $\mathcal{V}$ are defined as CQ queries) and weakly acyclic (i.e., $\Sigma$ in $\mathcal{M}\Sigma$ is a set of weakly acyclic embedded dependencies), and (ii) each input query is a CQ query.

We now turn our attention to the problem of query containment w.r.t. a setting $\mathcal{M}\Sigma$. Our Definition 4.3 extends the formalization of this problem due to [31], to the case of dependencies on the relevant base instances.

Definition 4.3

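($\mathcal{M}\Sigma$-conditional query containment) Given a materialized-view setting $\mathcal{M}\Sigma$ and queries $Q_1$ and $Q_2$ over the schema P in $\mathcal{M}\Sigma$. Then we say that $Q_1$ is $\mathcal{M}\Sigma$-conditionally contained in $Q_2$, denoted $Q_1 \sqsubseteq_{\mathcal{M}\Sigma} Q_2$, iff for each instance $I$ s.t. $\mathcal{V} \Rightarrow_{I,\Sigma} MV$ in $\mathcal{M}\Sigma$ (that is, for each $\Sigma$-valid base instance $I$ for $\mathcal{V}$ and $MV$), we have $Q_1(I) \subseteq Q_2(I)$. Further, the problem of $\mathcal{M}\Sigma$-conditional containment for $Q_1$ and $Q_2$ is to determine whether $Q_1 \sqsubseteq_{\mathcal{M}\Sigma} Q_2$.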

4.2 An Illustration

In this subsection we recast Example 1.1 into the formal terms of Section 4.1. The results of this paper permit us to obtain correct solutions to all three problems formulated at the end of Example 4.1.

Example 4.1

The setting outlined in Example 1.1 uses the schema P , , and a weakly acyclic set of dependencies, with as follows (we abbreviate the relation names of Example 1.1 using the first letter of each name):

: .

Further, , , is the set of CQ views in , with the view definitions as follows:

Finally, for brevity we encode the constants of Example 1.1 as for johnDoe, for sales, and for 50000. Then the set of view answers of Example 1.1 can be recast for as

Now that we have specified a CQ weakly acyclic setting , consider the CQ query of Example 1.1:

Consider another CQ query, , defined as follows:

For the , , and as above and for a tuple , we have the following problems as in Section 4.1:

  1. The certain-query-answer problem for and in is “Is a certain answer of w.r.t. ?”

  2. The problem of finding the set of certain answers to w.r.t. is “Return the set for and ”; and, finally,

  3. The problem of -conditional containment for and is “Does hold?”

4.3 Relationship between the Problems

We now establish a direct relationship between the certain-query-answer problem for a given , , and , and the problem of -conditional containment for and , for the same and . (We prove the relationship for the case where all the views are defined as CQ queries.) Here, the query is constructed from the given , , and . A similar relationship was observed in [1] between the certain-query-answer problem, for a range of query and view languages in the dependency-free case under OWA, and unconditional ( ) query containment. In contrast, our result holds under CWA, in presence of dependencies, and involves -conditional query containment. Due to this result, the algorithm that we introduce in Section 5 for checking -conditional containment can also be used to solve the certain-query-answer problem, in the CQ weakly acyclic case of each problem. (The CQ weakly acyclic case of the containment problem covers CQ weakly acyclic settings and CQ input queries.)

We formulate the main result of this section, Theorem 4.1, using the following notation. For a set , , of CQ views and for a set of view answers for , consider the conjunction

The conjunction is over all the ground facts in the set . (For each , the relation in is of cardinality .) That is, we treat each ground fact in as a relational atom, and is the conjunction of all these relational atoms. (For each such that , we define .)

Observe that can be treated as the body of a CQ query over the schema . Thus, we can use the view definitions in to do the standard expansion (as in a rewriting [19]) of into a conjunction of atoms, , over the schema P. We call the expansion of over P. As an illustration, in the setting of Example 4.1, is , and is the body of the query in the example.
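The expansion step can be sketched in code as follows. This is a hedged illustration, not the construction as stated in the paper: the atom encoding `(predicate, args)`, the convention that capitalized strings are variables and lowercase strings are constants, the helper names, and the relation names `E` and `S` in the test are all assumptions of the sketch. Each ground view fact is unified with the head of its view definition, and the non-head variables of the view body receive fresh names per fact.

```python
from itertools import count

def is_var(term):
    # Sketch convention (an assumption): variables start with an uppercase letter.
    return term[:1].isupper()

def expand_view_facts(views, view_answers):
    """views: view name -> (head terms, list of body atoms).
    view_answers: view name -> set of ground tuples.
    Returns the list of atoms over the base schema obtained by
    expanding every ground view fact via its view definition."""
    fresh = count(1)
    expansion = []
    for name in sorted(view_answers):
        head, body = views[name]
        for fact in sorted(view_answers[name]):
            subst = {}
            for head_term, value in zip(head, fact):
                if is_var(head_term):
                    if subst.setdefault(head_term, value) != value:
                        raise ValueError(f"fact {fact} does not match head of {name}")
                elif head_term != value:
                    raise ValueError(f"fact {fact} does not match head of {name}")
            suffix = next(fresh)  # one fresh suffix per expanded fact
            for pred, args in body:
                new_args = tuple(
                    subst.get(a, f"{a}_{suffix}") if is_var(a) else a
                    for a in args
                )
                expansion.append((pred, new_args))
    return expansion
```

The sketch assumes that the fresh names of the form `Z_1` do not clash with variable names already in use.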

We now formulate Theorem 4.1. (Due to the page limit, the straightforward proof and other details can be found in Appendix B.) This result says that for a valid CQ materialized-view setting and for an arbitrary query and an arbitrary ground tuple , there exists a (constructible) CQ query such that the certain-query-answer problem for and in is the problem of -conditional containment for and .

Theorem 4.1

Given a valid CQ materialized-view setting P, , , , a -ary ( ) query defined in an arbitrary query language over P, and a -tuple of values in . Consider the CQ query . Then if and only if is -conditionally contained in .

Whenever determining validity of a setting is decidable (as is the case for, e.g., CQ weakly acyclic settings, via our view-verified data-exchange approach of Section 6, see Appendix J.3), not being valid implies that for every query .

5 The Query-Containment Problem

In this section we outline our approach to solving the problem of -conditional query containment. (See Definition 4.3.) We show that this approach is a correct algorithm for the CQ weakly acyclic case of the problem. Thus, our algorithm extends to the case of weakly acyclic dependencies the solution of [31] for their problem of conditional containment between CQ queries in presence of materialized CQ views. (A full version of [31], including proofs of its results, has never been published.) We show that our extension of the method of [31] is not trivial. By Theorem 4.1, the approach reported in this section is also a correct algorithm for the CQ weakly acyclic case of the certain-query-answer problem.

5.1 Intuition and Discussion

We begin by sketching our containment-checking approach via an extended example. The example illustrates, in particular, how disequalities and disjunction may arise in the chase of a CQ query in this approach.

Example 5.1

Consider CQ queries and :

Consider a dependency (full tgd) on the schema P , a view , and an instance , as follows.

.

Let us specify a setting as P, . The setting is CQ weakly acyclic by definition.

By the results reviewed in Section 3, the query is not unconditionally contained in , either in the absence of dependencies or in presence of . At the same time, by our results in this section, is -contained in . Our approach to proving it is by chasing the query using a “-transformation” of the given tgd on the schema P, as well as “-induced dependencies.” (We introduce both kinds of dependencies in Section 5.2.) The first step of the approach is to conjoin the body of with (see Section 4.3 for the definition of ):

Now the only -induced dependency, , is

.

It says that, for each subgoal of the form that could arise in the chase of with the dependencies and : Either (i) the subgoal must become , which would (correctly) give rise to in , or (ii) must be accompanied by the disequality , to prevent atoms of the form , where is a constant not equal to , from arising in . (These requirements must be satisfied for our approach to be correct, see Proposition 5.3 in Section 5.3.)

The chase of with produces a query:

(We then drop the duplicate.)

Now the dependency , which we obtain from the tgd , is Applying to the above query yields the result of chasing the query with the dependencies and :

Now the results of [19] can be used to ascertain the unconditional containment of in the query . We conclude that the query is -contained in .

Finally, suppose that we change the query slightly, by replacing its subgoal with . Then the same procedure as above can be used to show that the resulting query would not -contain the query .

In some particularly simple cases, queries can be CQ queries; see Appendix F. In general in our approach, queries are queries.

In our proposed approach for checking -containment of CQ queries, the intuition is the same as in checking query containment in presence of dependencies [2, 10, 11, 18] (see Section 3). That is, to determine if a query is contained in query on a set of instances that are “relevant” to a set of view answers , we chase to transform it into a query, , which is equivalent, by construction, to on all the relevant instances. (The “relevant instances” are the -valid base instances for the given and .) In addition to this property, the query , by its construction, “exhibits the flavor of the relevant instances,” in a very precise sense (see Proposition 5.3 in Section 5.3). These properties permit us to use a test for unconditional containment of in to correctly determine whether the original query is contained in w.r.t. all the relevant instances. (See Theorem 5.1 in Section 5.3.)

Zhang and Mendelzon in their paper [31] did precisely the above chase, with precisely the same goals and results, in the special case where no dependencies hold on the relevant instances. As an illustration, suppose that in Example 5.1 we set , while keeping the remaining inputs as they are. Then the approach of [31] for these inputs would derive the query , of that example, call this query . As is not unconditionally contained in the given query , the conclusion of [31] for these inputs would be that does not contain w.r.t. these inputs with .

Thus, in this current work we build directly on the ideas and techniques of [31]. At the same time, [31] does not make the chase process explicit, in the way in which it is explicit in the work (e.g., [2, 10, 11, 18]) on determining containment of queries in presence of dependencies. In particular, the paper [31] does not introduce dependencies that look like in Example 5.1. As a result, the authors of [31] do not have to deal with the (arguably inelegant) extensions of embedded dependencies to dependencies that may have disjunction and disequalities on the right-hand side. (Appendix C provides some details of the approach of [31].)

In this current paper, when extending the approach of [31] to the case of dependencies holding on the instances of interest, it has proved convenient for us to make explicit the -induced dependencies, such as in Example 5.1. Thus, in this work we introduce (in Section 5.2) dependencies that have both disjunctions and disequalities on the right-hand side. Disequalities in dependencies are necessary in our approach for determining -conditional containment, see Section 5.3. (As a side note, we will see in Section 6 that disequalities in dependencies are not necessary for essentially the same approach to work correctly when solving the problem of finding the set of certain answers to a CQ query w.r.t. a CQ weakly acyclic materialized-view setting.)

Not surprisingly, for CQ weakly acyclic settings and CQ queries and of interest, does not necessarily imply any of the following:

  • ;

  • ; and

  • ; here, by we denote the result of replacing by in .

(See Appendix E for all the details.)

5.2 The Dependencies and Chase Rules

We now introduce dependencies that are used in the algorithm of Section 5.3. The input to each run of the algorithm is a triple of the form , with a CQ weakly acyclic setting, and and two CQ queries. We call such triples CQ weakly acyclic input instances. For each , the algorithm determines whether holds. To make the determination, a modification (via adding ) of the query is chased with the dependencies that we introduce in the current subsection.

Building blocks for the chase

All the dependencies used in Section 5.3 are constructed using the input CQ setting . (For ease of exposition, in the remainder of this subsection we will assume that one such setting P, , , is fixed.) The construction uses normalized versions of conjunctions of relational atoms (see, e.g., [30]). That is, let be a conjunction of relational atoms. We replace in each duplicate occurrence of a variable or constant with a fresh distinct variable name. As we do each replacement, say of (or ) with , we add to the conjunction the equality atom (or ). As an illustration, if , then its normalized version is . By construction, the normalized version of each is unique up to variable renamings. For the normalized version of a conjunction , we will denote by the conjunction of all the relational atoms in , and will denote by the conjunction of all the equality atoms in . (If has no equality atoms, we set to .)
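The normalization just described can be sketched as follows. The sketch reads “duplicate occurrence” as every occurrence of a term after its first one, which is an assumption; the atom encoding `(predicate, args)`, the function name `normalize`, and the fresh-name scheme `N1, N2, ...` are likewise hypothetical and assume that no input term already uses such a name.

```python
from itertools import count

def normalize(atoms):
    """Replace each repeated occurrence of a variable or constant with a
    fresh variable and record the corresponding equality atom.
    Returns (relational atoms, equality atoms)."""
    seen = set()
    fresh = count(1)
    relational, equalities = [], []
    for pred, args in atoms:
        new_args = []
        for term in args:
            if term in seen:
                fresh_var = f"N{next(fresh)}"       # fresh distinct variable
                equalities.append((term, fresh_var))  # records term = fresh_var
                new_args.append(fresh_var)
            else:
                seen.add(term)
                new_args.append(term)
        relational.append((pred, tuple(new_args)))
    return relational, equalities
```

In the output, no term occurs twice among the relational atoms, and the equality atoms carry all the information about the original repetitions.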

A non-egd (negd) is a dependency of the form

(2)

Here, is a conjunction of relational atoms, and each of and is an element of the set of variables .

We also use chase with “implication constraints,” see, e.g., [30]. An implication constraint (ic) is a dependency of the form , with a conjunction of relational atoms.

The algorithm of Section 5.3 performs chase of queries with ics, negds, egds, and tgds, by the following rules. Let be a query. We say that chase of with an ic is applicable whenever there exists a homomorphism, , from the antecedent of to the body of . Then we say that the chase step of with fails. Similarly, we say that a chase step with a negd (as in Eq. (2)) applies to if there exists a homomorphism, , from the antecedent of to the body of . There are two cases: One, and are the same variable (or the same constant) in . Then we say that the chase step of with fails. Otherwise, we form from the result of the chase step: is a query obtained by conjoining with the atom . Chase steps with tgds are defined for queries in the same way as for CQ queries, see Section 3.2. Finally, for chase with egds, we extend the rules of Section 3.2 by requiring that whenever chase of a query with an egd is applicable, with some homomorphism , and the consequent of is of the form , then the chase step of with fails iff has the atom (or ). (This generalizes the chase-step rule for CQ queries with egds, in the part where and are distinct constants, see Section 3.2.) As we define queries as not having explicit equality atoms, our extended chase-step rules cover all possible cases for queries.
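The chase-step rules for ics and negds can be sketched as follows. This is a simplified illustration under stated assumptions: the query is encoded as a list of relational atoms plus a set of disequalities, capitalized strings in a dependency's antecedent are variables, and `chase_negd` applies the negd for every homomorphism it finds, that is, it iterates the single step described above. All names are hypothetical.

```python
def is_var(term):
    # Sketch convention (an assumption): variables start with an uppercase letter.
    return term[:1].isupper()

def homomorphisms(pattern, atoms, partial=None):
    """Enumerate the mappings of the variables in `pattern` that send
    every pattern atom onto some atom of the query body `atoms`."""
    partial = dict(partial or {})
    if not pattern:
        yield partial
        return
    (pred, args), rest = pattern[0], pattern[1:]
    for pred2, args2 in atoms:
        if pred2 != pred or len(args2) != len(args):
            continue
        h = dict(partial)
        ok = True
        for a, b in zip(args, args2):
            if is_var(a):
                if h.setdefault(a, b) != b:
                    ok = False
                    break
            elif a != b:
                ok = False
                break
        if ok:
            yield from homomorphisms(rest, atoms, h)

def ic_fails(query_atoms, antecedent):
    """A chase step with an ic fails as soon as its antecedent maps into the query."""
    return next(homomorphisms(antecedent, query_atoms), None) is not None

def chase_negd(query_atoms, disequalities, antecedent, x, y):
    """Apply a negd with consequent x != y. Returns None if some step
    fails (both sides map to the same term), else the extended set of
    disequalities, each stored as an unordered pair."""
    result = set(disequalities)
    for h in homomorphisms(antecedent, query_atoms):
        left, right = h.get(x, x), h.get(y, y)
        if left == right:
            return None
        result.add(frozenset((left, right)))
    return result
```

Storing each disequality as an unordered pair keeps the sketch from recording both orientations of the same atom.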

Dependencies for CQ setting

We now introduce one type of dependencies, -induced dependencies , to be used in the chase in the algorithm of Section 5.3. For the CQ setting with set of views, let be a -ary ( ) view with definition . We first normalize the body of into . The result of negating is (obviously) a disjunction of disequality atoms. (E.g., is .) We now proceed for as follows.
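As a generic illustration of this negation step, with placeholder symbols X, Y, Z, and c that are assumptions of this sketch and not the paper's own example, De Morgan's law turns a conjunction of equality atoms into a disjunction of disequality atoms:

```latex
\neg\,(X = Y \;\wedge\; Z = c) \;\equiv\; (X \neq Y \;\vee\; Z \neq c)
```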

If , we define the -induced generalized implication constraint (-induced gic) for as

(3)

Now suppose and , , , , with . Then we define the -induced generalized negd (-induced gnegd) for as

(4)

Here, is the head vector of the query for , with Const Qvar for . (By definition of , all the elements of occur in .) For each and for the ground tuple , we abbreviate by the conjunction . -induced gnegds are a straightforward generalization of disjunctive egds of [12, 13], with negds added “on top.”

For a CQ setting with set of view answers , the set of -induced dependencies for is the set of -induced gnegds and -induced gics constructed for all the views in as specified above. (We have shown that it is not necessary to use -induced dependencies for Boolean views with .)

Dependencies for CQ setting

We now outline how to obtain from the given CQ setting the second set of dependencies, , to be used in chase in the algorithm of Section 5.3. We convert each dependency in (in the given ) using a conversion rule that follows, and then produce as the union of the outputs. The conversion rule for a dependency of the form converts into , and then returns

.

Chase of queries with

We now define chase of queries with the dependencies . For the fixed , let be a query over the schema P in . Our definition of the chase steps can be seen as an extension of the definition of [13] for their disjunctive egds, once we postulate that chase steps are to be applied to queries, rather than to instances as is done in [13]. Intuitively, we view each dependency , of the form <