Algorithms for Provisioning Queries and Analytics

# Algorithms for Provisioning Queries and Analytics

Sepehr Assadi Department of Computer and Information Science, University of Pennsylvania. Supported in part by National Science Foundation grants CCF-1116961, CCF-1552909, and IIS-1447470 and an Adobe research award.
Sanjeev Khanna11footnotemark: 1    Yang Li11footnotemark: 1    Val Tannen111Department of Computer and Information Science, University of Pennsylvania. Supported in part by National Science Foundation grants IIS-1217798 and IIS-1302212. Email: val@cis.upenn.edu.
###### Abstract

Provisioning is a technique for avoiding repeated expensive computations in what-if analysis. Given a query, an analyst formulates hypotheticals, each retaining some of the tuples of a database instance, possibly overlapping, and she wishes to answer the query under scenarios, where a scenario is defined by a subset of the hypotheticals that are “turned on”. We say that a query admits compact provisioning if given any database instance and any hypotheticals, one can create a poly-size (in ) sketch that can then be used to answer the query under any of the possible scenarios without accessing the original instance.

In this paper, we focus on provisioning complex queries that combine relational algebra (the logical component), grouping, and statistics/analytics (the numerical component). We first show that queries that compute quantiles or linear regression (as well as simpler queries that compute count and sum/average of positive values) can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision. In contrast, exact provisioning for each of these statistics requires the sketch size to be exponential in . We then establish that for any complex query whose logical component is a positive relational algebra query, as long as the numerical component can be compactly provisioned, the complex query itself can be compactly provisioned. On the other hand, introducing negation or recursion in the logical component again requires the sketch size to be exponential in . While our positive results use algorithms that do not access the original instance after a scenario is known, we prove our lower bounds even for the case when, knowing the scenario, limited access to the instance is allowed.

## 1 Introduction

“What if analysis” is a common technique for investigating the impact of decisions on outcomes in science or business. It almost always involves a data analytics computation. Nowadays such a computation typically processes very large amounts of data and thus may be expensive to perform, especially repeatedly. An analyst is interested in exploring the computational impact of multiple scenarios that assume modifications of the input to the analysis problem. Our general aim is to avoid repeating expensive computations for each scenario. For a given problem, and starting from a given set of potential scenarios, we wish to perform just one possibly expensive computation producing a small sketch (i.e., a compressed representation of the input) such that the answer for any of the given scenarios can be derived rapidly from the sketch, without accessing the original (typically very large) input. We say that the sketch is “provisioned” to deal with the problem under any of the scenarios and following [17], we call the whole approach provisioning. Again, the goal of provisioning is to allow an analyst to efficiently explore a multitude of scenarios, using only the sketch and thus avoiding expensive recomputations for each scenario.

In this paper, we apply the provisioning approach to queries that perform in-database analytics [28]222In practice, the MADlib project [2] has been one of the pioneers for in-database analytics, primarily in collaboration with Greenplum DB [1]. By now, major RDBMS products such as IBM DB2, MS SQL Server, and Oracle DB already offer the ability to combine extensive analytics with SQL queries. . These are queries that combine logical components (relational algebra and Datalog), grouping, and numerical components (e.g., aggregates, quantiles and linear regression). Other analytics are discussed under further work.

Abstracting away any data integration/federation, we will assume that the inputs are relational instances and that the scenarios are defined by a set of hypotheticals. We further assume that each hypothetical indicates the fact that certain tuples of an input instance are retained (other semantics for hypotheticals are discussed under further work).

A scenario consists of turning on/off each of the hypotheticals. Applying a scenario to an input instance therefore means keeping only the tuples retained by at least one of the hypotheticals that are turned on. Thus, a trivial sketch can be obtained by applying each scenario to the input, solving the problem for each such modified input and collecting the answers into the sketch. However, with hypotheticals, there are exponentially (in ) many scenarios. Hence, even with a moderate number of hypotheticals, the size of the sketch could be enormous. Therefore, as part of the statement of our problem we will aim to provision a query by an algorithm that maps each (large) input instance to a compact (essentially size ) sketch.

#### Example.

Suppose a large retailer has many and diverse sales venues (e.g., its own stores, its own web site, through multiple other stores, and through multiple other web retailers). An analyst working for the retailer is interested in learning, for each product in, say, “Electronics”, a regression model for the way in which the revenue from the product depends on both a sales venue’s reputation (assume a numerical score) and a sales venue commission (in %; 0% if own store). Moreover, the analyst wants to ignore products with small sales volume unless they have a large MSRP (manufacturer’s suggested retail price). Usually there is a large (possibly distributed/federated) database that captures enough information to allow the computation of such an analytic query. For simplicity we assume in this example that the revenue for each product ID and each sales venue is in one table and thus we have the following query with a self-explanatory schema:

SELECT x.ProdID, LIN_REG(x.Revenue, z.Reputation, z.Commission) AS (B, A1, A2)
FROM  RevenueByProductAndVenue x
INNER JOIN Products y ON x.ProdID=y.ProdID
INNER JOIN SalesVenues z ON x.VenueID=z.VenueID
WHERE y.ProdCategory="Electronics" AND (x.Volume>100 OR y.MSRP>1000)
GROUP BY x.ProdID


The syntax for treating linear regression as a multiple-column-aggregate is simplified for illustration purposes in this example. Here the values under the attributes B,A1,A2 denote, for each ProdID, the coefficients of the linear regression model that is learned, i.e., Revenue = B + A1*Reputation + A2*Commission.

A desirable what-if analysis for this query may involve hypotheticals such as retaining certain venue types, retaining certain venues with specific sales tax properties, retaining certain product types (within the specified category, e.g., tablets), and many others. Each of these hypotheticals can in fact be implemented as selections on one or more of the tables in the query (assuming that the schema includes the appropriate information). However, combining hypotheticals into scenarios is problematic. The hypotheticals overlap and thus cannot be separated. With (say) hypotheticals there will be (in practice at least hundreds) of regression models of interest for each product. Performing a lengthy computation for each one of these models is in total very onerous. Instead, we can provision the what-if analysis of this query since the query in this example falls within the class covered by our positive results.

#### Our results.

Our goal is to characterize the feasibility of provisioning with sketches of compact size (see Section 2 for a formal definition) for a practical class of complex queries that consist of a logical component (relational algebra or Datalog), followed by a grouping component, and then by a numerical component (aggregate/analytic) that is applied to each group (a more detailed definition is given in Section 5).

The main challenge that we address, and the part where our main contribution lies, is the design of compact provisioning schemes for numerical queries, specifically linear () regression and quantiles. Together with the usual count, sum and average, these are defined in Section 4 as queries that take a set of numbers or of tuples as input and return a number or a tuple of constant width as output. It turns out that if we expect exact answers, then none of these queries can be compactly provisioned. However, we show that compact provisioning schemes indeed exist for all of them if we relax the objective to computing near-exact answers (see Section 2 for a formal definition). The following theorem summarizes our results for numerical queries (see Section 4):

###### Theorem 1.1 (Informal).

The quantiles, linear () regression, count, and sum/average (of positive numbers) queries can be compactly provisioned to provide (multiplicative) approximate answers to an arbitrary precision, while their exact provisioning requires the sketch size to be exponential in the number of hypotheticals.

Our results on provisioning numerical queries can then be used for complex queries, as the following theorem summarizes (see Section 5):

###### Theorem 1.2 (Informal).

Any complex query whose logical component is a positive relational algebra query can be compactly provisioned to provide an approximate answer to an arbitrary precision as long as its numerical component can be compactly provisioned for the same precision, and as long as the number of groups is not too large. On the other hand, introducing negation or recursion in the logical component requires the sketch size to be exponential in the number of hypotheticals.

#### Our techniques.

At a high-level, our approach for compact provisioning can be described as follows. We start by building a sub-sketch for each hypothetical by focusing solely on the retained tuples of each hypothetical individually. We then examine these sub-sketches against each other and collect additional information from the original input to summarize the effect of appearance of other hypotheticals to each already computed sub-sketch. The first step usually involves using well-known (and properly adjusted) sampling or sketching techniques, while the second step, which is where we concentrate the bulk of our efforts, is responsible for gathering the information required for combining the sketches and specifically dealing with overlapping hypotheticals. Given a scenario, we answer the query by fetching the corresponding sub-sketches and merging them together; the result is a new sketch that act as sketch for the input consist of the union of the hypotheticals.

We prove our lower bounds by first identifying a central problem, i.e., the Coverage problem (see Problem 1), with provably large space requirement for any provisioning scheme, and then we use reductions from this problem to establish lower bounds for other queries of interest. The space requirement of the Coverage problem itself is proven using simple tools from information theory (see Theorem 3.1).

#### Comparison with existing work.

Our techniques for compact provisioning share some similarities with those used in data streaming and in the distributed computation models of [15, 14, 39] (see Section 6 for further discussion and formal separations), and in particular with linear sketching, which corresponds to applying a linear transformation to the input data to obtain the sketch. However, due to overlap in the input, our sketches are required to to be composable with the union operation (instead of the addition operation obtained by linear sketches) and hence linear sketching techniques are not directly applicable.

Dealing with duplicates in the input (similar to the overlapping hypotheticals) has also been considered in streaming and distributed computation models (see, e.g., [13, 10]), which consider sketches that are “duplicate-resilient”. Indeed, for simple queries like count, a direct application of these sketches is sufficient for compact provisioning (see Section 4.1). We also remark that the Count-Min sketch [12] can be applied to approximate quantiles even in the presence of duplication (see [10]), i.e., is duplicate-resilient. However, the approximation guarantee achieved by the Count-Min sketch for quantiles is only additive (i.e., ), in contrast to the stronger notion of multiplicative approximation (i.e., ) that we achieve in this paper. To the best of our knowledge, there is no similar result concerning duplicate-resilient sketches for multiplicative approximation of quantiles or the linear regression problem, and existing techniques do not seem to be applicable for our purpose. Indeed one of the primary technical contributions of this paper is designing provisioning schemes that can effectively deal with overlapping hypotheticals for quantiles and linear regression.

#### Further related work.

Provisioning, in the sense used in this paper, originated in [17] together with a proposal for how to perform it taking advantage of provenance tracking. Answering queries under hypothetical updates is studied in [21, 5] but the focus there is on using a specialized warehouse to avoid transactional costs. (See also [17] for more related work.)

Estimating the number of distinct elements (corresponding to the count query) has been studied extensively in data streams [20, 4, 7, 31] and in certain distributed computation models  [15, 14, 39]. For estimating quantiles,  [33, 22, 25, 12, 26, 41] achieve an additive error of for an input of length , and [27, 11] achieve a (stronger guarantee of) -approximation. Sampling and sketching techniques for -regression problem have also been studied in [18, 36, 19, 8] for either speeding up the computation or in data streams (see [32, 38] for excellent surveys on this topic).

## 2 Problem Statement

#### Hypotheticals.

Fix a relational schema . Our goal is to provision queries on -instances. A hypothetical w.r.t. is a computable function that maps every -instance to a sub-instance . Formalisms for specifying hypotheticals are of course of interest (e.g., apply a selection predicate to each table in ) but we do not discuss them here because the results in this paper do not depend on them.

#### Scenarios.

We will consider analyses (scenario explorations) that start from a finite set of hypotheticals. A scenario is a non-empty set of hypotheticals . The result of applying a scenario to an instance is defined as a sub-instance . In other words, under , if any is said to be turned on (similarly, any is turned off), each turned on hypothetical will retain the tuples from .

###### Definition 1 (Provisioning).

Given a query , to provision means to design a pair of algorithms: a compression algorithm that takes as input an instance and a set of hypotheticals, and outputs a data structure called a sketch, and an extraction algorithm that for any scenario , outputs using only (without access to ).

To be more specific, we assume the compression algorithm takes as input an instance and hypotheticals along with the sub-instances that they define. A hypothetical will be referred to by an index from , and the extraction algorithm will be given scenarios in the form of sets of such indices. Hence, we will refer to a scenario where by abusing the notation as . Throughout the paper, we denote by the number of hypotheticals (i.e. ), and by the size of the input instance (i.e., ).

We call such a pair of compression and extraction algorithms a provisioning scheme. The compression algorithm runs only once; the extraction algorithm runs repeatedly for all the scenarios that an analyst wishes to explore. We refer to the time that the compression algorithm requires as the compression time, and the time that extraction algorithm requires for each scenario as the extraction time.

The definition above is not useful by itself for positive results because it allows for trivial space-inefficient solutions. For example, the definition is satisfied when the sketch is defined to be a copy of itself or, as mentioned earlier, a scenario-indexed collection of all the answers. Obtaining the answer for each scenario is immediate for either case, but such a sketch can be prohibitively large as the number of tuples in could be enormous, and the number of scenarios is exponential in .

This discussion leads us to consider complexity bounds on the size of the sketches.

###### Definition 2 (Compact provisioning).

A query can be compactly provisioned if there exists a provisioning scheme for that given any input instance and any set of hypotheticals , constructs a sketch of size bits, where and .

We make the following important remark about the restrictions made in Definitions 1 and 2.

###### Remark 2.1.

At first glance, the requirement that the input instance cannot be examined at all during extraction may seem artificial, and the same might be said about the size of the sketch depending polynomially on rather than a more relaxed requirement. However, we show in Theorem 3.1 that our central lower bound result holds even if a portion of size of the input instance can be examined during extraction after the scenario is revealed, and even if the space dependence of the sketch is only restricted to be (instead of depending only polynomially on ). These additional properties transfer to all our lower bound results (i.e., Theorems 4.14.74.105.6) although we choose not to restate them in each of them. In spite of these additional properties, the positive results we obtain (i.e., Theorems 4.34.44.115.1) all use sketches whose space requirements depend polynomially only on and do not require examining the original database at all during the extraction. These calibration results further justify our design choices for compact provisioning.

Even though the definition of compact provisioning does not impose any restriction on either the compression time or the extraction time, all our positive results in this paper are supported by (efficient) polynomial time algorithms. Note that this is data-scenario complexity: we assume the size of the query (and the schema) to be a constant but we consider dependence on the size of the instance and the number of hypotheticals. Our negative results (lower bounds on the sketch size), on the other hand, hold even when the compression and the extraction algorithms are computationally unbounded.

#### Exact vs. approximate provisioning.

Definition 2 focused on exact answers for the queries. While this is appropriate for, e.g., relational algebra queries, as we shall see, for queries that compute numerical answers such as aggregates and analytics, having the flexibility of answering queries approximately is essential for any interesting positive result.

###### Definition 3 (ϵ-provisioning).

For any , a query can be -provisioned if there exists a provisioning scheme for , whereby for each scenario , the extraction algorithm outputs a approximation of , where is the input instance.

We say a query can be compactly -provisioned if can be -provisioned by a provisioning scheme that, given any input instance and any set of hypotheticals , creates a sketch of size bits.

We emphasize that throughout this paper, we only consider the approximation guarantees which are relative (multiplicative) as opposed to the weaker notion of additive approximations. The precise definition of relative approximation guarantee will be provided for each query individually. Moreover, as expected, randomization will be put to good use in -provisioning. We therefore extend the definition to cover the provisioning schemes that use both randomization and approximation.

###### Definition 4.

For any , an -provisioning scheme for a query is a provisioning scheme where both the compression and extraction algorithms are allowed to be randomized and the output for every scenario is a -approximation of with probability .

An -provisioning scheme is called compact if it constructs sketches of size only
bits, has compression time that is , and has extraction time that is .

In many applications, the size of the database is a very large number, and hence the exponent in the -dependence of the compression time might become an issue. If the dependence of the compression time on the input size is essentially linear, i.e., we say that the scheme is linear. We emphasize that in all our positive results for queries with numerical answers we give compact -linear provisioning schemes, thus ensuring efficiency in both running time and sketch size.

#### Complex queries.

Our main target consists of practical queries that combine logical, grouping, and numerical components. In Section 5, we focus on complex queries defined by a logical (relational algebra or Datalog) query that returns a set of tuples, followed by a group-by operation (on set of grouping attributes) and further followed by numerical query that is applied to each sets of tuples resulting from the grouping. This class of queries already covers many practical examples. We observe that the output of such a complex query is a set of tuples where is the number of distinct values taken by the grouping attributes. Therefore, the size of any provisioning sketches must also depend on . We show (in Theorem 5.1) that a sketch for a query that involves grouping can be obtained as a collection of sketches. Hence, if each of the sketches is of compact size (as in Definitions 2 and 4) and the value itself is bounded by , then the overall sketch for the complex query is also of compact size. Note that is typically small for the kind of grouping used in practical analysis queries (e.g., number of products, number of departments, number of locations, etc.). Intuitively, an analyst would have trouble making sense of an output with a large number of tuples.

#### Notation.

For any integer , denotes the set . The notation suppresses , , , and factors. All logarithms are in base two unless stated otherwise.

## 3 A “Hard” Problem for Provisioning

To establish our lower bounds in this paper, we introduce a “hard” problem called Coverage. Though not defined in the exact formalism of provisioning, the Coverage problem can be solved by many provisioning schemes using proper “reductions” and hence a lower bound for the Coverage problem can be used to establish similar lower bounds for provisioning various queries.

We start by describing the Coverage problem in details and then present our lower bound. In Appendix A, we survey some simple tools from information theory that we need in our lower bound proof.

### 3.1 The Coverage Problem

Informally speaking, in the Coverage problem, we are given a collection of subsets of a universe and the goal is to “compress” this collection in order to answer to the questions in which indices of some subsets in the collection are provided and we need to figure out whether these subsets cover the universe or not. We are interested in compressing schemes for this problem that when answering each question, in addition to the already computed summary of the collection, also have a limited access to the original instance (see Remark 2.1 for motivation of this modification). The Coverage problem is defined formally as follows.

###### Problem 1 (Coverage).

Suppose we are given a collection of the subsets of . The goal in the Coverage problem is to find a compressing scheme for , defined formally as a triple of algorithms:

• A compression algorithm which given the collection creates a data structure .

• An examination algorithm which given a subset of , a question, and the data structure , computes a set of indices and lookup for each and each (), whether or not. The output of the examination algorithm is a tuple , where .

• An extraction algorithm which given a question , the data structure , and the tuple , outputs “Yes”, if and “No” otherwise.

We refer to the size of , denoted by , as the storage requirement of the compression scheme and to the size of , denoted by , as the examination requirement of the scheme. The above algorithms can all be randomized; in that case, we require that for each question , the final answer (of the extraction algorithm) to be correct with a probability at least . Note that this choice of constant is arbitrary and is used only to simplify the analysis; indeed, one can always amplify the probability of success by repeating the scheme constant number of times and return the majority answer.

While Coverage is not stated in the exact formalism of provisioning, the analogy between this problem and provisioning schemes should be clear. In particular, for our lower bound proofs for provisioning schemes, we can alter the Definition 1 to add an examination algorithm and allow a similar access to the original database to the provisioning scheme.

### 3.2 The Lower Bound

We prove the following lower bound on storage and examination requirement of any compressing scheme for the Coverage problem.

###### Theorem 3.1.

Any compressing scheme for the Coverage problem that answers each question correctly with probability at least , either has storage requirement or examination requirement of bits.

Allowing access to the original input in Theorem 3.1 makes the lower bound very robust. However, due to this property, the lower bound does not seem to follow from standard communication complexity lower bounds and hence we use an information-theoretic approach to prove this theorem directly, which may be of independent interest. We note that since our other lower bounds are typically proven using a reduction from the Coverage problem, the properties in Theorem 3.1 (i.e., allowing randomization and access to the database after being given the scenario) also hold for them and we do not mention this explicitly.

In order to prove this lower bound, by Yao’s minimax principle [40] (see also [34]), it suffices to define a distribution of instances of the problem and show that any deterministic algorithm that is correct on the input chosen according to this distribution with probability , has to have either large storage requirement or large examination requirement. We define the following distribution (for simplicity assume is even).

The following claim is immediate.

###### Claim 3.2.

For any question , iff .

Fix any deterministic compressing scheme for the distribution . We define , , and as the random variables corresponding to, respectively, the data structure , the question , and the examination indices in this compressing scheme. Moreover is a random variable for the input string , is for the bijection , and () is for the subset . We use for the output of the examination algorithm.

#### Overview of the proof.

The intuition behind the proof is as follows. By Claim 3.2, in order to give a correct answer to the question , the scheme has to determine the value of correctly. Suppose first that the scheme only consists of compression and extraction algorithms, i.e., without the examination algorithm. In this case, even if we give the bijection to the extraction algorithm, the extraction algorithm has to effectively recover the value of from , where is chosen uniformly at random from . In this case, simple information-theoretic facts imply that has to be of size .

Now consider the other case where we remove the compression algorithm. In this case, even if we give the string to the examination algorithm and assume that upon examining an entry in the input, it can determine whether or not, for the extraction algorithm to be able to find the value of correctly, it better be the case that . In other words, the set of indices computed by the examination algorithm should contain the target index . However, for any fixed of size , the probability that is , and hence the extraction algorithm cannot recover the correct answer with high probability.

To prove Theorem 3.1, we have to consider schemes that consist of both compression and examination algorithms and a priori it is not clear that how much the interleaved information obtained from both these algorithms can help the extraction algorithm to compute the final answer, especially considering the fact that having a compression algorithm also helps examination algorithm in finding the “correct” index. However, using a careful analysis we separate the information given from these two algorithms to the extraction algorithm and argue that at least one of the examination or compression requirements of any scheme has to be of size .

###### Proof of Theorem 3.1.

Suppose is the random variable which is if the correct answer is Yes, and is zero otherwise and is the random variable that denotes the index of the string which determines the correct answer, i.e., (by Claim 3.2). Let be the probability of failure; we have,

 H2(δ) ≥H(θ∣D,Q,SJ) (Fano’s inequality, Claim A.1-(5)) ≥H(θ∣D,Q,SJ,I) (Conditioning reduces entropy, Claim A.1-(2) ) =H(θ∣D,Q,I)−I(θ;SJ∣D,Q,I)

We bound each term in the above equation in the following two claims separately.

Suppose ; then .

###### Proof.
 H(θ∣D,Q,I) ≥H(θ∣D,Q,I,Σ) (Conditioning reduces entropy, Claim A.1-(2) ) =H(θ∣D,I,Σ) (Q is uniquely determined by Σ and I) =n∑i=1Pr(I=i)⋅H(θ∣D,I=i,Σ) =n∑i=11n⋅H(Xi∣D,I=i,Σ) (I is uniform on [n] and θ=Xi)

Now, notice that is independent of the event , and moreover conditioned on , is a function of alone and hence is independent of . Additionally, is chosen independent of the value of ; hence is also independent of . Consequently, we can drop conditioning on and have,

 H(θ∣D,Q,I) ≥n∑i=11n⋅H(Xi∣D,Σ) ≥H(X∣D,Σ)n (By subadditivity of entropy, Claim A.1-(4)) =H(X∣Σ)−I(X,D∣Σ)n ≥H(X)−H(D)n (X⊥Σ and I(X,D∣Σ)≤H(D∣Σ)≤H(D)) ≥1−|D|n=1−o(1)

where in the last inequality we use the fact that is uniformly chosen from a domain of size and hence and (both by Claim A.1-(1)).

Suppose ; then .

###### Proof.

We first define some notation. For a fixed and , we use to denote the unique set of examined indices (recall that the compressing scheme is deterministic over the distribution ). For a triple as an assignment to , we say that the tuple is good if , where . We use the set to denote the set of all valid tuples and the set to denote the set of all good tuples. Using this notation,

 I(θ;SJ∣D,Q,I) =∑T∈XPr((D,Q,I)=T)⋅I(θ;SJ∣(D,Q,I)=T)

We decompose the above summands into two parts; one over and one over . We first argue that for a tuple , . This is because conditioned on , by Claim 3.2, and is independent of and , and hence is independent of as well whenever . For a tuple , we simply use the upper bound , where . Consequently,

 I(θ;SJ∣D,Q,I) =∑T∉XgPr((D,Q,I)=T)⋅I(θ;SJ∣(D,Q,I)=T) ≤∑T∉XgPr((D,Q,I)=T)⋅|θ| =Pr((D,Q,I)∉Xg) (|θ|=1)

We now show that . Let be the distribution of (i.e., a uniform distribution on ) and for any and , be the distribution of . Fix any set of size ; it is clear that under , . We show that the total variation distance of and (for “typical” choices of and ) is bounded away from and hence as well (using Claim A.4). To achieve the bound on the total variation distance, we instead bound the KL-divergence of from the uniform distribution (in expectation over the choice of and ) and then use Pinsker’s inequality (Claim A.3) to bound the total variation distance between these distributions. We have,

 ED,Q[D(νD,Q∥μ)] =I(I;D,Q)=H(I)−H(I∣D,Q) (by Claim A.2) =logn−∑Q∈FPr(Q=Q)⋅H(I∣D,Q=Q) =logn−∑Q∈F1n⋅H(Σ(Q)∣D) (I=Σ(Q) and Σ(Q)⊥Q=Q,D⊥Q=Q) ≤logn−H(Σ∣D)n (By subadditivity of entropy, Claim A.1-(4)) ≤logn−|Σ|−|D|n (H(Σ∣D)≥|Σ|−|D|, Claim A.1-(1)) =logn−log(n!)−o(n)n (|Σ|=logn!, |D|=o(n)) ≤logn−nlogn−nloge+O(logn)−o(n)n (by Stirling approximation of n!) =loge+o(1)

We now have,

 ED,Q[∣∣νD,Q−μ∣∣] ≤ED,Q[√12⋅D(νD,Q∥μ)] (Pinsker’s inequality, Claim A.3) ≤√12⋅ED,Q[D(νD,Q∥μ)] (Convexity of √⋅) ≤√loge2+o(1)<0.85

Fix any pair , by Claim A.4,

 PrνD,Q(i∈exam(D,Q))≤Prμ(i∈% exam(D,Q))+∣∣νD,Q−μ∣∣

By taking expectation over ,

 ED,Q[PrνD,Q(i∈exam(D,Q))] ≤ED,Q[Prμ(i∈exam(D,Q))]+ED,Q[∣∣νD,Q−μ∣∣] ≤o(1)+0.85≤0.9

Noting that the LHS in the above equation is equal finalizes the proof.

By plugging in the values from the above two claims, we have

 H2(δ)≥H(θ∣D,Q,I)−I(θ;SJ∣D,Q,I)≥1−o(1)−0.9=0.1−o(1)

which is in contradiction with the fixed value of (for ). Hence, , implying that the storage requirement or examination requirement of the scheme has to be of size . The bound of also follows from the fact that in this construction.

Notice that in the proof of Theorem 3.1, the constructed instances have a simple property (the union in each question is either or for some ). We extract this useful property and provide the following stronger version of Theorem 3.1 as a corollary (to its proof). This stronger version is used to prove a lower bound for the average query in Theorem 4.1.

###### Corollary 3.5.

Suppose the given sets in the Coverage problem are promised to have the property that for any set , is either or for some . The lower bound of Theorem 3.1 still holds for this promised version of the Coverage problem.

## 4 Numerical Queries

In this section, we study provisioning of numerical queries, i.e., queries that output some (rational) number(s) given a set of tuples. In particular, we investigate aggregation queries including count, sum, average, and quantiles (therefore min, max, median, rank, and percentile), and as a first step towards provisioning database-supported machine learning, linear regression. We assume that the relevant attribute values are rational numbers of the form where both are integers in range for some .

### 4.1 The Count, Sum, and Average Queries

In this section, we study provisioning of the count, sum, and average queries, formally defined as follows. The answer to the count query is the number of tuples in the input instance. For the other two queries, we assume a relational schema with a binary relation containing two attributes: an identifier (key) and a weight. We say that a tuple is smaller than the tuple , if the weight of is smaller than the weight of . Given an instance , the answer to the sum query (resp. the average query) is the total weights of the tuples (resp. the average weight of the tuples) in .

We first show that none of the count, sum, average queries can be provisioned both compactly and exactly, which motivates the -provisioning approach, and then briefly describe how to build a compact -linear provisioning scheme for each of them. Our lower bound results for count, sum, and average queries are summarized in the following theorem.

###### Theorem 4.1.

Exact provisioning of any of the count, sum, or average queries requires sketches of size bits.

###### Proof.

We provide a proof for each of the queries separately.

Count query. Given , where each is a subset of , we solve Coverage using a provisioning scheme for the count query. Define an instance of a relational schema with a unary relation , where . Define a set of hypotheticals, where for any , . For any scenario , the count of is iff . Hence, any provisioning scheme for the count query solves the Coverage problem and the lower bound follows from Theorem 3.1.

Sum query. The lower bound of the sum follows immediately from the one for count by setting all weights to be .

Average query. For simplicity, in the following construction we will omit the id attribute of the tuples as the weights of the tuples are distinct and can be used to identify each tuple.

Given , where each is a subset of , with the promise that for any set , is either or for some , we want to solve this restricted Coverage problem. By Corollary 3.5, the restricted Coverage problem also needs a data structure of size bits.

Define an instance of the relational schema with a unary relation , where . Define a set of hypotheticals, where for any , . For any scenario , the average weight of is (resp. for some ) if is equal to (resp. not equal to) . The two values are equal iff , which, if we assume is even, could never happen. Therefore knowing the average value is enough to solve the restricted Coverage problem, and the lower bound follows.

We further point out that if the weights can be both positive and negative, the sum (and average) cannot be compactly provisioned even approximately, and hence we will focus on -provisioning for positive weights.

###### Theorem 4.2.

Provisioning of the sum (and average) query approximately (up to any multiplicative factor) over the input instances with both positive and negative weights requires sketches of size bits.

###### Proof.

We use a reduction from the Coverage problem. Suppose we are given a collection of sets , where each is a subset of in the Coverage problem. Consider the relational schema where is binary and the second attribute of is the weight. Let . We define hypotheticals: for any , . For any set , let be the scenario consisting of ; then the total weight of is zero iff . Therefore, any multiplicative approximation to the sum query would distinguish between the zero and non-zero cases, and hence solves the Coverage problem. The lower bound on the size of the sketch now follows from Theorem 3.1.

We conclude this section by explaining the -provisioning schemes for the count, sum, and average queries. Formally,

###### Theorem 4.3 (ϵ-provisioning count).

For any , there exists a compact -linear provisioning scheme for the count query that creates a sketch of size bits.

###### Theorem 4.4 (ϵ-provisioning sum & average).

For instances with positive weights, for any , there exists compact -linear provisioning schemes for the sum and average queries, with a sketch of size bits.

We remark that the results in Theorems 4.3 and 4.4 are mostly direct application of known techniques and are presented here only for completeness.

The count query can be provisioned by using linear sketches for estimating the -norm (see, e.g., [31]) as follows. Consider each hypothetical as an -dimensional boolean vector , where the -th entry is iff the -th tuple in belongs to . For each , create a linear sketch (using bits of space) that estimates the -norm [31]. Given any scenario , combine (i.e., add together) the linear sketches of the hypotheticals in and use the combined sketch to estimate the -norm (which is equal to the answer of count).

###### Remark 4.5.

Note that we can directly use linear sketching for provisioning the count query since counting the duplicates once (as done by union) or multiple times (as done by addition) does not change the answer. However, this is not the case for other queries of interest like quantiles and regression and hence linear sketching is not directly applicable for them.

Here, we also describe a self-contained approach for -provisioning the count query with a slightly better dependence on the parameter ( instead of ).

We use the following fact developed by [7] in the streaming model of computation to design our scheme. For a bit string , denote by trail the number of trailing ’s in . Given a list of integers from the universe , a function , and an integer , the -trail of is defined as the list of the smallest values (use binary expression of ), where the duplicate elements in are counted only once.

###### Lemma 4.6 ([7]).

Given a list , with distinct elements, pick a random pairwise independent hash function , and let . If is the largest value in the -trail of and , then with probability at least , is a approximation of .

We now define our -linear provisioning scheme for the count query.

We call a sketch created by the above compression algorithm a CNT-Sketch. For each hash function and each , we record concise identifiers and the number of trailing ’s ( bits each) of at most tuples. Since at most tuples will be assigned with a concise identifier, bits suffice for describing each concise identifier. Hence the total size of a CNT-Sketch is bits. We now prove the correctness of this scheme.

###### Proof of Theorem 4.3.

Fix a scenario ; for any picked hash function , since the -th smallest value of the union of the recorded -trail, , is equal to the -th smallest value in the -trail of . Hence, by Lemma 4.6, with probability at least , is a approximation of . By taking the median over hash functions, the probability of failure is at most . If we take union bound over all scenarios, with probability at least , all scenarios could be answered with a approximation.

We now state the scheme for provisioning the sum query; the schemes for the sum and the count queries together can directly provision the average query, which finalizes the proof of Theorem 4.4. We use and extend our CNT-Sketch to -provision the sum query.

We call a sketch created by the above provisioning scheme a SUM-Sketch. Since for every hypothetical we only record the non-empty intervals, by an amortized analysis, the size of the sketch is bits. We now prove the correctness of this scheme.

###### Proof of Theorem 4.4.

For now assume that we do not perform the pruning step (line (2) of the compression phase). For each interval among the intervals, the CNT-Sketch outputs a approximation of the total number of tuples whose weight lies in the interval. Each tuple in this interval will be counted as if it has weight , which is a approximation of the original tuple weight. Therefore, we can output a approximation of the correct sum.

Now consider the original SUM-Sketch with the pruning step. We need to show that the total weight of the discarded tuples is negligible. For each hypothetical , we discard the tuples whose weights are less than , while the largest weight in is at least . Therefore, the total weight of the discarded tuples is at most

 n¯¯¯¯wγ−t≤nwγ(1+ϵ′)⌈log(1+ϵ′)(n/ϵ′)⌉≤ϵ′¯¯¯¯wγ

Since whenever is turned on by a given scenario, the sum of the weights is at least , we lose at most fraction of the total weight by discarding those tuples from . To see why we only lose an fraction over all the hypotheticals (instead of ), note that at most tuples will be discarded in the whole scenario, hence the in the inequality can be amortized over all the hypotheticals.

### 4.2 The Quantiles Query

In this section, we study provisioning of the quantiles query. We again assume a relational schema with just one binary relation containing attributes identifier and weight. For any instance and any tuple , we define the rank of to be the number of tuples in that are smaller than or equal to (in terms of the weights). The output of a quantiles query with a given parameter on an instance is the tuple with rank . Finally, we say a tuple is a -approximation of a quantiles query whose correct answer is , iff the rank of is a -approximation of the rank of .

Similar to the previous section, we first show that the quantiles query admits no compact provisioning scheme for exact answer and then provide a compact -provisioning scheme for this query.

###### Theorem 4.7.

Exact provisioning of the quantiles query even on disjoint hypotheticals requires sketches of size bits.

In the quantiles query, the parameter may be given either already to the compression algorithm or only to the extraction algorithm. The latter yields an immediate lower bound of , since by varying over , one can effectively “reconstruct” the original database. However, we achieve a more interesting lower bound for the case when is given at to the compression algorithm (i.e., a fixed for all scenarios, e.g., setting to find the median). An important property of the lower bound for quantiles is that, in contrast to all other lower bounds for numerical queries in this paper, this lower bound holds even for disjoint hypotheticals444All other numerical queries that we study in this paper can be compactly provisioned for exact answer, when the hypotheticals are disjoint..

###### Proof of Theorem 4.7.

Assume we want to prove the lower bound for any provisioning scheme for answering the median query (quantiles with fixed ).

Let . We show how to encode a bit-string of length into a database with tuples and a set of hypotheticals such that given provisioned sketch of the median query for this instance, one can recover any bit of this string with constant probability. Standard information-theoretic arguments then imply that the sketch size must have bits.

For any vector , define an instance on a relational schema with a binary relation whose second attribute is the weight. Let , where , , and (where is the largest possible value of the weight). The weights of the tuples are ordered “”. The answers of all the scenarios will be in : is the set of “padding” tuples which shifts the median towards , and will be divided into disjoint hypotheticals with different sizes (basically size for the -th hypothetical) where different scenarios over such set of hypotheticals allow us to output all the tuples in .

Formally, define the set of hypotheticals, where for any , with tuples, and . For any set , consider the scenario . Let . It is straightforward to verify that the median tuple of is . By varying from to (using the fact that size of hypotheticals are different powers of two), any tuple in will be outputted and the vector v could be reconstructed.

Note that one can extend this lower bound, by using an approach similar to Theorem 3.1, to provisioning schemes that are allowed a limited access to the original database after being given the scenario (see Section 3 for more details). We omit the details of this proof.

We now turn to prove the main theorem of this section, which argue the existence of a compact scheme for -provisioning the quantiles. We emphasize that the approximation guarantee in the following theorem is multiplicative.

###### Theorem 4.8 (quantiles).

For any , there exists a compact -linear provisioning scheme for the quantiles query that creates a sketch of size bits.

We should note that in this theorem the parameter is only provided in the extraction phase555We emphasize that we gave a lower bound for the easier case in terms of provisioning ( given at compression phase and disjoint hypotheticals), and an upper bound for the harder case ( given at extraction phase and overlapping hypotheticals).. Our starting point is the following simple lemma first introduced by [27].

###### Lemma 4.9 ([27]).

For any list of unique numbers and parameters , let ; for any target rank , if we independently sample each element with probability , then with probability at least , the rank of the -th smallest sampled element is a -approximation of .

The proof of Lemma 4.9 is an standard application of the Chernoff bound and the main challenge for provisioning the quantiles query comes from the fact that hypotheticals overlap. We propose the following scheme which addresses this challenge.