Probabilistic Database Summarization for Interactive Data Exploration

Laurel Orr    Magdalena Balazinska    and Dan Suciu
University of Washington
Seattle, Washington, USA
{ljorr1, magda, suciu}@cs.washington.edu
Abstract

We present a probabilistic approach to generate a small, query-able summary of a dataset for interactive data exploration. Departing from traditional summarization techniques, we use the Principle of Maximum Entropy to generate a probabilistic representation of the data that can be used to give approximate query answers. We develop the theoretical framework and formulation of our probabilistic representation and show how to use it to answer queries. We then present solving techniques and give three critical optimizations to improve preprocessing time and query accuracy. Lastly, we experimentally evaluate our work using a 5 GB dataset of flights within the United States and a 210 GB dataset from an astronomy particle simulation. While our current work only supports linear queries, we show that our technique can successfully answer queries faster than sampling while introducing, on average, no more error than sampling and can better distinguish between rare and nonexistent values.


1 Introduction

Interactive data exploration allows a data analyst to browse, query, transform, and visualize data at “human speed” [7]. It has been long recognized that general-purpose DBMSs are ill suited for interactive exploration [19]. While users require interactive responses, they do not necessarily require precise responses because either the response is used in some visualization, which has limited resolution, or an approximate result is sufficient and can be followed up with a more costly query if needed. Approximate Query Processing (AQP) refers to a set of techniques designed to allow fast but approximate answers to queries. All successful AQP systems to date rely on sampling or a combination of sampling and indexes. The sample can either be computed on-the-fly, e.g., in the highly influential work on online aggregation [12] or systems like DBO [14] and Quickr [16], or precomputed offline, like in BlinkDB [2] or SampleSeek [9]. Samples have the advantage that they are easy to compute, can accurately estimate aggregate values, and are good at detecting heavy hitters. However, sampling may fail to return estimates for small populations; targeted stratified samples can alleviate this shortcoming, but stratified samples need to be precomputed to target a specific query, defeating the original purpose of AQP.

In this paper, we propose an alternative approach to interactive data exploration based on the Maximum Entropy principle (MaxEnt). The MaxEnt model has been applied in many settings beyond data exploration; e.g., the multiplicative weights mechanism [11] is a MaxEnt model for both differentially private and, by [10], statistically valid answers to queries, and it has been shown to be theoretically optimal. In our setting of the MaxEnt model, the data is preprocessed to compute a probabilistic model. Then, queries are answered by doing probabilistic inference on this model. The model is defined as the probabilistic space that obeys some observed statistics on the data and makes no other assumptions (Occam’s principle). The choice of statistics boils down to a precision/memory tradeoff: the more statistics one includes, the more precise the model and the more space required. Once computed, the MaxEnt model defines a probability distribution on possible worlds, and users can interact with this model to obtain approximate query results. Unlike a sample, which may miss rare items, the MaxEnt model can infer something about every query.

Despite its theoretical appeal, the computational challenges associated with the MaxEnt model make it difficult to use in practice. In this paper, we develop the first scalable techniques to compute and use the MaxEnt model, and we illustrate them with interactive data exploration as the application. Our first contribution is to simplify the standard MaxEnt model to a form that is appropriate for data summarization (Sec. 3). We show how to simplify the MaxEnt model to be a multi-linear polynomial that has one monomial for each possible tuple (Sec. 3, Eq. (5)) rather than its naïve form that has one monomial for each possible world (Sec. 2, Eq. (2)). Even with this simplification, the MaxEnt model starts by being larger than the data: for example, the flights dataset is 5 GB, yet a polynomial with one monomial per possible tuple would exceed the size of the data itself. Our first optimization consists of a compression technique for the polynomial of the MaxEnt model (Sec. 4.1); for example, for the flights dataset, the summary is below 200MB, while for our larger dataset of 210GB, it is less than 1GB. Our second optimization consists of a new technique for query evaluation on the MaxEnt model (Sec. 4.2) that only requires setting some variables to 0; this reduces the runtime to be on average below 500ms and always below 1s.

We find that the main bottleneck in using the MaxEnt model is computing the model itself; in other words, computing the values of the variables of the polynomial such that it matches the existing statistics over the data. Solving the MaxEnt model is difficult; prior work for multi-dimensional histograms [18] uses an iterative scaling algorithm for this purpose. To date, it is well understood that the MaxEnt model can be solved by reducing it to the convex optimization problem [23] of maximizing a dual function (Sec. 2), which can be done using Gradient Descent. However, even this is difficult given the size of our model. We adapt a variant of Stochastic Gradient Descent called Mirror Descent [5], which, combined with our optimized query evaluation technique, can compute the MaxEnt model for large datasets in under a day.

In summary, in this paper, we develop the following new techniques:

  • A closed-form representation of the probability space of possible worlds using the Principle of Maximum Entropy, and a method to use the representation to answer queries in expectation (Sec 3).

  • Compression method for the MaxEnt summary (Sec 4.1).

  • Optimized query processing techniques (Sec 4.2).

  • A new method for selecting 2-dimensional statistics based on a modified KD-tree (Sec 4.3).

We implement the above techniques in a prototype system that we call EntropyDB and evaluate it on the flights and astronomy datasets. We find that EntropyDB can answer queries faster than sampling while introducing no more error, on average, and does better at identifying small populations.

2 Background

We summarize data by fitting a probability distribution over the active domain. The distribution assumes that the domain values are distributed in a way that preserves the given statistics over the data but is otherwise uniform.

For example, consider a data scientist who analyzes a dataset of flights in the United States for the month of December 2013. All she knows is that the dataset includes all flights within the 50 possible states and that there are 500,000 flights in total. She wants to know how many of those flights are from CA to NY. Without any extra information, our approach would assume all flights are equally likely and estimate that there are $500{,}000/(50 \times 50) = 200$ such flights.

Now suppose the data scientist finds out that flights leaving CA only go to NY, FL, or WA. This changes the estimate because instead of the estimated $500{,}000/50 = 10{,}000$ flights leaving CA being spread uniformly over all 50 states, those flights are only going to 3 states. Therefore, the estimate becomes $10{,}000/3 \approx 3{,}333$ flights.

This example demonstrates how our summarization technique would answer queries, and the rest of this section covers its theoretical foundation.
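The arithmetic in this example is simple enough to check directly; the following snippet is purely illustrative (it is not part of EntropyDB) and just computes the two estimates above.

# Back-of-the-envelope check of the flights example (illustrative only).
n_flights = 500_000
n_states = 50

# No extra information: flights spread uniformly over the 50 x 50 state pairs.
uniform_estimate = n_flights / (n_states * n_states)   # 200.0 flights CA -> NY

# Knowing CA flights only go to NY, FL, or WA: the ~10,000 CA flights
# spread uniformly over 3 destinations instead of 50.
refined_estimate = (n_flights / n_states) / 3           # ~3333.3 flights CA -> NY

print(uniform_estimate, refined_estimate)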

2.1 Possible World Semantics

To model a probabilistic database, we use the slotted possible world semantics where rows have an inherent unique identifier, meaning the order of the tuples matters. Our set of possible worlds is generated from the active domain and size of each relation. Each database instance is one possible world with an associated probability such that the probabilities of all possible worlds sum to one.

In contrast to typical probabilistic databases where the probability of a relation is calculated from the probability of each tuple, we calculate a relation’s probability from a formula derived from the MaxEnt principle and a set of constraints on the overall distribution. This approach captures the idea that the distribution should be uniform except where otherwise specified by the given constraints.

2.2 The Principle of Maximum Entropy

The Principle of Maximum Entropy (MaxEnt) states that, subject to prior data, the probability distribution which best represents the state of knowledge is the one that has the largest entropy. This means that, given our set of possible worlds $\mathcal{PWD}$, the probability distribution is the one that agrees with the prior information on the data and maximizes the entropy

$H = -\sum_{I \in \mathcal{PWD}} \Pr(I)\,\log \Pr(I)$

where $I$ is a database instance, also called a possible world. The above probability must be normalized, $\sum_{I \in \mathcal{PWD}} \Pr(I) = 1$, and must satisfy the prior information represented by a set of $k$ expected-value constraints:

$\mathbb{E}[\phi_j] = \sum_{I \in \mathcal{PWD}} \Pr(I)\,\phi_j(I) = s_j, \qquad j = 1, \dots, k$   (1)

where each $s_j$ is a known value and each $\phi_j$ is a function on $\mathcal{PWD}$ that returns a numerical value in $\mathbb{R}$. One example constraint is that the number of flights from CA to WI is 0.

Following prior work on the MaxEnt principle and solving constrained optimization problems [4, 23, 20], the MaxEnt probability distribution takes the form

$\Pr(I) = \frac{1}{Z}\,\exp\Big(\sum_{j=1}^{k} \theta_j\,\phi_j(I)\Big)$   (2)

where each $\theta_j$ is a parameter and $Z$ is the following normalization constant:

$Z = \sum_{I \in \mathcal{PWD}} \exp\Big(\sum_{j=1}^{k} \theta_j\,\phi_j(I)\Big).$

To compute the parameters $\theta_1, \dots, \theta_k$, we must solve the non-linear system of equations, Eq. (1), which is computationally difficult. However, it turns out [23] that Eq. (1) is equivalent to $\nabla \Psi = 0$, where the dual $\Psi$ is defined as:

$\Psi \;\stackrel{\text{def}}{=}\; \sum_{j=1}^{k} s_j\,\theta_j - \log Z.$

Furthermore, $\Psi$ is concave, which means solving for the parameters can be achieved by maximizing $\Psi$. We note that $Z$ is called the partition function, and its log, $\log Z$, is called the cumulant.

3 EntropyDB Approach

This section explains how we use the MaxEnt model for approximate query answering. We first show how we use the MaxEnt framework to transform a single relation $R$ into a probability distribution represented by a polynomial $P$. We then explain how we use $P$ to answer queries over $R$.

3.1 Maximum Entropy Model of Data

We consider a single relation with $m$ attributes and schema $R(A_1, \dots, A_m)$, where each attribute, $A_i$, has an active domain $D_i$, assumed to be discrete and ordered (we support continuous data types by bucketizing their active domains). Let $Tup = D_1 \times \cdots \times D_m$ be the set of all possible tuples. Denoting $d_i = |D_i|$, we have $|Tup| = \prod_{i=1}^{m} d_i$; we write $d$ for this number of possible tuples.

An instance for $R$ is an ordered bag of $n$ tuples, denoted $I$. For each $I$, we form its frequency vector $p^I$, which is a $d$-dimensional vector $(p_1, \dots, p_d)$, where each number $p_j$ represents the count of the tuple $t_j \in Tup$ in $I$ (Fig. 1); this is a standard data model in several applications, such as differential privacy [17]. The mapping from $I$ to $p^I$ is not one-to-one because the instance is ordered, and two distinct instances may have the same counts. Further, for any instance $I$ of cardinality $n$, $\sum_{j=1}^{d} p_j = n$. The frequency vector of an instance consisting of a single tuple $t_j$ is denoted $e_j$, a vector with a single value 1 in the $j$th position; i.e., $\{e_1, \dots, e_d\}$ forms a basis for $\mathbb{R}^d$.

While the MaxEnt principle allows us, theoretically, to answer any query probabilistically by averaging the query over all possible instances, in this paper we limit our discussion to linear queries. A linear query is a $d$-dimensional vector $q \in \mathbb{R}^d$. The answer to $q$ on instance $I$ is the dot product $q \cdot p^I$. With some abuse of notation, we will write $q(I)$ when referring to $q \cdot p^I$ and $q(t_j)$ when referring to the $j$th coordinate $q_j$. Notice that $q(t_j) = q \cdot e_j$, and, for any instance $I$, $q(I) = \sum_{j=1}^{d} p_j\, q(t_j)$.

Fig. 1 illustrates the data and query model. Any counting query is a vector where all coordinates are 0 or 1 and can be equivalently defined by a predicate $\pi$ such that $q(t_j) = 1$ iff $t_j$ satisfies $\pi$; with more abuse, we will use $\pi$ instead of $q$ when referring to a counting query. Other SQL queries can be modeled using linear queries, too. For example, SELECT A, COUNT(*) AS cnt FROM R GROUP BY A ORDER BY cnt DESC LIMIT 10 corresponds to several linear queries, one for each group, where the outputs are sorted and the top 10 returned.

Figure 1: Illustration of the data and query model: the active domains, a database instance with $n = 5$ tuples over $m = 2$ attributes, its frequency vector, and the counting query q: SELECT COUNT(*) FROM R WHERE A = a1.
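To make the data and query model concrete, here is a minimal Python sketch (all relation, attribute, and value names are illustrative, not taken from the paper) that builds the frequency vector of a small instance and answers a counting query as a dot product.

import itertools

# Active domains of a toy relation R(A, B) (illustrative values only).
D_A = ["a1", "a2"]
D_B = ["b1", "b2", "b3"]
Tup = list(itertools.product(D_A, D_B))      # all d = 2 * 3 = 6 possible tuples

# An instance I is an ordered bag of n tuples.
I = [("a1", "b1"), ("a1", "b1"), ("a1", "b3"), ("a2", "b2"), ("a2", "b2")]

# Frequency vector p^I: one count per possible tuple.
p = [sum(1 for t in I if t == u) for u in Tup]

# A counting query is a 0/1 vector defined by a predicate, e.g., A = 'a1'.
q = [1 if u[0] == "a1" else 0 for u in Tup]

# The answer is the dot product q . p^I.
print(sum(qi * pi for qi, pi in zip(q, p)))  # 3 tuples satisfy A = 'a1'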

Our goal is to compute a summary of the data that is small yet allows us to approximately compute the answer to any linear query. We assume that the cardinality $n$ of $I$ is fixed and known. In addition, we know $k$ statistics, $\Phi = \{(\phi_j, s_j) : j = 1, \dots, k\}$, where each $\phi_j$ is a linear query and $s_j$ is a number. Intuitively, the statistic asserts that $\phi_j(I) = s_j$. Writing $\#\sigma_\pi(I)$ for the number of tuples of $I$ that satisfy a predicate $\pi$, we can express 1-dimensional and 2-dimensional (2D) statistics like $\#\sigma_{A_1 = a}(I) = s_1$ and $\#\sigma_{A_1 = a \wedge A_2 = b}(I) = s_2$.

Next, we derive the MaxEnt distribution for the possible instances of a fixed size $n$. We replace the exponential parameters $e^{\theta_j}$ with $\alpha_j$ so that Eq. (2) becomes

$\Pr(I) = \frac{1}{Z}\,\prod_{j=1}^{k} \alpha_j^{\phi_j(I)}$   (3)

We prove the following about the structure of the partition function $Z$:

Lemma

The partition function is given by

$Z = P^n$   (4)

where $P$ is the multi-linear polynomial

$P(\alpha_1, \dots, \alpha_k) \;=\; \sum_{t \in Tup}\ \prod_{j=1}^{k} \alpha_j^{\phi_j(t)}.$   (5)

Proof. Fix any frequency vector $(p_1, \dots, p_d)$ such that $\sum_{i=1}^{d} p_i = n$. The number of instances $I$ of cardinality $n$ with this frequency vector is the multinomial coefficient $\binom{n}{p_1, \dots, p_d}$. Furthermore, for each such instance, $\prod_j \alpha_j^{\phi_j(I)} = \prod_{i=1}^{d}\big(\prod_j \alpha_j^{\phi_j(t_i)}\big)^{p_i}$. Therefore, by the multinomial theorem,

$Z \;=\; \sum_{p_1 + \cdots + p_d = n} \binom{n}{p_1, \dots, p_d}\ \prod_{i=1}^{d}\Big(\prod_{j} \alpha_j^{\phi_j(t_i)}\Big)^{p_i} \;=\; P^n.$

The data summary consists of the polynomial $P$ (Eq. (5)) and the values of its variables $\alpha_1, \dots, \alpha_k$; the polynomial is defined by the linear queries $\phi_j$ in the statistics $\Phi$, and the parameters are computed from the numerical values $s_j$.

Example 3.1

Consider a relation with three attributes $A$, $B$, $C$, and assume that the domain of each attribute has 2 distinct elements: $\{a_1, a_2\}$, $\{b_1, b_2\}$, and $\{c_1, c_2\}$, respectively. Assume the instance cardinality is $n$ and the only statistics in $\Phi$ are the following 1-dimensional statistics:

$\#\sigma_{A=a_1}(I) = s_1,\ \ \#\sigma_{A=a_2}(I) = s_2,\ \ \#\sigma_{B=b_1}(I) = s_3,\ \ \#\sigma_{B=b_2}(I) = s_4,\ \ \#\sigma_{C=c_1}(I) = s_5,\ \ \#\sigma_{C=c_2}(I) = s_6.$

The first statistic asserts that the number of tuples with $A = a_1$ is $s_1$, etc. The polynomial is

$P = \alpha_1\beta_1\gamma_1 + \alpha_1\beta_1\gamma_2 + \alpha_1\beta_2\gamma_1 + \alpha_1\beta_2\gamma_2 + \alpha_2\beta_1\gamma_1 + \alpha_2\beta_1\gamma_2 + \alpha_2\beta_2\gamma_1 + \alpha_2\beta_2\gamma_2$

where $\alpha_1, \alpha_2$ are variables associated with the statistics on $A$, $\beta_1, \beta_2$ are for $B$, and $\gamma_1, \gamma_2$ are for $C$ (we abuse notation here for readability; technically, $\beta_1 = \alpha_3$, $\beta_2 = \alpha_4$, $\gamma_1 = \alpha_5$, and $\gamma_2 = \alpha_6$).

Consider a concrete instance $I$ in which, say, the tuple $(a_1, b_1, c_1)$ occurs 9 times. Each of these occurrences contributes the factor $\alpha_1\beta_1\gamma_1$ to the numerator of Eq. (3), so $\Pr(I)$ contains the factor $(\alpha_1\beta_1\gamma_1)^9$ in its numerator and is normalized by $P^n$.
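The polynomial in this example is small enough to evaluate directly. The following sketch (illustrative values for the variables; a real summary would solve for them as in Sec. 3.3) evaluates $P$ and the probability of an instance via Eq. (3).

import itertools

# Variables of the 1D statistics in Example 3.1 (illustrative values only).
alpha = {"a1": 1.3, "a2": 0.7}   # statistics on A
beta  = {"b1": 0.9, "b2": 1.1}   # statistics on B
gamma = {"c1": 1.0, "c2": 1.0}   # statistics on C

# P is a sum of one monomial per possible tuple (Eq. (5)).
def eval_P():
    return sum(alpha[a] * beta[b] * gamma[c]
               for a, b, c in itertools.product(alpha, beta, gamma))

# Probability of an instance I of size n (Eq. (3)): each tuple occurrence
# contributes its monomial to the numerator, and Z = P^n.
def prob(instance):
    numerator = 1.0
    for a, b, c in instance:
        numerator *= alpha[a] * beta[b] * gamma[c]
    return numerator / eval_P() ** len(instance)

I = [("a1", "b1", "c1")] * 9 + [("a2", "b2", "c2")]
print(eval_P(), prob(I))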

Example 3.2

Continuing the previous example, we add multi-dimensional statistics to $\Phi$; say, $\#\sigma_{A=a_1 \wedge B=b_1}(I) = s_7$ and $\#\sigma_{B=b_2 \wedge C=c_2}(I) = s_8$, with associated variables $\delta_{AB}$ and $\delta_{BC}$.

$P$ is now

$P = \alpha_1\beta_1\gamma_1\delta_{AB} + \alpha_1\beta_1\gamma_2\delta_{AB}$
$\quad + \alpha_1\beta_2\gamma_1 + \alpha_1\beta_2\gamma_2\delta_{BC}$
$\quad + \alpha_2\beta_1\gamma_1 + \alpha_2\beta_1\gamma_2$
$\quad + \alpha_2\beta_2\gamma_1 + \alpha_2\beta_2\gamma_2\delta_{BC}$   (6)

The $\delta$ variables are the added 2-dimensional statistic variables; we use $\delta_{XY}$ to denote a single variable corresponding to a 2D statistic on the attributes $X$ and $Y$. Notice that each $\delta$ variable only occurs together with its related 1-dimensional variables. $\delta_{AB}$, for example, appears only in terms that also contain $\alpha_1$ and $\beta_1$.

Now consider the earlier instance $I$. Each occurrence of the tuple $(a_1, b_1, c_1)$ now contributes the factor $\alpha_1\beta_1\gamma_1\delta_{AB}$, so the corresponding factor in $\Pr(I)$ becomes $(\alpha_1\beta_1\gamma_1\delta_{AB})^9$.

To facilitate analytical queries, we choose the set of statistics as follows:

  • Each statistic $\phi_j \in \Phi$ is associated with some predicate $\pi_j$ such that $\phi_j(I) = \#\sigma_{\pi_j}(I)$. It follows that for every tuple $t$, $\phi_j(t)$ is either 0 or 1; therefore, each variable $\alpha_j$ has degree 1 in the polynomial $P$ in Eq. (5).

  • For each domain $D_i$, we include a complete set of 1-dimensional statistics in our summary. In other words, for each value $a \in D_i$, $\Phi$ contains one statistic with predicate $A_i = a$. We denote by $C_i$ the set of indices of the 1-dimensional statistics associated with $D_i$; therefore, $|C_i| = d_i$.

  • We allow multi-dimensional statistics to be given by arbitrary predicates. They may be overlapping and/or incomplete; e.g., one statistic may count the tuples satisfying $A_1 = a \wedge A_2 = b$ and another count the tuples satisfying $A_2 = b \wedge A_3 = c$.

  • We assume the number of 1-dimensional statistics dominates the number of attribute combinations used by multi-dimensional statistics; i.e., $\sum_{i} d_i$ is much larger than the number of distinct attribute sets on which multi-dimensional statistics are defined.

  • If some domain $D_i$ is large, it is beneficial to reduce the size of the domain using equi-width buckets. In that case, we assume the elements of $D_i$ represent buckets, and $d_i$ is the number of buckets.

  • We enforce our MaxEnt distribution to be overcomplete [23, pp. 40] (as opposed to minimal). More precisely, for any attribute $A_i$ and any instance $I$, we have $\sum_{j \in C_i} \phi_j(I) = n$, which means that some statistics are redundant since they can be computed from the others and from the size $n$ of the instance $I$.

Note that as a consequence of overcompleteness, for any attribute $A_i$, one can write $P$ as a linear expression

$P \;=\; \sum_{j \in C_i} \alpha_j\, P_j$   (7)

where each $P_j$, $j \in C_i$, is a polynomial that does not contain any of the variables $(\alpha_j)_{j \in C_i}$. In Examples 3.1 and 3.2, the 1-dimensional variables for $A$ are $\alpha_1, \alpha_2$, and indeed, each monomial in Eq. (6) contains exactly one of these variables. One can write Eq. (6) as $P = \alpha_1 P_1 + \alpha_2 P_2$ where $P_1$ represents the first two lines and $P_2$ represents the last two lines in Eq. (6). $P$ is also linear in $\beta_1, \beta_2$, and in $\gamma_1, \gamma_2$.

3.2 Query Answering

In this section, we show how to use the data summary to approximately answer a linear query $q$ by returning its expected value $\mathbb{E}[q]$. The summary (the polynomial $P$ and the values of its variables $\alpha_1, \dots, \alpha_k$) uniquely defines a probability space on the possible worlds (Eq. (3) and (5)). We start with a well known result in the MaxEnt model. If $\phi_j$ is the linear query associated with the variable $\alpha_j$, then

$\mathbb{E}[\phi_j] \;=\; n\,\frac{\alpha_j}{P}\,\frac{\partial P}{\partial \alpha_j}.$   (8)

We review the proof here. The expected value of $\phi_j$ over the probability space (Eq. (3)) is

$\mathbb{E}[\phi_j] = \sum_{I}\Pr(I)\,\phi_j(I) = \frac{1}{Z}\sum_{I}\phi_j(I)\prod_{\ell}\alpha_\ell^{\phi_\ell(I)} = \frac{\alpha_j}{Z}\,\frac{\partial Z}{\partial \alpha_j} = \frac{\alpha_j}{P^n}\, n\, P^{n-1}\,\frac{\partial P}{\partial \alpha_j} = n\,\frac{\alpha_j}{P}\,\frac{\partial P}{\partial \alpha_j}.$

To compute a new linear query $q$, we add it to the statistical queries $\Phi$, associate it with a fresh variable $\beta$, and denote $P_q$ the extended polynomial:

$P_q(\alpha_1, \dots, \alpha_k, \beta) \;=\; \sum_{t \in Tup}\ \beta^{q(t)}\,\prod_{j=1}^{k}\alpha_j^{\phi_j(t)}.$   (9)

Notice that $P_q|_{\beta=1} = P$; therefore, the extended data summary with $\beta = 1$ defines the same probability space as $P$. We can apply Eq. (8) to the query $q$ to derive:

$\mathbb{E}[q] \;=\; n\,\frac{\beta}{P_q}\,\frac{\partial P_q}{\partial \beta}\bigg|_{\beta=1} \;=\; \frac{n}{P}\,\frac{\partial P_q}{\partial \beta}\bigg|_{\beta=1}.$   (10)

This leads to the following naïve strategy for computing the expected value of $q$: extend $P$ to obtain $P_q$ and apply formula Eq. (10). One way to obtain $P_q$ is to iterate over all monomials in $P$ and add the variable $\beta$ to the monomials corresponding to tuples counted by $q$. As this is inefficient, Sec. 4.2 describes how to avoid modifying the polynomial altogether.
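For intuition, the naïve strategy can be rendered directly on the toy polynomial of Example 3.1: the derivative $\partial P_q/\partial\beta$ at $\beta = 1$ is just the sum of the monomials of the tuples counted by $q$. The sketch below uses illustrative values only and is not EntropyDB code.

import itertools

# 1D statistic variables for the three attributes (illustrative values).
A_vars = {"a1": 1.3, "a2": 0.7}
B_vars = {"b1": 0.9, "b2": 1.1}
C_vars = {"c1": 1.0, "c2": 1.0}
n = 10  # instance cardinality

def monomial(t):
    a, b, c = t
    return A_vars[a] * B_vars[b] * C_vars[c]

tuples = list(itertools.product(A_vars, B_vars, C_vars))
P = sum(monomial(t) for t in tuples)

def expected_count(pi):
    """Naive E[q] for the counting query with predicate pi (Eq. (10)):
    dP_q/dbeta at beta = 1 is the sum of the monomials of tuples satisfying pi."""
    dPq_dbeta = sum(monomial(t) for t in tuples if pi(t))
    return n * dPq_dbeta / P

# E[ number of tuples with A = a1 and B = b1 ]
print(expected_count(lambda t: t[0] == "a1" and t[1] == "b1"))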

3.3 Probabilistic Model Computation

We now describe how to compute the parameters of the summary. Given the statistics $(\phi_j, s_j)$, $j = 1, \dots, k$, we need to find values of the variables $\alpha_j$ such that $\mathbb{E}[\phi_j] = s_j$ for all $j$. As explained in Sec. 2, this is equivalent to maximizing the dual function $\Psi$, which, substituting $\theta_j = \log\alpha_j$ and $Z = P^n$, becomes:

$\Psi \;=\; \sum_{j=1}^{k} s_j\,\log\alpha_j \;-\; n\,\log P.$   (11)

Indeed, maximizing $\Psi$ reduces to solving the equations $\partial\Psi/\partial\alpha_j = 0$ for all $j$. Direct calculation gives us $\partial\Psi/\partial\alpha_j = s_j/\alpha_j - n\,(\partial P/\partial\alpha_j)/P$, so this condition is equivalent to $s_j = \mathbb{E}[\phi_j]$ by Eq. (8). The dual function is concave, and hence it has a single maximum value that can be obtained using convex optimization techniques such as Gradient Descent.

In particular, we achieve the fastest convergence rates using a variant of Stochastic Gradient Descent (SGD) called Mirror Descent [5], where each iteration chooses some $j$ and updates $\alpha_j$ by solving $\partial\Psi/\partial\alpha_j = 0$ while keeping all other parameters fixed. In other words, each step is chosen to solve $s_j = n\,\alpha_j\,(\partial P/\partial\alpha_j)/P$. Writing $P = \alpha_j\,\partial P/\partial\alpha_j + P|_{\alpha_j=0}$ and solving for $\alpha_j$, we obtain:

$\alpha_j \;=\; \frac{s_j \cdot P|_{\alpha_j=0}}{(n - s_j)\cdot \partial P/\partial\alpha_j}.$   (12)

Since $P$ is linear in each $\alpha_j$, neither $\partial P/\partial\alpha_j$ nor $P|_{\alpha_j=0}$ contains the variable $\alpha_j$.

We repeat this for all $j = 1, \dots, k$, and continue this process until all differences $|\mathbb{E}[\phi_j] - s_j|$, $j = 1, \dots, k$, are below some threshold. Algorithm 1 shows pseudocode for the solving process.

maxError = infinity
while maxError >= threshold do
  maxError = -1
  for each alpha_j do
      value = (s_j * P[alpha_j := 0]) / ((n - s_j) * dP/dalpha_j)    (Eq. (12))
      alpha_j = value
      error = |n * alpha_j * (dP/dalpha_j) / P - s_j|                (|E[phi_j] - s_j|)
      maxError = max(error, maxError)
Algorithm 1 Solving for the alpha_j's
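Below is a minimal Python sketch of Algorithm 1 under the assumptions above. The function eval_P stands in for an oracle that evaluates the (compressed) polynomial at a given assignment; the toy statistics and all numbers are illustrative, not taken from the paper.

import itertools

def solve_alphas(eval_P, stats, n, threshold=1e-6, max_iters=1000):
    """Mirror-descent style coordinate updates for the alpha parameters.

    eval_P(assign): value of the polynomial P at the assignment {var: value}.
    stats: dict mapping each variable name to its target statistic s_j.
    n: instance cardinality.
    """
    assign = {j: 1.0 for j in stats}              # start from the uniform model
    for _ in range(max_iters):
        max_error = 0.0
        for j, s_j in stats.items():
            # P is linear in alpha_j: P = alpha_j * D + P0.
            P0 = eval_P({**assign, j: 0.0})       # P with alpha_j set to 0
            D = eval_P({**assign, j: 1.0}) - P0   # dP/dalpha_j
            assign[j] = (s_j * P0) / ((n - s_j) * D)     # Eq. (12)
            # Current expectation E[phi_j] = n * alpha_j * D / P (Eq. (8)).
            P_val = assign[j] * D + P0
            max_error = max(max_error, abs(n * assign[j] * D / P_val - s_j))
        if max_error < threshold:
            break
    return assign

# Toy polynomial over two binary attributes with 1D variables a1, a2, b1, b2.
def eval_P(v):
    return sum(v[a] * v[b] for a, b in itertools.product(("a1", "a2"), ("b1", "b2")))

# Target 1D statistics for an instance of n = 100 tuples.
stats = {"a1": 70.0, "a2": 30.0, "b1": 40.0, "b2": 60.0}
print(solve_alphas(eval_P, stats, n=100))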

4 Optimizations

We now discuss three optimizations: (1) summary compression in Sec. 4.1, (2) optimized query processing in Sec. 4.2, and (3) selection of statistics in Sec. 4.3.

4.1 Compression of the Data Summary

The summary consists of the polynomial $P$ that, by definition, has $d = \prod_{i=1}^{m} d_i$ monomials, where $d_i = |D_i|$. We describe a technique that compresses the summary to a size closer to $\sum_{i=1}^{m} d_i$.

We start by walking through an example with three attributes, $A$, $B$, and $C$, each with an active domain of size $N$, with 1D variables $\alpha_i$, $\beta_j$, and $\gamma_k$ respectively. Suppose first that we have only 1D statistics. Then, instead of representing $P$ as a sum of $N^3$ monomials, $P = \sum_{i,j,k}\alpha_i\beta_j\gamma_k$, we factorize it to $P = (\sum_i\alpha_i)(\sum_j\beta_j)(\sum_k\gamma_k)$; the new representation has size $3N$.

Now, suppose we add a single 3D statistic on $(A, B, C)$: $\#\sigma_{A=a_1 \wedge B=b_1 \wedge C=c_1}(I) = s$. The new variable, call it $\delta$, occurs in a single monomial of $P$, namely $\alpha_1\beta_1\gamma_1\delta$. Thus, we can compress $P$ to $(\sum_i\alpha_i)(\sum_j\beta_j)(\sum_k\gamma_k) + (\delta - 1)\,\alpha_1\beta_1\gamma_1$.

Instead, suppose we add a single 2D range statistic on $(A, B)$, say $\#\sigma_{A\in[a_1,a_2] \wedge B\in[b_1,b_2]}(I) = s$, and call its associated variable $\delta_1$. This will affect $2 \cdot 2 \cdot N$ monomials. We can avoid enumerating them by noting that they, too, factorize. The polynomial compresses to $(\sum_i\alpha_i)(\sum_j\beta_j)(\sum_k\gamma_k) + (\delta_1 - 1)(\alpha_1+\alpha_2)(\beta_1+\beta_2)(\sum_k\gamma_k)$.
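As a sanity check, the following sketch (illustrative domain size, ranges, and values; not EntropyDB code) verifies numerically that the factorized form with the $(\delta_1 - 1)$ correction equals the expanded sum-of-products form for this single 2D range statistic.

import itertools, random

N = 4  # active domain size for each of A, B, C (illustrative)
random.seed(0)
a = [random.random() for _ in range(N)]   # alpha_1 .. alpha_N
b = [random.random() for _ in range(N)]   # beta_1 .. beta_N
c = [random.random() for _ in range(N)]   # gamma_1 .. gamma_N
delta1 = random.random()                  # variable of the 2D range statistic

def in_range(i, j):
    # 2D range statistic on (A, B): A in {a_1, a_2} and B in {b_1, b_2}.
    return i < 2 and j < 2

# Expanded form: one monomial per tuple; in-range monomials also carry delta1.
expanded = sum(a[i] * b[j] * c[k] * (delta1 if in_range(i, j) else 1.0)
               for i, j, k in itertools.product(range(N), repeat=3))

# Compressed form: full product plus a (delta1 - 1) correction for the rectangle.
compressed = (sum(a) * sum(b) * sum(c)
              + (delta1 - 1) * (a[0] + a[1]) * (b[0] + b[1]) * sum(c))

print(abs(expanded - compressed) < 1e-9)  # True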

Finally, suppose we have three 2D statistics: the previous one on $(A, B)$, now called $\sigma_1$, plus the statistics $\sigma_2 = \#\sigma_{B\in[b_2,b_3] \wedge C\in[c_1,c_2]}(I)$ and $\sigma_3 = \#\sigma_{B\in[b_4,b_5] \wedge C\in[c_1,c_2]}(I)$ on $(B, C)$. Their associated variables are $\delta_1$, $\delta_2$, and $\delta_3$. Now we need to account for the fact that some monomials (those of tuples with $A\in[a_1,a_2]$, $B=b_2$, and $C\in[c_1,c_2]$) contain both $\delta_1$ and $\delta_2$. Applying the inclusion/exclusion principle, $P$ compresses to the following (we refer to line (13) as part (i) and to lines (14)-(15) as part (ii); these labels are referenced later).

$P = \big(\sum_i\alpha_i\big)\big(\sum_j\beta_j\big)\big(\sum_k\gamma_k\big)$   (13)
$\quad + (\delta_1-1)(\alpha_1+\alpha_2)(\beta_1+\beta_2)\big(\sum_k\gamma_k\big) + (\delta_2-1)\big(\sum_i\alpha_i\big)(\beta_2+\beta_3)(\gamma_1+\gamma_2) + (\delta_3-1)\big(\sum_i\alpha_i\big)(\beta_4+\beta_5)(\gamma_1+\gamma_2)$   (14)
$\quad + (\delta_1-1)(\delta_2-1)(\alpha_1+\alpha_2)\,\beta_2\,(\gamma_1+\gamma_2)$   (15)

The size, counting only the $\alpha$s, $\beta$s, and $\gamma$s for simplicity, is $6N + 17$; linear in $N$, compared with the $N^3$ monomials of the expanded sum-of-products form.

Before proving the general formula for the compressed form of $P$, note that this compression is related to standard algebraic factorization techniques involving kernel extraction and rectangle coverings [13]; both techniques reduce the size of a polynomial by factoring out divisors. The standard techniques, however, are unsuitable for our use because they require enumeration of the product terms in the sum-of-products (SOP) polynomial to extract kernels and form cube matrices. Our polynomial in SOP form is too large to be materialized, making these techniques infeasible. It is future work to investigate other factorization techniques geared towards massive polynomials.

We now make the following three assumptions for the rest of the paper.

  • Each predicate $\pi_j$ has the form $\pi_j = \pi_j^1 \wedge \cdots \wedge \pi_j^m$, where $m$ is the number of attributes and $\pi_j^i$ is the projection of $\pi_j$ onto attribute $A_i$. If $\pi_j$ does not constrain $A_i$, then $\pi_j^i = \text{true}$. For any set $S$ of indices of multi-dimensional statistics, we denote $\pi_S^i = \bigwedge_{j\in S}\pi_j^i$ and $\pi_S = \bigwedge_{j\in S}\pi_j = \bigwedge_{i=1}^{m}\pi_S^i$; as usual, when $S = \emptyset$, then $\pi_S = \text{true}$.

  • Each $\pi_j^i$ is a range predicate of the form $A_i \in [u, v]$.

  • For each set of attributes $X \subseteq \{A_1, \dots, A_m\}$, the multi-dimensional statistics whose attributes are exactly those in $X$ are disjoint; i.e., for any two such statistics $\phi_j$ and $\phi_{j'}$ with $j \ne j'$, $\pi_j \wedge \pi_{j'} = \text{false}$.

Using this, define $\mathcal{S}_X$, for $X \in \mathcal{P}(\{A_1, \dots, A_m\})$ (where $\mathcal{P}(\cdot)$ denotes the power set), to be the set of sets of multi-dimensional statistics whose combined attributes are exactly $X$ and whose intersection is non-empty (i.e., not false). In other words, for each $S \in \mathcal{S}_X$, $\pi_S \ne \text{false}$ and the attributes constrained by the statistics in $S$, taken together, are exactly $X$.

For example, suppose we have the three 2D statistics from before: $\sigma_1$, $\sigma_2$, and $\sigma_3$. Then, $\mathcal{S}_{\{A,B\}} = \{\{\sigma_1\}\}$ and $\mathcal{S}_{\{B,C\}} = \{\{\sigma_2\},\{\sigma_3\}\}$. Further, $\{\sigma_1,\sigma_2\} \in \mathcal{S}_{\{A,B,C\}}$ because $\pi_{\sigma_1}\wedge\pi_{\sigma_2} \ne \text{false}$ (the two $B$ ranges overlap at $b_2$). However, $\{\sigma_1,\sigma_3\} \notin \mathcal{S}_{\{A,B,C\}}$ because $\pi_{\sigma_1}\wedge\pi_{\sigma_3} = \text{false}$. Using these definitions, we now give the compression.

Theorem 4.1

The polynomial $P$ is equivalent to:

$P \;=\; \sum_{X \in \mathcal{P}(\{A_1,\dots,A_m\})}\ \sum_{S \in \mathcal{S}_X}\ \Big(\prod_{j\in S}(\delta_j - 1)\Big)\ \prod_{i=1}^{m}\Big(\sum_{\ell \in C_i:\ (A_i = a_\ell)\wedge\pi_S^i \ne \text{false}} \alpha_\ell\Big)$

where, for $X = \emptyset$, $\mathcal{S}_\emptyset = \{\emptyset\}$ and the empty product $\prod_{j\in\emptyset}(\delta_j-1)$ equals 1.

The proof uses induction on the number of multi-dimensional statistics, but we omit it for lack of space.

To give intuition, when $S = \emptyset$, we get the product of the sums of the 1D variables, $\prod_i(\sum_{\ell\in C_i}\alpha_\ell)$, because the factor $\prod_{j\in S}(\delta_j - 1)$ equals 1. When $X$ is not empty, the formula has one summand for each set $S$ of multi-dimensional statistics whose attributes are $X$ and whose intersection is non-empty. For each such $S$, the summand sums up, for every attribute $A_i$, all 1-dimensional variables $\alpha_\ell$, $\ell \in C_i$, whose value $a_\ell$ is in the $i$th projection of the predicate $\pi_S$ (this is what the condition $(A_i = a_\ell)\wedge\pi_S^i \ne \text{false}$ checks) and multiplies the result with the terms $(\delta_j - 1)$ for $j \in S$.

At a high level, our algorithm computes the compressed representation of $P$ by first computing the summand for $S = \emptyset$ by iterating over all 1-dimensional statistics. It then iterates over the multi-dimensional statistics and builds a map from each attribute set $X$ to the sets $S \in \mathcal{S}_X$ defined over $X$. It then iteratively loops over this map, taking the cross product of pairs of entries, $S$ and $S'$, to see if any new set $S \cup S'$ with a non-empty intersection can be generated. If so, $S \cup S'$ is added to the map under its combined attribute set. Once done, it iterates over the keys in this map to build the summands for each $X$.
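A sketch of this map-building step is shown below, with each multi-dimensional statistic represented as a dict from attribute name to a closed integer range; all names and ranges are illustrative, and the real system's data structures may differ.

from itertools import combinations

def ranges_intersect(r1, r2):
    """Combine two predicates given as {attr: (lo, hi)}; return None if empty."""
    out = dict(r1)
    for attr, (lo2, hi2) in r2.items():
        if attr in out:
            lo, hi = out[attr]
            lo, hi = max(lo, lo2), min(hi, hi2)
            if lo > hi:
                return None                    # empty intersection
            out[attr] = (lo, hi)
        else:
            out[attr] = (lo2, hi2)
    return out

def build_statistic_sets(stats):
    """Map each attribute set X to the sets S of statistics with a non-empty
    intersection whose combined attributes are exactly X (the sets S_X)."""
    # Seed with singletons {sigma_j}.
    entries = {frozenset([j]): dict(pred) for j, pred in stats.items()}
    # Close under pairwise combination while new non-empty intersections appear.
    changed = True
    while changed:
        changed = False
        for S1, S2 in combinations(list(entries), 2):
            S = S1 | S2
            if S in entries:
                continue
            inter = ranges_intersect(entries[S1], entries[S2])
            if inter is not None:
                entries[S] = inter
                changed = True
    # Group the sets S by their combined attribute set X.
    by_attrs = {}
    for S, pred in entries.items():
        by_attrs.setdefault(frozenset(pred), set()).add(S)
    return by_attrs

# Three 2D statistics as in the running example.
stats = {
    "s1": {"A": (1, 2), "B": (1, 2)},   # sigma_1 on (A, B)
    "s2": {"B": (2, 3), "C": (1, 2)},   # sigma_2 on (B, C), overlaps sigma_1 on B
    "s3": {"B": (4, 5), "C": (1, 2)},   # sigma_3 on (B, C), disjoint from both
}
for attrs, sets in build_statistic_sets(stats).items():
    print(sorted(attrs), [sorted(S) for S in sets])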

The algorithm can be used during query answering to compute the compressed representation of $P_q$ from $P$ (Sec. 3.2) by rebuilding part (ii) for the statistics extended with the query $q$. However, as this is inefficient and may increase the size of our polynomial, our system performs query answering differently, as explained in the next section.

We now analyze the size of the compressed polynomial $P$. Let $B$ denote the number of attribute sets $X \ne \emptyset$ with non-empty $\mathcal{S}_X$; i.e., the number of unique multi-dimensional attribute sets. The summand for $X = \emptyset$, part (i) of the compression, has size $\sum_i d_i$. Part (ii) of the compression is more complex. For some $X \ne \emptyset$ and some $S \in \mathcal{S}_X$, the summand has size $|S| + \sum_i d_i$; as $|S|$ is bounded by the number of multi-dimensional statistics, which by assumption is dominated by $\sum_i d_i$, the summand is only $O(\sum_i d_i)$. Putting it together, for some $X \ne \emptyset$, the size of its contribution is $O(|\mathcal{S}_X| \cdot \sum_i d_i)$.

$|\mathcal{S}_X|$ is the number of sets of multi-dimensional statistics whose combined attributes are $X$ and whose intersection is non-empty. A way to think about this is that each attribute in $X$ defines a dimension in $|X|$-dimensional space, and each statistic defines a rectangle in this hyper-space. This means $|\mathcal{S}_X|$ is the number of rectangle coverings defined by the statistics over $X$. If we denote $V = \max_{X\ne\emptyset}|\mathcal{S}_X|$, then the size of the summand for any $X$ is $O(V \sum_i d_i)$.

Further, although there are $2^m$ possible attribute sets $X$, $\mathcal{S}_X$ is non-empty for only $B + 1$ of them (the 1 is from $X = \emptyset$). Therefore, the size of the compression is $O(B\,V\,\sum_i d_i)$.

Theorem 4.2

The size of the compressed polynomial $P$ is $O(B\,V\,\sum_{i=1}^{m} d_i)$, where $B$ is the number of unique multi-dimensional attribute sets and $V$ is the largest number of rectangle coverings defined by the statistics over some attribute set $X$.

In the worst case, if one gathers all possible multi-dimensional statistics, this compression will be worse than the uncompressed polynomial, which is of size $\prod_i d_i$. However, in practice, $B$ is small, and $V$ depends on the number and type of statistics collected; the result is a significant reduction of polynomial size to one closer to $\sum_i d_i$ (see Fig. 4 discussion).

4.2 Optimized Query Answering

In this section, we assume that the query $q$ is a counting query defined by a conjunction of predicates, one over each attribute $A_i$; i.e., $q = \#\sigma_\pi(I)$, where

$\pi = \pi^1 \wedge \pi^2 \wedge \cdots \wedge \pi^m$   (16)

and $\pi^i$ is a predicate over the attribute $A_i$. If $\pi$ ignores attribute $A_i$, then we simply set $\pi^i = \text{true}$. Our goal is to compute $\mathbb{E}[q]$. In Sec. 3.2, we described a direct approach that consists of constructing a new polynomial $P_q$ and returning Eq. (10). However, as described in Sec. 3.2 and Sec. 4.1, this may be expensive.

We describe here an optimized approach to compute $\mathbb{E}[q]$ directly from $P$. The advantage of this method is that it does not require any restructuring or rebuilding of the polynomial. Instead, it can use any optimized oracle for evaluating $P$ on given inputs. Our optimization has two parts: a new formula for $\mathbb{E}[q]$ and a new formula for the derivatives it needs.

New formula for $\mathbb{E}[q]$: Let $\pi_j$ be the predicate associated with the $j$th statistical query; in other words, $\phi_j = \#\sigma_{\pi_j}(I)$. The next lemma applies to any query $q$ defined by some predicate $\pi$. Recall that $\beta$ is the new variable associated with $q$ in $P_q$ (Sec. 3.2).

Lemma

For any variables $\alpha_{j_1}, \dots, \alpha_{j_p}$ of $P_q$:

(1) If the logical implication $\pi_{j_1} \wedge \cdots \wedge \pi_{j_p} \Rightarrow \pi$ holds, then

$\alpha_{j_1}\cdots\alpha_{j_p}\,\frac{\partial^p P_q}{\partial\alpha_{j_1}\cdots\partial\alpha_{j_p}} \;=\; \beta\,\frac{\partial}{\partial\beta}\Big(\alpha_{j_1}\cdots\alpha_{j_p}\,\frac{\partial^p P_q}{\partial\alpha_{j_1}\cdots\partial\alpha_{j_p}}\Big).$   (17)

(2) If the logical equivalence $\pi_{j_1} \wedge \cdots \wedge \pi_{j_p} \Leftrightarrow \pi$ holds, then

$\alpha_{j_1}\cdots\alpha_{j_p}\,\frac{\partial^p P_q}{\partial\alpha_{j_1}\cdots\partial\alpha_{j_p}} \;=\; \beta\,\frac{\partial P_q}{\partial\beta}.$   (18)
Proof.

(1) The proof is immediate by noting that every monomial of $P_q$ that contains all the variables $\alpha_{j_1}, \dots, \alpha_{j_p}$ also contains $\beta$ (its tuple satisfies $\pi_{j_1}\wedge\cdots\wedge\pi_{j_p}$ and therefore $\pi$); hence, all monomials on the LHS of Eq. (17) contain $\beta$ with degree 1 and thus remain unaffected by applying the operator $\beta\,\partial/\partial\beta$.

(2) From item (1), we derive Eq. (17); we prove now that the RHS of Eq. (17) equals $\beta\,\partial P_q/\partial\beta$. We apply item (1) again to the implication $\pi \wedge \pi_{j_2}\wedge\cdots\wedge\pi_{j_p} \Rightarrow \pi_{j_1}$ (which holds because $\pi \Rightarrow \pi_{j_1}$) and obtain an analogous identity in which the role of $\beta$ in Eq. (17) is now played by $\alpha_{j_1}$. As $P_q$ is multi-linear, the order of partials does not matter, and this allows us to remove the operator $\alpha_{j_1}\,\partial/\partial\alpha_{j_1}$ from the RHS of Eq. (17). By repeating the argument for $\alpha_{j_2}$, $\alpha_{j_3}$, etc., we remove $\alpha_{j_2}\,\partial/\partial\alpha_{j_2}$, then $\alpha_{j_3}\,\partial/\partial\alpha_{j_3}$, etc., from the RHS, which leaves exactly $\beta\,\partial P_q/\partial\beta$.

Corollary

(1) Assume $\pi$ is defined by a point predicate $A_1 = a^1 \wedge \cdots \wedge A_m = a^m$ for some values $a^i \in D_i$. For each $i$, denote by $j_i$ the index of the 1-dimensional statistic associated with the value $a^i$; in other words, the predicate $\pi_{j_i}$ is $A_i = a^i$. Then,

$\mathbb{E}[q] \;=\; \frac{n}{P}\,\alpha_{j_1}\cdots\alpha_{j_m}\,\frac{\partial^m P}{\partial\alpha_{j_1}\cdots\partial\alpha_{j_m}}.$   (19)

(2) Let $q$ be the query defined by a predicate $\pi$ as in Eq. (16). Then,

$\mathbb{E}[q] \;=\; \frac{n}{P}\,\sum_{a^1\models\pi^1}\cdots\sum_{a^m\models\pi^m}\ \alpha_{j_1}\cdots\alpha_{j_m}\,\frac{\partial^m P}{\partial\alpha_{j_1}\cdots\partial\alpha_{j_m}}.$   (20)
Proof.

(1) Eq. (19) follows from Eq. (10), Eq. (18), and the fact that $P_q|_{\beta=1} = P$: since the 1-dimensional statistics are complete, the point predicate $\pi$ is equivalent to $\pi_{j_1}\wedge\cdots\wedge\pi_{j_m}$. (2) Follows from (1) by expanding $\pi$ as a disjunction of point predicates, one for each combination of values $a^1\models\pi^1, \dots, a^m\models\pi^m$, and summing Eq. (19) over them.

In order to compute a query using Eq. (20), we would have to examine all $m$-dimensional points that satisfy the query's predicate, convert each point into the corresponding 1D statistics, and use Eq. (19) to estimate the count of the number of tuples at this point. Clearly, this is inefficient when $\pi$ contains any range predicate covering many point queries.

New formula for derivatives: Thus, to compute $\mathbb{E}[q]$, one has to evaluate several partial derivatives of $P$. Recall that $P$ is stored in a highly compressed format, and therefore, computing a derivative may involve nontrivial manipulations. Instead, we use the fact that our polynomial is overcomplete, meaning that each $P_j = \partial P/\partial\alpha_j$, where $j \in C_i$, does not depend on any variable in $C_i$ (Eq. (7)). Let $\pi^i$ be any predicate on the attribute $A_i$. Then,

$\sum_{j \in C_i:\ a_j\models\pi^i} \alpha_j\,\frac{\partial P}{\partial\alpha_j} \;=\; P\Big|_{\alpha_\ell := 0\ \text{for all}\ \ell\in C_i\ \text{with}\ a_\ell\not\models\pi^i}.$   (21)

Thus, in order to compute the summation on the left, it suffices to evaluate $P$ after setting to 0 the values of all variables $\alpha_\ell$, $\ell \in C_i$, whose associated values $a_\ell$ do not satisfy the predicate $\pi^i$ (this is what the condition $a_\ell\not\models\pi^i$ checks).

Finally, we combine this with Eq. (20) and obtain the following, much simplified formula for answering a query $q$, defined by a predicate $\pi$ of the form Eq. (16):

$\mathbb{E}[q] \;=\; \frac{n}{P}\cdot P\Big|_{\alpha_\ell := 0\ \text{for every attribute}\ A_i\ \text{and every}\ \ell\in C_i\ \text{with}\ a_\ell\not\models\pi^i}.$
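A minimal sketch of this evaluation strategy on a toy summary with only 1D statistics (illustrative values; the same idea applies to the compressed polynomial with multi-dimensional terms): a conjunctive query is answered with two evaluations of $P$, one at the trained values and one with the non-qualifying 1D variables zeroed.

def eval_P(A_vars, B_vars, C_vars):
    """Compressed form of the toy polynomial with only 1D statistics."""
    return sum(A_vars.values()) * sum(B_vars.values()) * sum(C_vars.values())

# Trained 1D variables (illustrative values) and instance cardinality.
A_vars = {"a1": 1.3, "a2": 0.7}
B_vars = {"b1": 0.9, "b2": 1.1}
C_vars = {"c1": 1.0, "c2": 1.0}
n = 10

def expected_count(pred_a, pred_b, pred_c):
    """E[q] = n * P(non-qualifying 1D variables set to 0) / P."""
    zero = lambda vars_, pred: {k: (v if pred(k) else 0.0) for k, v in vars_.items()}
    restricted = eval_P(zero(A_vars, pred_a), zero(B_vars, pred_b), zero(C_vars, pred_c))
    return n * restricted / eval_P(A_vars, B_vars, C_vars)

# E[ number of tuples with A = a1 ] -- B and C unconstrained.
print(expected_count(lambda a: a == "a1", lambda b: True, lambda c: True))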