Optimally Leveraging Density and Locality to Support LIMIT Queries

Abstract

Existing database systems are not optimized for queries with a LIMIT clause—operating instead in an all-or-nothing manner. In this paper, we propose a fast LIMIT query evaluation engine, called NeedleTail, aimed at letting analysts browse a small sample of the query results on large datasets as quickly as possible, independent of the overall size of the result set. NeedleTail introduces density maps, a lightweight in-memory indexing structure, and a set of efficient algorithms (with desirable theoretical guarantees) to quickly locate promising blocks, trading off locality and density. In settings where the samples are used to compute aggregates, we extend techniques from survey sampling to mitigate the bias in our samples. Our experimental results demonstrate that NeedleTail returns results 4× faster on HDDs and 9× faster on SSDs on average, while occupying up to 23× less memory than existing techniques.


1 Introduction

Users of databases are frequently interested in retrieving only a small subset of records that satisfy a query, e.g., by specifying a LIMIT clause in a SQL query. Although there has been extensive work on top-k queries that retrieve the largest or smallest results in a result set, or on techniques that provide a random sample of results, there has been relatively little work on so-called any-k queries that simply retrieve a few results without any requirements on the ordering or randomness of the results.

Any-k has many applications in exploratory data analysis. When exploring new or unfamiliar datasets, users often issue arbitrary queries, examine a subset or sample (a “screenful”) of records, and incrementally refine and re-issue their queries based on observations from this sample [59, 29]. To support this form of exploratory browsing, many popular SQL IDEs implicitly limit the number of result records displayed, recognizing the fact that users often do not need to, nor have the time to, see all of the result records. For example, phpMyAdmin for MySQL has a MaxRows configuration parameter; pgAdmin for PostgreSQL has a rowset size parameter; and SQL Developer for Oracle has an array fetch size parameter. Even outside the context of SQL IDEs, in tabular interfaces such as Microsoft Excel or Tableau’s Table View, users only browse or examine a “screenful” of records at a time.

LIMIT clauses provide a way to express any-k in traditional databases, and are supported by most modern database systems [52]. Surprisingly, despite their fundamental importance, there has been little work on executing LIMIT queries efficiently. Existing databases support LIMIT/any-k by simply executing the entire query and returning the first result records as soon as they are ready, via pipelining. This is often not interactive on large datasets, especially if the query involves selective predicates. To address this limitation, in this paper we develop methods that allow us to quickly identify a subset of records for arbitrary any-k queries.

Specifically, we develop indexing and query evaluation techniques to address the any-k problem. Our new data exploration engine, NeedleTail, employs (i) a lightweight indexing structure, density maps, tailored for any-k, along with (ii) efficient algorithms that operate on density maps and select a sequence of data blocks that are optimal for locality (i.e., how close the blocks are to each other) or optimal for density (i.e., how dense the blocks are in terms of containing relevant results). We couple these algorithms with (iii) an algorithm that is optimal for overall I/O, obtained by employing a simple disk model, as well as a hybrid algorithm that selects between the locality- and density-optimal variants. Finally, while any-k is targeted at browsing, to allow the retrieved results to be used for a broader range of use cases involving aggregation (e.g., for computing statistics or visualizations), (iv) we extend statistical survey sampling techniques to eliminate the bias in the retrieved results.

We now describe these contributions and the underlying challenges in more detail.

(i) Density Maps: A Lightweight Indexing Scheme. Inspired by bitmap indexes, which are effective for arbitrary read-only queries but somewhat expensive to store, we develop a lightweight indexing scheme called density maps. Like bitmaps, we store a density map for each value of each attribute. However, unlike bitmaps, which store a column of bits for each distinct value of an attribute, density maps store an array of densities for each block of records, where the array contains one entry per distinct value of the attribute. This representation allows density maps to be 3-4 orders of magnitude smaller than bitmaps, so we can easily store a density map for every attribute in the dataset, and these can comfortably fit in memory even on large datasets. Note that prior work has identified other lightweight statistics that help rule out blocks that are not relevant for certain queries [62, 43, 38, 15, 24]: we compare against the appropriate ones in our experiments and argue why they are unsuitable for any-k in Section 10.

(ii)-(iii) Density Maps: Density vs. Locality. In order to use density maps to retrieve records that satisfy query constraints, one approach would be to identify blocks that are likely to be “dense” in that they contain more records that satisfy the conditions and preferentially retrieve those blocks. However, this density-based approach may lead to excessive random access. Because random accesses—whether in memory, on flash, or on rotating disks—are generally orders of magnitude slower than sequential accesses, this is not desirable. Another approach would be to identify a sequence of consecutive blocks with records that satisfy the conditions, and take advantage of the fact that sequential access is faster than random access, exploiting locality. In this paper, we develop algorithms that are optimal from the perspective of density (called Density-Optimal) and from the perspective of locality (called Locality-Optimal).

Overall, while we would like to optimize for both density and locality, optimizing for one usually comes at the cost of the other, so developing the globally optimal strategy to retrieve records is non-trivial. To combine the benefits of density and locality, we need a cost model for the storage media that can help us reason about their relative benefits. We develop a simple cost model in this paper, and use it to develop an algorithm that is optimal from the perspective of overall I/O (called IO-Optimal). We further extend the density- and locality-optimal algorithms to develop a hybrid algorithm (called Hybrid) that fuses the benefits of both approaches. We integrated all four of these algorithms, coupled with our indexing structures, into our NeedleTail data exploration engine. On both synthetic and real datasets, we observe that NeedleTail can be several orders of magnitude faster than existing approaches when returning records that satisfy user conditions.

(iv) Aggregate Estimation with Any-k Results. In some cases, instead of just optimizing for retrieving any-k records, it may be important to use the retrieved results to estimate some aggregate value. Although NeedleTail can retrieve more samples in the same time, the estimate of the aggregate may be biased, since NeedleTail may preferentially sample from certain blocks. This is especially true if the attribute being aggregated is correlated with the layout of data on disk or in memory. We employ survey sampling [31, 42] techniques to support accurate aggregate estimation while retrieving any-k records. We adapt cluster sampling techniques to reason about block-level sampling, along with unequal probability estimation techniques to correct for the bias. With these changes, NeedleTail is able to achieve error rates similar to pure random sampling—our gold standard—in much less time, while returning multiple orders of magnitude more records for the analyst to browse. Thus, even when computing aggregates, NeedleTail is substantially better than other schemes.

Of course, there are other techniques for improving analytical query response time, including in-memory caching, materialized views, precomputation, and materialized samples. These techniques can also be applied to any-k, and are largely orthogonal to our work. One advantage of our density-map approach over much of this related work is that it does not assume that a query workload is available or that queries are predictable, which enables us to support truly exploratory data analysis. Similarly, traditional indexing structures, such as B+ trees, could efficiently answer some any-k queries. However, to support any-k queries with arbitrary predicates, we would need B+ trees on every single attribute or combination of attributes, which is often prohibitive in terms of space. Bitmap indexes are a more space-efficient approach, but even so, storing a bitmap in memory for every single value of every single attribute (10s of values for 100s of attributes) is infeasible for large datasets, as we show in our experimental analysis.

Contributions and Outline. The chief contribution of this paper is the design and development of NeedleTail, an efficient data exploration engine for both browsing and aggregate estimation that retrieves samples orders of magnitude faster than other approaches. NeedleTail’s design includes its density map indexing structure, retrieval algorithms (Density-Optimal, Locality-Optimal, Hybrid, and IO-Optimal) with extensions for complex queries, and statistical techniques to correct for biased sampling in aggregate estimation.

We formalize the browsing problem in Section 2, describe the indexing structures in Section 3, the any-k sampling algorithms in Sections 4 and 5, the statistical debiasing techniques in Section 6, extensions to complex queries in Section 7, and the system architecture in Section 8. We evaluate NeedleTail in Section 9.

2 Problem Formulation

We now formally define the any-k problem. We consider a standard OLAP data exploration setting where we have a database with a star schema consisting of continuous measure attributes and categorical dimension attributes. For simplicity, we focus on a single database table T, with dimension attributes D1, …, Dd and measure attributes M1, …, Mm, leading to the schema T(D1, …, Dd, M1, …, Mm); our techniques generalize beyond this case, as we will show in later sections. We use ci to denote the number of distinct values of the dimension attribute Di.

Consider a selection query Q on T where the selection condition S is a boolean formula formed out of equality predicates on the dimension attributes. We define the set of records which form the result set of query Q to be the valid records with respect to Q. As a concrete example, consider a data analyst exploring campaign finance data. Suppose they want to find any k individuals who donated to Donald Trump, live in a certain county, and are married. Here, the query Q has a selection condition S that is a conjunction of three predicates—donated to Trump, lives in a particular county, and is married.

Given the query Q, traditional databases would return all the valid records for Q in T, irrespective of how long that takes. Instead, we define an any-k query Qk as the query which returns any k valid records out of the set of all valid records for a given query Q. Qk can be written in SQL as follows:

SELECT * FROM T WHERE S LIMIT k

For now we consider simple selection queries of the above form; we show how to extend our approach to support aggregates in Section 6 and how to support grouping and joins in Section 7.

We formally state the any-k sampling problem for simple selection queries as follows:

Problem 1 (any-k sampling)

Given an any-k query Qk, the goal of any-k sampling is to retrieve any k valid records for Q in as little time as possible.

Note that unlike random sampling, any-k does not require the returned records to be randomly selected. Instead, any-k sampling prioritizes query execution time over randomness. We will revisit the issue of randomness in Section 6. Next, we develop the indexing structures required to support our any-k algorithms.

3 Index Structure

To support the fast retrieval of any-k samples, we develop a lightweight indexing structure called the DensityMap. DensityMaps share some similarities with bitmaps, so we first briefly describe bitmap indexes. We then discuss how DensityMaps address the shortcomings of bitmap indexes.

3.1 Bitmap Index: Background

Bitmap indexes [46] are commonly used for ad-hoc queries in read-mostly workloads [17, 64, 50, 65]. Typically, the index contains one bitmap for each distinct value of each dimension attribute in a table. Each bitmap is a vector of bits in which the i-th bit is set to 1 if the i-th record has that attribute value, and 0 otherwise. If a query has an equality predicate on only one attribute value, we can simply look at the corresponding bitmap for that attribute value and return the records whose bits are set in the bitmap. For queries that have more than one predicate, or range predicates, we must perform bitwise AND or OR operations on those bitmaps before fetching the valid records. Bitwise operations can be executed rapidly, particularly when bitmaps fit in memory.

Although bitmap indexes have proven to be effective for traditional OLAP-style workloads, these workloads typically consist of queries in which the user expects to receive all valid records that match the filter. Nevertheless, bitmap indexes can be used for any-k sampling. One simple strategy would be to perform the bitwise operations across all predicated bitmaps, scan the resulting bitmap, and return the first k records with matching bits. However, the efficiency of this strategy greatly depends on the layout of the valid records. For example, if all valid records are clustered near the end of the dataset, the system would have to scan the entire bitmap before finding the set bits. Furthermore, returning the first k matching records may be suboptimal if those records are dispersed across the dataset, since retrieving each record would result in a random access. If a different set of k records existed later in the dataset, but with better locality, a preferred strategy might be to return that second set of records instead.

In addition to these limitations for any-k sampling, bitmap indexes often take up a large amount of space, since we need to store one bitmap per distinct value of each dimension attribute. As the number of attribute values and dimension attributes increases, a greater number of bitmaps is required. Even with various methods to compress bitmaps, such as BBC [12], WAH [67], PLWAH [22], and EWAH [40], density maps consume orders of magnitude less space than bitmap indexes, as we show in Section 9.

3.2 Density Map Index

We now describe how DensityMaps address the shortcomings of bitmap indexes. Our design starts from the observation that modern hard disk drives (HDDs) typically have 4KB minimum storage units called sectors, and systems may only read or write from HDDs in whole sectors. Therefore, it takes the same amount of time to retrieve a block of data as it does a single record. DensityMaps take advantage of this fact to reason about the data at the block level, rather than at the record level as bitmaps do. Similarly, SSDs and RAM access pages or cache lines of data at a time.

Thus, for each block, a DensityMap stores the frequency of set bits in that block, termed the density, rather than enumerating the set bits. This enables the system to “skip ahead” to dense blocks to retrieve the any-k samples. Further, by storing block-level statistics rather than record-level statistics, DensityMaps can greatly reduce the amount of indexing space required compared to bitmaps. In fact, DensityMaps can be thought of as a form of lossy compression. Overall, by storing density-based information at the block level, we benefit from smaller size and more relevant information tailored to the any-k sampling problem.

Formally, for each dimension attribute and each attribute value it takes on, we store a DensityMap consisting of one entry per block on disk. We can express a DensityMap as an array [d1, d2, …, db], where dj is the percentage of tuples in block j that satisfy the corresponding equality predicate and b is the number of blocks. Note that while DensityMaps are stored per column, the actual underlying data is assumed to be stored in row-oriented fashion.

Figure 1: Example table and its DensityMaps.
Example 1

The table in Figure 1 is stored over 9 blocks. The DensityMap shown for one of its attribute values begins [0.2, 0.1, …], indicating that 20 percent of the tuples in block 1 and 10 percent of the tuples in block 2 have that value, respectively.

DensityMaps are a very flexible index structure as they can estimate the percentage of valid records for any ad-hoc query with single or nested selection constraints. For queries with more than one predicate, we can combine multiple DensityMaps together to calculate the estimated percentage of valid records per block, multiplying densities for conjunction and adding them for disjunction. In performing this estimation, we implicitly assume that the DensityMaps are independent, akin to selectivity estimation in query optimization [25]. As in query optimization, this assumption may not always hold, but as we demonstrate in our experiments on real datasets, it still leads to effective results. In particular, DensityMaps drastically reduce the number of disk accesses by skipping blocks whose estimated densities are zero (and thus definitely do not contain valid records). Some readers may be reminded of other statistics used for query optimization, such as histograms [25]. However, unlike histograms which store the overall frequencies of records for entire relations, DensityMaps store this information at a finer block-level granularity.

Example 2

In Figure 1, for a given query Q whose selection constraint is a conjunction of two equality predicates, the estimated DensityMap after combining the two corresponding DensityMaps begins [0.02, 0.03, …], indicating (approximately) that block 1 has 2 percent matching records and block 2 has 3 percent matching records for Q.

Thus, compared to bitmaps, DensityMaps are a coarser statistical summary of valid records in each block for each attribute value. DensityMaps save significant storage costs by keeping information at the block-level instead of record-level, making maintaining DensityMaps in memory feasible. Moreover, coupled with efficient algorithms, which we describe in detail next, DensityMaps can decrease the number of blocks read from disk for any- sampling and therefore reduce the query execution time.
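To make the block-skipping logic concrete, the following C++ sketch shows one way per-block densities could be combined under the independence assumption described above. The DensityMap struct and the combine function are illustrative names for this sketch, not NeedleTail's actual implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative density map: one density in [0, 1] per block for a single
// (attribute, value) pair.
struct DensityMap {
    std::vector<double> density;  // density[j] = fraction of tuples in block j matching the predicate
};

// Combine the DensityMaps of the query predicates under the independence
// assumption: multiply densities for AND, add them (capped at 1) for OR.
std::vector<double> combine(const std::vector<DensityMap>& maps, bool conjunctive) {
    std::vector<double> out(maps[0].density.size(), conjunctive ? 1.0 : 0.0);
    for (const DensityMap& dm : maps) {
        for (std::size_t j = 0; j < out.size(); ++j) {
            if (conjunctive) out[j] *= dm.density[j];
            else             out[j] = std::min(1.0, out[j] + dm.density[j]);
        }
    }
    return out;  // blocks with out[j] == 0 can be skipped entirely
}

Multiplying each combined density by the number of records per block then gives the estimated number of valid records in that block.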

One concern with DensityMaps is that, since we admit into our any-k sample set all records from a block which satisfy the constraints, the samples we retrieve may be biased with respect to the data layout. In Section 6, we describe techniques to correct for the bias due to possible correlations between the samples and the data layout by applying cluster sampling and unequal probability estimation techniques.

4 Any-K Algorithms: Extremes

We introduce two algorithms which take advantage of our DensityMaps to perform fast any-k sampling. The primary insights for these algorithms come from the following two observations.

First, a high density block has more valid records than a low density block. Thus, it is more beneficial to retrieve the high density block, so that overall, fewer blocks are retrieved.

Observation 1 (Density: Denser is better.)

Under the same circumstances, retrieving a block with high density is preferable to retrieving a low density block.

On a HDD, the time taken to retrieve a block from disk can be split into seek time and transfer time. The seek time is the time it takes to locate the desired block, and the transfer time is the time required to actually transfer the bytes of data in the block from the HDD to the operating system. Blocks which are far apart incur additional seek time, while neighboring blocks typically only require transfer time. Thus, retrieving neighboring blocks is preferred. Similar locality arguments hold (to varying degrees) on SSDs and in RAM.

Observation 2 (Locality: Closer is better)

Under the same circumstances, retrieving neighboring blocks is preferable to retrieving blocks which are far apart.

Our basic any-k sampling algorithms take advantage of each of these observations: Density-Optimal optimizes for density, while Locality-Optimal optimizes for locality. These two algorithms are optimal extremes, each favoring just one of density or locality.

On different types of storage media, the two observations can have different amounts of impact. For example, the locality observation may not be as important for in-memory data and solid-state drives (SSDs), since the random I/O performance of these storage media is not as poor as it is on HDDs. For our purposes, we focus on HDDs, which are the most common type of storage device, but we also evaluate our techniques on SSDs.

To judge which of these two algorithms is better in a given setting, or to combine the benefits of these two algorithms, we require a cost model for storage media, which we present in Section 5.

Table 1 provides a summary of the notation used in the following sections.

Symbol Meaning
Number of blocks
Number of predicates
Number of samples received
An empirical constant to sequentially access one block
DensityMap indicated in the WHERE clause
the th entry of DensityMap
Sorted DensityMap indicated in the WHERE clause
the th entry of the th sorted DensityMap in
Threshold
Set of block IDs with their aggregated densities
Set of block IDs seen so far
Set of block IDs returned by the algorithm
Table 1: Table of Notation

4.1 Density-Optimal Algorithm

Density-Optimal is based on the threshold algorithm proposed by Fagin et al. [23]. The goal of Density-Optimal is to use our in-memory DensityMap index to retrieve the densest blocks until k valid records are found. The unmodified threshold algorithm by Fagin et al. would attempt to find a fixed number of densest blocks. However, in our setting, we do not know that number in advance: we only know k, the number of valid tuples required, so we need to determine the number of blocks on the fly.

For fast execution of Density-Optimal, an additional sorted DensityMap data structure is required. For every DensityMap, we sort its entries in descending order of density to create a sorted DensityMap. Every element of a sorted DensityMap has two attributes: the block ID and the density, i.e., the percentage of tuples in that block which satisfy the corresponding constraint. Thus, while the first entry of a DensityMap refers to the first block of the data, the first entry of a sorted DensityMap refers to the densest block in the data. Sorted DensityMaps are precomputed at data loading time and stored in memory along with the DensityMaps, so the sorting time does not affect query execution times.

High-Level Intuition. At a high level, the algorithm examines each of the relevant sorted DensityMaps corresponding to the predicates in the query. It traverses these sorted DensityMaps in sorted order, while maintaining a record of the blocks with the highest overall density for the query, i.e., the highest number of valid tuples. The algorithm stops when the maintained blocks have at least k valid records, and it is guaranteed that none of the unexplored blocks can have a higher overall density than the ones maintained.

Algorithmic Details. Algorithm 1 provides the full pseudocode. With sorted DensityMaps, it is easy to see how Density-Optimal handles a query with a single predicate: it simply selects the sorted DensityMap which corresponds to that predicate and retrieves its blocks in order until k valid records are found. For multiple predicates, the execution of Density-Optimal is more involved. Depending on how the predicates are combined, the combine operator ⊕ could mean product, if the predicates are all combined using ANDs, or sum, if the predicates are all combined using ORs. Each DensityMap referenced by the query represents one predicate, and each has a sorted variant. At each iteration, we traverse one position further down every sorted DensityMap, while maintaining a running threshold τ (the combination, under ⊕, of the densities at the current position of each sorted DensityMap), and also keeping track of all the block IDs encountered across the sorted DensityMaps. This threshold represents the minimum aggregate density that a block must have across the predicates before we are sure that it is one of the densest blocks. During each iteration, we consider all blocks examined in this or previous iterations that have not yet been selected to be part of the output. If the one with the highest aggregate density has density greater than τ, it is added to the output; this repeats until no such block remains. We know that τ is an upper bound on the aggregate density of any block that has not been seen in this or previous iterations, due to the monotonicity of the operator ⊕. Thus, Density-Optimal maintains the following invariant: a block is selected to be part of the output iff its density is as good as or better than that of any block that has not yet been selected. Overall, Density-Optimal adds blocks to the output in decreasing order of density, and it terminates when the number of valid records in the selected output blocks is at least k.

To retrieve the any-k samples, we then load the blocks returned by Density-Optimal into memory and return all valid records seen in those blocks. If the total number of query results in those blocks is less than k, we re-execute Density-Optimal on the blocks that were not retrieved in previous invocations.

Fetch Optimization. Depending on the order of the blocks returned by Density-Optimal, the system may perform many unnecessary random I/O operations. For example, if Density-Optimal returns blocks out of their on-disk order, the system may read one block, seek forward to a later block, and then seek back to an earlier block, resulting in expensive disk seeks. Instead, we can sort the blocks before fetching them from disk, thereby minimizing random I/O and the overall query execution time.

Algorithm 1: Density-Optimal (pseudocode). The listing initializes an empty output set and then iterates in lockstep over the sorted DensityMaps of the query predicates: at each position it updates the running threshold, records newly encountered blocks together with their aggregate densities, moves every seen-but-unselected block whose aggregate density exceeds the threshold into the output in decreasing order of density, and returns as soon as the output blocks are estimated to contain at least k valid records.
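The C++ sketch below mirrors the traversal just described, simplified to conjunctive (AND) predicates so the combine operator is a product. The SortedEntry type, the recordsPerBlock parameter, and the densityOptimal function name are assumptions of this sketch rather than NeedleTail's actual code.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct SortedEntry { uint32_t block; double density; };  // one entry of a sorted DensityMap

// sorted: one descending-sorted DensityMap per predicate.
// dm: the corresponding unsorted DensityMaps, dm[p][block].
// Returns block IDs whose estimated valid records total at least k.
std::vector<uint32_t> densityOptimal(const std::vector<std::vector<SortedEntry>>& sorted,
                                     const std::vector<std::vector<double>>& dm,
                                     double recordsPerBlock, double k) {
    std::unordered_map<uint32_t, double> seen;  // seen-but-unselected blocks -> aggregate density
    std::vector<uint32_t> out;
    double found = 0.0;  // estimated valid records covered by the output so far
    for (std::size_t pos = 0; pos < sorted[0].size(); ++pos) {
        double threshold = 1.0;  // upper bound on the aggregate density of unseen blocks
        for (std::size_t p = 0; p < sorted.size(); ++p) {
            const SortedEntry& e = sorted[p][pos];
            threshold *= e.density;
            if (!seen.count(e.block)) {
                double agg = 1.0;
                for (std::size_t q = 0; q < dm.size(); ++q) agg *= dm[q][e.block];
                seen[e.block] = agg;
            }
        }
        // Move every seen block that beats the threshold into the output, densest first.
        std::vector<std::pair<double, uint32_t>> ready;
        for (const auto& kv : seen)
            if (kv.second >= threshold) ready.push_back({kv.second, kv.first});
        std::sort(ready.rbegin(), ready.rend());
        for (const auto& r : ready) {
            out.push_back(r.second);
            found += r.first * recordsPerBlock;
            seen.erase(r.second);
            if (found >= k) return out;
        }
    }
    return out;  // fewer than k estimated valid records exist
}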

Guarantees. We now show that Density-Optimal retrieves the minimum set of blocks when optimizing for density.

Theorem 1 (Density Optimality)

Under the independence assumption, Density-Optimal returns the set of blocks with the highest densities containing at least k valid records.

Since Density-Optimal is a significant modification of the threshold algorithm, the proof of the above theorem does not follow directly from prior work. Proof. The proof is composed of two parts: first, we demonstrate that Density-Optimal adds blocks to the output in order of decreasing overall density; second, we demonstrate that Density-Optimal stops only when the number of valid records in the output is at least k. The second part follows directly from the termination condition in the pseudocode. We focus on the first part, which we prove using an inductive argument. Assume that the blocks added to the output up through the i-th iteration satisfy the property, and let τ_i denote the threshold of the i-th iteration; note that τ_{i+1} ≤ τ_i. Consider the blocks that have been seen but not yet selected at the start of the (i+1)-th iteration. These blocks fall into two categories: either they had already been seen in the i-th iteration, and hence have densities no greater than τ_i, or they were first encountered in the (i+1)-th iteration, and, due to monotonicity, once again have densities no greater than τ_i. Furthermore, any blocks that have not yet been examined have densities no greater than τ_{i+1} ≤ τ_i. Since all blocks added to the output at iteration i or earlier have densities greater than or equal to τ_i, all the blocks still under contention for the output—those seen but not selected, or those yet to be examined—have densities no higher than those already in the output. Now, in iteration i+1, we add all seen blocks whose densities are greater than τ_{i+1}, in decreasing order. All of these blocks have higher densities than any block that has yet to be examined (once again using monotonicity). Thus, any blocks added to the output in iteration i+1 have densities no higher than those added previously, and are the best among the blocks currently under contention and those that will be encountered in future iterations.

4.2 Locality-Optimal Algorithm

Our second algorithm, Locality-Optimal, prioritizes locality rather than density, aiming to identify the shortest sequence of blocks that is guaranteed to contain k valid records. The naive approach would be to consider the sequence formed by every pair of blocks (along with all of the blocks in between), leading to an algorithm that is quadratic in the number of blocks. Instead, Locality-Optimal, described below, is linear in the number of blocks.

High-level Intuition. Locality-Optimal moves across the sequence of blocks using a sliding window formed by a start and an end pointer, and eventually returns the smallest possible window with k valid records. At each point, Locality-Optimal ensures that the window has at least k valid records within it, by first advancing the end pointer until the constraint is met, and then advancing the start pointer until the constraint is once again violated. It can be shown that this approach considers all minimal sequences of blocks with k valid records. Subsequently, Locality-Optimal returns the smallest such sequence.

Algorithmic Details. The pseudocode for the algorithm is listed in Algorithm 2. The Locality-Optimal algorithm operates on an array of values formed by applying the combine operator ⊕ to the predicate DensityMaps, one block at a time. At the start, both pointers are at the value corresponding to the density of the first block. We move the end pointer to the right until the number of valid records between the two pointers is no less than k; at this point, we have our first candidate sequence containing at least k valid records. We then move the start pointer to the right, checking whether each resulting sequence still contains at least k valid records, and continuing until the constraint of having at least k valid records is once again violated. Afterwards, we once again operate on the end pointer. At all times, we maintain the smallest sequence found so far, replacing it when we find a new sequence that is smaller.

Algorithm 2: Locality-Optimal (pseudocode). The listing first computes the per-block aggregate densities by applying the combine operator to the predicate DensityMaps, then slides a window over the blocks: the end pointer advances until the window is estimated to contain at least k valid records, the start pointer then advances until the estimate drops below k, and the smallest valid window encountered is retained and finally returned.
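A compact C++ sliding-window sketch of this procedure follows; est holds the estimated number of valid records per block (aggregate density times records per block), and the function name and return convention are choices of this sketch only.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the smallest [start, end] block range estimated to hold at least
// k valid records; assumes est is non-empty and falls back to the full
// range if no window qualifies.
std::pair<uint32_t, uint32_t> localityOptimal(const std::vector<double>& est, double k) {
    uint32_t bestStart = 0, bestEnd = static_cast<uint32_t>(est.size()) - 1;
    double windowSum = 0.0;
    uint32_t start = 0;
    for (uint32_t end = 0; end < est.size(); ++end) {
        windowSum += est[end];               // grow the window to the right
        while (windowSum - est[start] >= k)  // shrink from the left while still valid
            windowSum -= est[start++];
        if (windowSum >= k && end - start < bestEnd - bestStart) {
            bestStart = start;               // smallest valid window so far
            bestEnd = end;
        }
    }
    return {bestStart, bestEnd};
}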

Guarantees. We now show that Locality-Optimal retrieves the minimum sequence of blocks when optimizing for locality.

Theorem 2 (Locality Optimality)

Under the independence assumption, Locality-Optimal returns the smallest sequence of blocks that contains at least k valid records.

We demonstrate that, for every block i, Locality-Optimal at some point considers the smallest sequence of blocks with k valid records beginning at block i, thereby proving the above theorem. Proof. For i = 1, this is easy to see: the end pointer of Locality-Optimal starts at block 1 and only increases, and the start pointer is not moved until a valid sequence of blocks is found, so by construction Locality-Optimal considers the smallest sequence of blocks starting at block 1. For the remaining i's, we prove the claim by contradiction. Let the smallest sequence of blocks beginning at block i end at block e_i, where e_i ≥ i; we denote this sequence by [i, e_i]. Now, let e_{i−1} be the ending block of the smallest sequence of blocks starting at block i−1, denoted [i−1, e_{i−1}]. If e_i ≥ e_{i−1}, our Locality-Optimal algorithm considers the sequence [i, e_i] as it moves the end pointer forward. Assume to the contrary that e_i < e_{i−1}; that is, [i, e_i] is the smallest sequence of blocks starting at block i that has k valid records. Then [i, e_i] is a subsequence of [i−1, e_i], so [i−1, e_i] must also have at least k valid records. However, we already declared [i−1, e_{i−1}] to be the smallest sequence of blocks starting at block i−1 that has k valid records, and e_i < e_{i−1}, so we have a contradiction. The same argument applies to every i, so Locality-Optimal must consider the smallest sequence of blocks starting at block i for every i.

5 Hybrid ANY-K Algorithms

The two desired properties of density and locality can often be at odds with each other depending on the data layout: dense blocks may be far apart, and a run of neighboring blocks may contain many blocks with no valid records. In this section, we first present a cost model to estimate the I/O cost of an any-k algorithm, and then use it to design an any-k algorithm that is I/O-optimal, providing the best balance between density and locality, as well as a hybrid algorithm that selects between Density-Optimal and Locality-Optimal.

5.1 A Simple I/O Cost Model

Figure 2: I/O Cost Model for HDDs and SSDs

To set up the I/O cost model for HDDs (hard disk drives), we profile the storage system as described by Ruemmler et al. [47]. We randomly choose various starting blocks and record the time taken to fetch other blocks at varying distances away (where distance is measured in number of blocks). As shown in Figure 2, which uses a linear scale on the x-axis from x=2 until x=10 and a logarithmic scale after that, we observe that, with a block size of 256KB, the I/O cost is smallest when doing a sequential I/O operation to fetch the next block (~2ms), and it increases with the distance up to a certain maximum distance, after which it becomes constant (~12ms). We have overlaid our cost model estimate using a dashed blue line. More formally, for two blocks, we model the cost of fetching the second block immediately after the first as a function of the distance between them: when the distance is below the maximum seek distance, we use a simple linear fit obtained with the Python numpy.polyfit function; beyond that distance, the cost is the constant random-access cost.

On the other hand, as shown in Figure 2, the I/O cost model for SSDs is different from the one we see for HDDs. Overall, we see a roughly constant time (~0.6ms) to fetch a block (overlaid as a dashed blue line), independent of the block distance.
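The piecewise shape just described can be captured in a small C++ cost model; the coefficients below are placeholders standing in for the fitted values (e.g., the numpy.polyfit output), not measured constants.

#include <algorithm>
#include <cstdint>

// Assumed piecewise model: sequential reads are cheap, cost grows roughly
// linearly with block distance, and beyond some maximum distance every
// fetch costs the full random-access time.
struct HddCostModel {
    double seqCost = 2.0;      // ms, fetching the immediately following block
    double maxCost = 12.0;     // ms, a fully random access
    double slope = 0.01;       // ms per block of distance (placeholder fit)
    uint64_t maxDist = 1000;   // distance at which cost saturates (placeholder)

    double cost(uint64_t from, uint64_t to) const {
        uint64_t dist = (to > from) ? to - from : from - to;
        if (dist <= 1) return seqCost;                     // sequential read
        if (dist >= maxDist) return maxCost;               // effectively a random seek
        return std::min(maxCost, seqCost + slope * dist);  // linear-fit region
    }
};

// For SSDs the distance barely matters, so a constant suffices.
struct SsdCostModel {
    double cost(uint64_t, uint64_t) const { return 0.6; /* ms */ }
};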

5.2 IO-Optimal Algorithm

IO-Optimal considers both density and locality to search for the set of blocks that provides the minimum overall I/O cost. Specifically, given our cost model, we can use dynamic programming to find the optimal set of blocks with k estimated valid records.

We define C(s, i) as the minimal cost to retrieve s estimated valid records when block i is amongst the blocks fetched. We define OPT(s, i) as the cost to retrieve the optimal set of blocks with s estimated valid records when considering only the first i blocks. Finally, we denote by n_i the estimated number of valid records inside block i, derived, as before, using the combine operator ⊕. With this notation, we have:

The intuition is as follows: for each block i that has estimated valid records, either block i is in the final optimal set or it is not. If we decide to include block i, the cost is the minimum amongst the following: (i) the smallest I/O cost of having s − n_i samples with some earlier block l among the blocks fetched, plus the cost of jumping from block l to block i (i.e., C(s − n_i, l) + cost(l, i)), or (ii) the optimal cost over the first i − 1 blocks for s − n_i samples, plus the constant random I/O cost of jumping from some block among the first i − 1 blocks to block i (i.e., OPT(s − n_i, i − 1) plus the random-access cost).

For the second expression, if we exclude block i, then the optimal cost is the same as the optimal cost when considering only the first i − 1 blocks. Consequently, the optimal cost at block i is the smaller of these two cases. The full algorithm is shown in Algorithm 3, where the cost of fetching the first selected block is a fixed constant.

Algorithm 3: IO-Optimal (pseudocode). The listing initializes the dynamic-programming tables for the first block, fills them in increasing order of block index and sample count following the recurrence described above, and finally returns the sequence of blocks that realizes the minimum I/O cost of collecting k estimated valid records.

Guarantees. We can show the following property.

Theorem 3 (IO-Optimal)

Under the independence assumption and the constructed cost model for disk I/O, IO-Optimal returns the set of blocks with optimal I/O cost for fetching any k valid records.

The proof is listed in full detail in Appendix A.1.

5.3 Hybrid Algorithm

Even though IO-Optimal is able to return the optimal I/O cost for fetching any-k samples, its much higher computation cost (as we show in our experiments) makes it impractical for large datasets. We therefore propose Hybrid, which simply selects between the best of Density-Optimal and Locality-Optimal, using our I/O cost model, when a query is issued. Since Hybrid needs to run both algorithms to determine the set of blocks each would select, using Hybrid involves a higher up-front computational cost, but, as we will see, it leads to substantial performance benefits.
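As a small illustration of how such a selection could be priced, the sketch below sums the cost model over each candidate plan (reusing the illustrative HddCostModel struct from the earlier sketch) and keeps the cheaper one; planCost and hybrid are names invented for this sketch.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Price a plan: fetch its blocks in on-disk order and add up the modeled
// cost of each consecutive fetch, with a constant cost for the first block.
double planCost(std::vector<uint32_t> blocks, const HddCostModel& model) {
    std::sort(blocks.begin(), blocks.end());
    double total = model.seqCost;  // assumed constant cost for the first fetch
    for (std::size_t i = 1; i < blocks.size(); ++i)
        total += model.cost(blocks[i - 1], blocks[i]);
    return total;
}

std::vector<uint32_t> hybrid(const std::vector<uint32_t>& densityPlan,
                             const std::vector<uint32_t>& localityPlan,
                             const HddCostModel& model) {
    return planCost(densityPlan, model) <= planCost(localityPlan, model)
               ? densityPlan : localityPlan;
}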

6 Aggregate Estimation

So far, our any-k algorithms retrieve records without any consideration of how representative they are of the entire population of valid records. If these records are used to estimate an aggregate (e.g., a mean), there could be bias in this value due to possible correlations between the value and the data layout. While this is fine for browsing, it leaves the user unable to make any statistically significant claims about the aggregated value. To address this problem, we make two simple adjustments to extend our any-k algorithms. First, we introduce a Two-Phase sampling scheme in which we add a small amount of randomly sampled data to our any-k samples. This random data is added in a fashion that does not significantly affect the overall running time, while at the same time allowing us to correct for the bias easily. Second, we correct the bias by leveraging techniques from the survey sampling literature. Specifically, we leverage the Horvitz-Thompson [31] and ratio [42] estimators, as described below. Note that such approaches have been employed in other settings in approximate query processing [44, 41, 21, 32, 36], but their application to any-k is new.

6.1 Two-Phase Sampling

We propose a Two-Phase sampling scheme, in which we collect a large portion of the requested samples using an any-k algorithm and collect the rest in a random fashion. We denote by 1 − α the proportion of samples we retrieve using the any-k algorithm, and by α the proportion of samples we retrieve using random sampling. The user chooses the parameter α upfront based on how much random sampling they wish to add. While a larger α may reduce the number of total samples needed to obtain a statistically significant result, the time taken to retrieve random samples greatly exceeds the time taken to retrieve samples with our any-k algorithms. Therefore, α needs to be carefully chosen; we experiment with different values of α in Section 9.

More formally, if we let B be the set of blocks which have at least one valid record in them, we can describe Two-Phase sampling as follows: (1) Use an any-k algorithm to choose the B_d densest blocks from B, and derive (1 − α)k samples from them. (2) Uniformly randomly select B_r blocks from the remaining |B| − B_d blocks, and derive the remaining αk samples from them. Note that the dense blocks and the randomly selected blocks are disjoint.
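A minimal C++ sketch of this two-phase block selection, assuming the blocks are already paired with their combined densities, is shown below; the ChosenBlock type and twoPhaseSelect function are illustrative only. Each chosen block is tagged with its inclusion probability, which the estimators in the next subsection need.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

struct ChosenBlock { uint32_t id; double inclusionProb; };

// byDensity: (combined density, block ID) pairs for blocks with at least
// one valid record. Pick the bd densest blocks, then br more uniformly at
// random from the remainder.
std::vector<ChosenBlock> twoPhaseSelect(std::vector<std::pair<double, uint32_t>> byDensity,
                                        std::size_t bd, std::size_t br, std::mt19937& rng) {
    std::sort(byDensity.rbegin(), byDensity.rend());  // densest first
    std::vector<ChosenBlock> chosen;
    for (std::size_t i = 0; i < bd && i < byDensity.size(); ++i)
        chosen.push_back({byDensity[i].second, 1.0});  // dense blocks: always included
    std::vector<uint32_t> rest;
    for (std::size_t i = bd; i < byDensity.size(); ++i) rest.push_back(byDensity[i].second);
    std::shuffle(rest.begin(), rest.end(), rng);
    double p = rest.empty() ? 0.0 : static_cast<double>(std::min(br, rest.size())) / rest.size();
    for (std::size_t i = 0; i < br && i < rest.size(); ++i)
        chosen.push_back({rest[i], p});  // random blocks: probability br / (remaining blocks)
    return chosen;
}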

6.2 Unequal Probability Estimation

Within the Two-Phase sampling scheme, the probability a block is sampled is not uniform. Therefore, we must use an unequal probability estimator [56] and inversely weigh samples based on their selection probabilities. We introduce two different estimators for this: the Horvitz-Thompson estimator and the ratio estimator.

Requisite Notation

The goal of our two estimators is to estimate the true aggregate sum Y and the true aggregate mean μ of a measure attribute M given a query Q. We use y_i to denote the aggregate sum of M over the valid records in block i, and N to denote the total number of valid records for query Q. We can estimate N using the DensityMaps.

The estimators inversely weigh samples based on their probability of selection. So, we define the inclusion probability π_i as the probability that block i is included in the overall sample: π_i = 1 if block i is one of the B_d densest blocks, and π_i = B_r / (|B| − B_d) otherwise.

For the samples that come from the B_d any-k blocks, the probability of being chosen is always 1. After these blocks have been selected, a uniformly random subset of the remaining |B| − B_d blocks is chosen to produce the random samples; thus the probability that each of these blocks is chosen is B_r / (|B| − B_d).

We define the joint inclusion probability π_ij as the probability of selecting both blocks i and j for the overall sample: π_ij = 1 if both blocks are among the B_d densest blocks; π_ij equals the inclusion probability of the non-dense block if exactly one of the two is among the densest blocks; and π_ij = (B_r (B_r − 1)) / ((|B| − B_d)(|B| − B_d − 1)) if neither is.

Horvitz-Thompson Estimator

Using the Horvitz-Thompson [31] estimator, the aggregate sum Y is estimated as:

Ŷ = Σ_{i ∈ sampled blocks} y_i / π_i    (1)

As mentioned before, the block sums are inversely weighted by their inclusion probabilities to account for the different probabilities of selecting blocks. Based on Ŷ, we can also easily estimate μ by dividing by the size of the population:

μ̂ = Ŷ / N    (2)

The Horvitz-Thompson estimator guarantees that both Ŷ and μ̂ are unbiased estimates: E[Ŷ] = Y and E[μ̂] = μ. A full proof can be found in [31]. In addition, the Horvitz-Thompson estimator gives us a way to calculate the variances of Ŷ and μ̂, which represent the expected bounds of the estimates:

V̂(Ŷ) = Σ_i ((1 − π_i) / π_i²) y_i² + Σ_i Σ_{j ≠ i} ((π_ij − π_i π_j) / (π_i π_j π_ij)) y_i y_j    (3)
V̂(μ̂) = V̂(Ŷ) / N²    (4)
where the sums range over the sampled blocks.

Ratio Estimator

Although the Horvitz-Thompson estimator is an unbiased estimator, it is possible that the variances given by Equations 3 and 4 can be quite large if the aggregated variable is not well related to the inclusion probabilities [56]. To reduce the variance, the ratio estimator [42] may be used:

μ̂_r = (Σ_i y_i / π_i) / (Σ_i m_i / π_i)    (5)
Ŷ_r = N · μ̂_r    (6)

where m_i is the number of valid records in block i. The variances of these ratio estimates are given by:

(7)
(8)

While the ratio estimator is not precisely unbiased, in Equation 5, we see that the numerator is the unbiased Horvitz-Thompson estimate of the sum and the denominator is an unbiased Horvitz-Thompson estimate of the total number of valid records, so the bias tends to be small and decreases with increasing sample size.
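The two estimators reduce to a few lines of arithmetic over the sampled blocks; the following C++ sketch (with invented function names, and inputs aligned per sampled block) illustrates them for the mean.

#include <cstddef>
#include <vector>

// For sampled block i: sum[i] is the sum of the measure over its valid
// records, count[i] the number of valid records, pi[i] its inclusion
// probability; N is the (estimated) total number of valid records.
double horvitzThompsonMean(const std::vector<double>& sum,
                           const std::vector<double>& pi, double N) {
    double total = 0.0;
    for (std::size_t i = 0; i < sum.size(); ++i)
        total += sum[i] / pi[i];  // inverse-probability weighting
    return total / N;             // estimated mean = estimated sum / N
}

double ratioMean(const std::vector<double>& sum, const std::vector<double>& count,
                 const std::vector<double>& pi) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < sum.size(); ++i) {
        num += sum[i] / pi[i];    // HT estimate of the aggregate sum
        den += count[i] / pi[i];  // HT estimate of the number of valid records
    }
    return num / den;             // ratio estimate of the mean
}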

We compare the empirical accuracies of these two estimators in Section 9, and demonstrate how our Two-Phase sampling technique, when employing these estimators, provides accurate estimates of various aggregate values.

7 Grouping and Joins

So far, we have assumed that all our sampling queries have the form of the SELECT query given in Section 2, thus limiting our operations to a single database table, with simple selection predicates and no group-by operators. We now extend the any-k sampling problem formulation and our algorithms to handle more complex queries that involve grouping and join operations.

7.1 Supporting Grouping

Rather than computing a simple any-k result, users may want to retrieve k values per group, e.g., to compute an estimate of an aggregate value in each group.

Although a trivial way to do this would be to run a separate any-k query per group, in this section we discuss an algorithm that can share computation across groups in the common case when users want k values per group.

Consider an any-k query over a table T, with S representing the predicate in the WHERE clause. Let A be the grouping attribute with values a_1, …, a_g. The formal goal of this grouped any-k sampling can be stated as:

Problem 2 (grouped any-k sampling)

Given a query defined by a predicate S on table T, and a grouping attribute A, the goal of grouped any-k sampling is to retrieve any k valid records for each group of A in as little time as possible.

Our basic approach is to create a combined density map, which takes into account every group in the group-by operation, and run the any-k algorithm for all groups at once. This is akin to sharing scans in traditional databases.

In order to run our any-k algorithms for all groups, we first define the combined density of the i-th block as the product of two factors: (1) the density of the i-th block with respect to predicate S, and (2) the sum of the densities, in the i-th block, for the group values in A which still need to be sampled. The first factor has been discussed previously, while the second factor, which we denote g_i, can be defined as:

g_i = (1 / RPB) · Σ_{a_j ∈ A} min( RPB · d_i^{a_j}, k − k_j )    (9)

where RPB is the number of records per block, k_j is the number of samples already retrieved for group a_j, and d_i^{a_j} is the density of the i-th block for the value a_j. The expression inside the min function estimates the expected number of records of group a_j in block i, but limits the estimate by the number of samples left to be retrieved for that group. Thus, the combined density s_i · g_i, where s_i is the density of the i-th block with respect to predicate S, gives priority to groups which have had fewer than k samples retrieved so far, and groups which already have k samples no longer contribute to the combined density. The 1/RPB factor in front of the summation acts as a normalization factor to ensure that g_i, and thereby the combined density, are both density values between 0 and 1.

With this combined density estimate, we can now construct an iterative any-k algorithm for grouped sampling, similar to the algorithms in Sections 4 and 5, as sketched below. The main structure of the algorithm is as follows: (1) Update the combined densities of all blocks. (2) Run one of the any-k algorithms to retrieve the β blocks with the highest combined density. (3) Set the densities of the retrieved blocks to 0. (4) If k samples have not yet been retrieved for every group, go back to step (1). Since g_i depends on the number of samples already retrieved, it must be updated periodically to ensure the correctness of the combined densities. The parameter β controls the periodicity of these updates, and setting it becomes a trade-off between CPU time and I/O time. Setting β = 1 updates the densities after every block retrieval; while this more correctly prioritizes blocks and is likely to lead to fewer blocks retrieved overall, there is a high CPU cost in updating the densities after each retrieval. As β increases, the CPU cost goes down due to less frequent updates, but the overall I/O cost is likely to go up since the combined densities are not completely up-to-date for each block retrieved. Although our iterative algorithm is not particularly complex, and a globally I/O-optimal solution may perform better than our locally optimal one, our algorithm has the advantage of simple implementation and likely lower CPU overhead. We defer consideration of more sophisticated algorithms to future work.
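A small C++ sketch of the combined-density computation from Equation 9, with illustrative parameter names, is given below.

#include <algorithm>
#include <cstddef>
#include <vector>

// s[i]: density of block i w.r.t. the predicate; d[i][j]: density of block
// i for group j; retrieved[j]: samples already collected for group j;
// RPB: records per block; k: per-group target.
double combinedDensity(std::size_t i, const std::vector<double>& s,
                       const std::vector<std::vector<double>>& d,
                       const std::vector<int>& retrieved, double RPB, int k) {
    double g = 0.0;
    for (std::size_t j = 0; j < retrieved.size(); ++j) {
        double expected = RPB * d[i][j];                // expected records of group j in block i
        double needed = std::max(0, k - retrieved[j]);  // samples still missing for group j
        g += std::min(expected, needed);
    }
    return s[i] * (g / RPB);  // groups that are already satisfied contribute nothing
}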

Algorithm Details. The full algorithm for grouped any-k queries is shown in Algorithm 4. In Section 4, the count of retrieved samples was a single value; for the grouped any-k algorithm, it is an array with one entry per group, where each entry records the number of samples retrieved for that group. Every iteration consists of updating the combined density estimates, i.e., the priorities of the blocks, based on the number of samples retrieved so far (setting a block's priority to 0 if it has already been fetched), and then calling an any-k algorithm with these priorities and the number of blocks desired, β. The algorithm updates the per-group counts and ends only once every entry in the array is at least k.

Algorithm 4: Group-by any-k algorithm (pseudocode). The listing maintains a per-group count of retrieved samples; in each iteration it recomputes the combined density of every block, setting the priority of already-fetched blocks to 0, invokes one of the any-k algorithms to fetch the β highest-priority blocks, and updates the per-group counts, terminating once every group has at least k samples.

As we show in the next section, the key-foreign key join any-k problem is essentially equivalent to this grouped any-k formulation, and we evaluate the performance of our algorithm on both problems in Section 9.7.

Optimal Solution. As mentioned, the grouped any-k solution uses a heuristic to find the best blocks to retrieve. An I/O-optimal solution, similar to IO-Optimal from Section 5.2, could be derived using dynamic programming with a recursive relationship based on the notion of priority from Equation 9 and the disk model from Section 5.2. Unfortunately, the resulting dynamic program has even more dimensions than the one from Section 5.2. Since, as we show in Section 9, IO-Optimal already incurs a prohibitively high CPU cost in exchange for its optimal I/O time, we chose not to pursue this avenue.

Multiple Groupings. For multiple group-by attributes, we simply extend the above formula to account for every possible combination of values from the different grouping attributes. For example, if we have two group-by attributes, the summation in Equation 9 ranges over every pair of values drawn from the two attributes.

7.2 Supporting Key-Foreign Key Joins

Consider any-k sampling on the result of a key-foreign key join between two tables T1 and T2, where attribute F is the primary key in T1 and a foreign key in T2. Similar to grouping, the formal definition of join any-k sampling can be stated as:

Problem 3 (join any-k sampling)

Given a query defined by a predicate S and a join between tables T1 and T2 on attribute F, which is a primary key in T1 and a foreign key in T2, the goal of join any-k sampling is to retrieve any k valid joined records for each join value of F in as little time as possible.

For example, if we want to join on a “departments” attribute, k samples would be retrieved for each department.

Since we assume F is the primary key in T1, and therefore unique, the join any-k sampling problem can be reduced to finding any k valid records in table T2 for each join value of F. However, this is exactly the grouped any-k sampling problem in which the group values are the values of F. Thus, we can use the algorithm described in the previous section, with F as the grouping attribute on the foreign-key table T2.

In this way, NeedleTail is able to identify the blocks to retrieve so as to minimize the overall time for joins. We evaluate our join any-k algorithm in Section 9.7. We leave optimizations for other join variants as future work.

8 System Design

Figure 3: NeedleTail Architecture

We implemented our DensityMaps, any-k algorithms, and aggregate estimators in a system called NeedleTail. NeedleTail is developed as a standalone browsing-based data exploration engine, capable of returning individual records as well as estimating aggregates. NeedleTail can be invoked by various frontends, e.g., SQL IDEs or interfaces such as Tableau or Excel. Figure 3 depicts the overall architecture of NeedleTail. It includes four major components: the any-k module, the random sampling module, the index, and the disk access module. The any-k module receives queries from the user and executes our any-k algorithms from Sections 4, 5, and 7 to return any-k browsing samples as quickly as possible. For aggregate queries, the random sampling module is used in conjunction with the any-k module to perform the Two-Phase sampling from Section 6. The index contains the DensityMaps and sorted DensityMaps. Finally, the disk access module is in charge of interacting with the buffer pool to retrieve blocks from disk. Since DensityMaps are a lossy compression of the original bitmaps, it is possible that some blocks with no valid records may be returned; these blocks are filtered out by the disk access module.

Our NeedleTail prototype is currently implemented in C++ using about 5000 lines of code. It is capable of reading in row-oriented databases with int, float, and varchar types and supports Boolean-logic predicates. Although the current implementation is limited to a single machine, we plan to extend NeedleTail to run in a distributed environment in the future. We believe the collective memory space available in a distributed environment will allow us to leverage the DensityMaps in even better ways.

9 Performance Evaluation

In this section, we evaluate NeedleTail, focusing on runtime, memory consumption, and accuracy of estimates. We show that our DensityMap-based any-k algorithms outperform “first-to-k-samples” algorithms that use traditional OLAP indexing structures such as bitmaps or compressed bitmaps on a variety of synthetic and real datasets. In addition, we empirically demonstrate that our Two-Phase sampling scheme is capable of achieving aggregate estimates as accurate as random sampling in a fraction of the time. Then, we demonstrate that our join any-k algorithm provides substantial speedups for key-foreign key joins. We conclude the section with an exploration of the effects of different parameters on our any-k algorithms.

9.1 Experimental Settings

We now describe our experimental workload, the evaluated algorithms, and the experimental setup.

Synthetic Workload: We generated 10 clustered synthetic datasets using the data generation model described by Anh and Moffat [11]. Every synthetic dataset has 100 million records, 8 dimension attributes, and 2 measure attributes. For simplicity, we forced every dimension attribute to be binary (i.e., valid values were either 0 or 1), with the measure attributes sampled from normal distributions, independent of the dimension attributes. For each dimension attribute, we enforced an overall density of 10%; the number of 1’s for any attribute was 10% of the overall number of records. Since we randomly generated the clusters of 1’s in each attribute, we ran queries with equality predicates on the first two dimension attributes. Note that this does not always result in a selectivity of 10%, since a record that satisfies the first predicate may not satisfy the second.

Real Workload: We also used two real datasets.

  • Airline Dataset [1]: This dataset contained the details of all flights within the USA from 1987–2008, sorted based on time. It consisted of 123 million rows and 11 attributes with a total size of 11 GB. We ran 5 queries with 1 to 3 predicates on attributes such as origin airport, destination airport, flight-carrier, month, day of week. For our experiments on error (described later), we estimated the average arrival delay, average departure delay, and average elapsed time for flights.

  • NYC Taxi Dataset [5]: This dataset contained logs for a variety of taxi companies in New York City for the years 2014 and 2015. The dataset as provided was first sorted by the year; within each year, it was first sorted on the three taxi types and then on time. It consisted of 253 million rows and 11 attributes with a total size of 21 GB. We ran 5 queries with 1 to 2 predicates on attributes including pickup location, dropoff location, time slots, month, passenger count, vendors, and taxi type. For our experiments on error, we estimated the average fare amount and average distance traveled for the chosen trips.

Algorithms: We evaluated the performance of the four any-k algorithms presented in Sections 4 and 5: (i) IO-Optimal, (ii) Density-Optimal, (iii) Locality-Optimal, and (iv) Hybrid. We compared our algorithms against the following four “first-to-k-samples” baselines. Bitmap-Scan and Disk-Scan are representative of how current databases implement the LIMIT clause.

  • Bitmap-Scan: Assuming we have bitmaps for every predicate, we use bitwise AND and OR operations to construct a resultant bitmap corresponding to the valid records. We then retrieve the first k records whose bits are set in this bitmap.

  • Lossy-Bitmap [62]: Lossy-Bitmap is a variant of bitmap indexes where a bit is set for each block instead of each record. For each attribute value, a set bit for a block indicates that at least one record in that block has that attribute value. During data retrieval, we perform bitwise AND or OR operations on these bitmaps and then fetch records from the first few blocks which have their bit set. Note that this is equivalent to a DensityMap which rounds any nonzero density up to 1.

  • EWAH: This baseline is identical to Bitmap-Scan, except the bitmaps are compressed using the Enhanced Word-Aligned Hybrid (EWAH) technique [40], implemented using the library from [3].

  • Disk-Scan: Without using any index structures, we continuously scan the data on disk until we retrieve k valid records.

For our experiments on aggregate estimation, we compared our Two-Phase sampling algorithms against the baseline Bitmap-Random, which is similar to Bitmap-Scan, except that it selects random records among all the valid records. We describe our setup for the join any-k experiments in Section 9.7.

Figure 4: Query runtimes for the synthetic workload on a HDD.

Setup: All experiments were conducted on a 64-bit Linux server with 8 3.40GHz Intel Xeon E3-1240 4-core processors and 8GB of 1600 MHz DDR3 main memory. We tested our algorithms with a 7200rpm 1TB HDD and a 350GB SSD. For each experimental setting, we ran 5 trials (30 trials for the random sampling experiments) for each query on each dataset. In every trial, we measured the end-to-end runtime, the CPU time, the I/O time, and the memory consumption. Before each trial, we dropped the operating system page cache and filled it with dummy blocks to ensure the algorithms did not leverage any hidden benefits from the page cache. To minimize experimental variance, we discarded the trials with the maximum and minimum runtime and reported the average of the remaining. Finally, after empirically testing a few different block sizes, we found 256KB to be a good default block size for our datasets: the block size does not significantly impact the relative performance of the algorithms.

Figure 5: Query runtimes for airline workload on a HDD.
Figure 6: Query runtimes for taxi workload on a HDD.

9.2 Query Execution Time

Summary: On the synthetic datasets on a HDD, our Hybrid any-k sampling algorithm was on average 13× faster than the baselines. For the real datasets, Hybrid performed at least as well as the baselines for every query, and on average was 4× and 9× faster for queries on HDDs and SSDs respectively.

Figure 7: Query runtimes for airline workload on a SSD.
Figure 8: Query runtimes for taxi workload on a SSD.

Synthetic Experiments on a HDD. Figure 4 presents the runtimes for Hybrid, Density-Optimal, Locality-Optimal, and the four baselines for varying sampling rates. (We evaluate IO-Optimal later on.) The sampling rate is defined as the ratio of the number of requested records to the number of valid records. Since the number of valid records varies widely across queries, we plot the impact of varying the sampling rate rather than the absolute number of requested records (results for varying the absolute number are similar). The bars represent the average runtimes for five sampling rates over 10 synthetic datasets; note that the figure is in log scale.
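
For concreteness, writing k for the requested number of result records (the LIMIT parameter), the sampling rate plotted on the x-axis is

    \[ \text{sampling rate} \;=\; \frac{k}{\#\{\text{records satisfying the query's predicates}\}} \]

so, for example, a 1% sampling rate on a query with one million valid records corresponds to requesting 10,000 of them.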

Regardless of the sampling rate, Density-Optimal, Hybrid, and Locality-Optimal significantly outperformed Bitmap-Scan, Lossy-Bitmap, EWAH, and Disk-Scan, with speedups of an order of magnitude. For example, for a sampling rate of 1%, Density-Optimal, Locality-Optimal, and Hybrid took 74ms, 45ms, and 58ms on average respectively, while Bitmap-Scan, Lossy-Bitmap, EWAH, and Disk-Scan took 647ms, 624ms, 662ms and 630ms on average respectively. Thus, our any- algorithms are more effective at identifying the right sequence of blocks containing valid records than the baselines, which do not optimize for any-: the baselines are subject to the vicissitudes of random chance; if there is a large number of valid records early on, they will do well, and not otherwise. This is despite the fact that Bitmap-Scan and EWAH store more fine-grained information than our algorithms, and are therefore able to effectively skip over blocks without valid records.

There was no consistent winner between Density-Optimal and Locality-Optimal across sampling rates and queries, but Hybrid always selected the faster algorithm of the two and thus had an average speedup of 13× over the baselines. That said, Hybrid's performance at lower sampling rates (0.1%, 1%) was a bit worse than Density-Optimal and Locality-Optimal, since it had to run both algorithms and pick the better one; but this difference in performance was small (around 10ms). From 5% onwards, Hybrid's performance was clearly better than that of Density-Optimal and Locality-Optimal, since the increase in computation time was dwarfed by the improvement in I/O time.
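
The tradeoff described above is exactly what Hybrid's plan selection exploits. The sketch below is our own simplified reconstruction: it uses an illustrative two-constant seek/read cost model rather than the calibrated disk model from Section 5, and estimates the I/O cost of the block sequences produced by Density-Optimal and Locality-Optimal before running whichever is cheaper.

    # Simplified sketch of Hybrid's plan selection; SEEK_MS and READ_MS are
    # illustrative HDD constants, not NeedleTail's measured disk parameters.
    SEEK_MS = 10.0    # cost of a random seek
    READ_MS = 1.5     # cost of sequentially reading one block

    def estimated_io_cost(block_ids):
        """Charge a seek whenever the next block is not physically adjacent."""
        cost, prev = 0.0, None
        for b in sorted(block_ids):
            cost += READ_MS
            if prev is None or b != prev + 1:
                cost += SEEK_MS
            prev = b
        return cost

    def hybrid(density_optimal_plan, locality_optimal_plan, execute):
        plans = [density_optimal_plan, locality_optimal_plan]
        best = min(plans, key=estimated_io_cost)   # small extra CPU cost
        return execute(best)

On an SSD the seek penalty largely disappears, which is consistent with Hybrid always selecting Density-Optimal in the SSD experiments below.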

Real Data Experiments on a HDD. Figures 5 and 6 show the runtimes of our algorithms over 5 diverse queries for the airline and taxi workloads respectively. For each query and sampling rate, we normalized the runtime of each algorithm by the largest runtime across all algorithms, while also reporting the actual runtime (in ms) of Hybrid and the maximum runtime. We omitted Disk-Scan since it had the worst runtime in the previous experiment and similarly performed poorly here. For the real workloads, the runtimes of the queries were much more varied, so we report the average runtime for each query separately.

For the airline workload, our any- algorithms consistently outperformed the bitmap-based baselines: Density-Optimal had a speedup of up to 8× over Bitmap-Scan and EWAH, while Locality-Optimal had a speedup of up to 7×. Across all queries, at a 1% sampling rate, Density-Optimal and Locality-Optimal were on average 3× and 5× faster than Bitmap-Scan and EWAH, despite having a much smaller memory footprint (Section 9.3). For example, for Q3, which had two predicates on month and origin airport, the block with the highest density already contained the requested 1% sample. Moreover, since the airline dataset is naturally sorted on time attributes (e.g., year, month), the valid tuples were more likely to be clustered in a small number of blocks. Therefore, compared with Locality-Optimal, Density-Optimal fetched up to 10% fewer blocks, resulting in lower query execution time than Locality-Optimal in all cases. At the small additional cost of estimating the block sequences for both Locality-Optimal and Density-Optimal, Hybrid always selected the faster algorithm in both this and the taxi workload, with an average speedup of 4×. For example, for Q4 at a 1% sampling rate, Hybrid's runtime was close to that of Density-Optimal and half that of Locality-Optimal.

We noticed a different (and somewhat surprising) trend for the taxi workload. Here, Hybrid continued to do well, and much better than the worst algorithm in every setting, with an average speedup of 4× over the baselines. Similarly, Locality-Optimal performed similarly to or better than the baselines in every experiment. However, on multiple occasions Density-Optimal was slower than the baselines, and was the worst algorithm, e.g., in Q3 and Q5. Upon closer examination, we found that Density-Optimal did in fact retrieve the fewest blocks for every query. However, the taxi dataset was much larger than the airline dataset, so the blocks were more spread out, and the time to seek from block to block went up significantly. As a result, the locality-favoring Locality-Optimal performed better on a HDD, where seeks are expensive. To further exacerbate the issue, the taxi workload also had a much more uniform distribution of tuples: the tuples that satisfied the query predicates (which were not based on taxi type) were spread fairly uniformly across the dataset. In some sense, this made the dataset “adversarial” for density-based schemes. In other words, it is hard to conclude that either Density-Optimal or Locality-Optimal is better than the other, given that their performance depends on the distribution of valid tuples for a given ad-hoc query; it is therefore safer to use Hybrid to pick between the two.

Real Data Experiments on a SSD. We also ran the same workloads on a SSD; SSDs have random I/O performance comparable to their sequential I/O performance. The results are depicted in Figures 7 and 8. We omit Hybrid, since on a SSD it always selects Density-Optimal over Locality-Optimal, as Density-Optimal fetches the smallest number of blocks. Overall, Density-Optimal was much faster than the bitmap-based baselines, with average speedups of 14× and 6× in the airline and taxi workloads respectively. There were two exceptions: Q1 (10%) in airline and Q3 in taxi, where the total numbers of blocks fetched by Density-Optimal, Bitmap-Scan, Lossy-Bitmap, and EWAH were similar. In this uncommon situation, even though Density-Optimal had the lowest I/O time, the CPU cost of checking for valid records in each block was slightly higher, so its runtime was a little higher than that of Bitmap-Scan and EWAH.

9.3 Memory Consumption

Summary: DensityMaps consumed on average 48× less memory than the regular bitmaps and 23× less memory than EWAH-compressed bitmaps.

Dataset Disk Usage # Tuples Cardinality Bitmap EWAH LossyBitmap DensityMap
Synthetic 7.5 GB 100M 16 190.73MB 182.74MB 0.06MB 3.73MB
Taxi 21 GB 253M 64 1936.99MB 663.63MB 0.65MB 41.63MB
Airline 11GB 123M 805 11852.33MB 744.05MB 3.98MB 254.72MB
Table 2: Memory consumption of index structures.

Table 2 reports the amount of memory used by DensityMaps compared to the other three bitmap baselines. We observed that DensityMaps were very lightweight and consumed around 51×, 47×, and 47× less memory than uncompressed bitmaps in the three datasets respectively. Even with EWAH compression, DensityMaps were still substantially smaller (for the taxi dataset, 41.63MB versus 663.63MB, a roughly 16× reduction). In the airline dataset, since the selectivity of each attribute value is low, EWAH compressed the bitmaps much better than in the other two datasets; still, EWAH consumed 3× more memory than DensityMap. Lastly, since Lossy-Bitmap requires only one bit per block while DensityMap stores a 64-bit double per block, Lossy-Bitmap unsurprisingly consumed less memory than DensityMap. However, as we showed in Section 9.2, this smaller memory consumption came at a large cost in query latency due to the large number of false positives (e.g., Q3 with a 10% sampling rate in Figure 5), especially when the number of predicates is large and the predicates exhibit complex correlations. In comparison, the DensityMap-based any- algorithms were orders of magnitude faster than the baselines, while still maintaining a modest memory footprint relative to the original dataset.
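
The numbers in Table 2 can be approximately reproduced from the dataset statistics alone, since an uncompressed bitmap costs one bit per tuple per attribute value while a DensityMap costs one 64-bit double per block per attribute value. The sketch below illustrates this back-of-the-envelope calculation (256KB blocks, as in our setup); it is a simplification that ignores any per-index metadata.

    # Back-of-the-envelope index sizes (in MiB); close to Table 2.
    BLOCK_SIZE = 256 * 1024          # bytes, the default block size we used

    def bitmap_mib(num_tuples, cardinality):
        return num_tuples * cardinality / 8 / 2**20      # 1 bit/tuple/value

    def density_map_mib(data_bytes, cardinality):
        num_blocks = data_bytes // BLOCK_SIZE
        return num_blocks * cardinality * 8 / 2**20      # 1 double/block/value

    # Synthetic dataset: 7.5 GB, 100M tuples, cardinality 16
    print(bitmap_mib(100_000_000, 16))               # ~190.7 MiB (Table 2: 190.73)
    print(density_map_mib(int(7.5 * 2**30), 16))     # ~3.75 MiB (Table 2: 3.73)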

9.4 IO-Optimal Performance

Summary: IO-Optimal had up to 3.9× faster I/O time than Hybrid and the best I/O performance among all the algorithms described above. However, its large computational cost made it impractical for real datasets.

For the evaluation of IO-Optimal, we used a smaller synthetic dataset of 1 million records and a block size of 4KB, and conducted the evaluation on a HDD. We compared its overall end-to-end runtime, CPU time, and I/O time with those of every other algorithm, and found that it consistently had the best I/O times. However, the computational cost of the dynamic programming in IO-Optimal outweighed any benefits from the shorter I/O time. Consequently, we found IO-Optimal to be impractical for larger datasets. Figure 9 shows both the overall times and I/O times for IO-Optimal and Hybrid for varying sampling rates.

Figure 9: Overall and I/O time for IO-Optimal and Hybrid.

9.5 Time vs Error Analysis

Summary: Compared to random sampling using bitmap indexes, our Two-Phase sampling schemes that mix samples from any- sampling algorithms with a small percentage of random cluster samples attained the same error rate in much less time.

Figure 10: Time vs. empirical error. (a) NYC-Taxi, ratio estimator; (b) Airline, ratio estimator; (c) NYC-Taxi, HT estimator; (d) Airline, HT estimator.

Using the Two-Phase sampling techniques from Section 6, we can obtain estimates of aggregate values on the data; here we mix in a small fraction of random cluster samples and use the Density-Optimal algorithm, since it performed the most consistently well across queries and workloads, on both SSDs and HDDs. We compared these results with pure random sampling (Bitmap-Random) using bitmaps on a HDD. We used the same set of queries as in Section 9.2. For each query, we varied the sampling rate and measured the runtime and the empirical error of the estimated aggregate with respect to the true average value. Figure 10 depicts the average results for both the Horvitz-Thompson estimator and the ratio estimator, in log scale. We start with the taxi dataset and the ratio estimator. Figure 10(a) shows that if all the sampling schemes are allowed to run for 500ms (commonly regarded as the threshold for interactivity), Density-Optimal, the two Two-Phase sampling configurations, and Bitmap-Random have average empirical error rates of 29.64%, 4.83%, 3.66%, and 19.64% respectively; the corresponding numbers of samples retrieved are 11102, 7977, 5684, and 35 respectively. Thus, the Two-Phase sampling schemes are able to effectively correct the bias in Density-Optimal, while still retrieving a comparable number of samples. Furthermore, note that Bitmap-Random suffers from the same large memory consumption problem as Bitmap-Scan. In contrast, even though Density-Optimal was not the fastest algorithm in the taxi workload, our Two-Phase sampling algorithms perform cluster sampling at the block level and only need access to the much more compressed DensityMaps.

The behavior on the airline workload is somewhat different: here, Density-Optimal performs better than the Two-Phase sampling schemes with the ratio estimator for an initial period of about 100ms, after which the Two-Phase sampling schemes perform better than both Density-Optimal and Bitmap-Random. We found this behavior repeated across other queries and trials: Density-Optimal sometimes ends up having very low error (as in Figure 10(b)) and sometimes fairly high error (as in Figure 10(a)), but the Two-Phase sampling schemes consistently achieve low error relative to Density-Optimal. This is because Density-Optimal's accuracy is highly dependent on the correlation between the data layout and the attribute of interest, and it can sometimes produce highly biased results. At the same time, the Two-Phase sampling schemes return many more samples and much more accurate estimates than Bitmap-Random, effectively supporting browsing and sampling at the same time.

Between the Horvitz-Thompson estimator and the ratio estimator, the ratio estimator often had higher accuracy. As explained in Section 6, the ratio estimator works well in situations where the aggregate being estimated is not correlated with the block densities. We found this to be the case for both the airline and taxi workloads, so the ratio estimator helped in both.
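
For reference, the sketch below spells out the two estimators on block-level (cluster) samples using the textbook survey-sampling formulas; the per-block quantities and inclusion probabilities are placeholders, and the actual inclusion probabilities for Two-Phase sampling are derived in Section 6.

    # Sketch of the two aggregate estimators over block-level (cluster) samples.
    # For each sampled block i we assume we know:
    #   y[i]  - sum of the measure attribute over the valid records in the block
    #   m[i]  - number of valid records in the block
    #   pi[i] - inclusion probability of the block under the sampling scheme

    def horvitz_thompson_mean(y, pi, total_valid_records):
        """HT estimate of the mean: HT estimate of the population total,
        divided by the (known or separately estimated) number of valid records."""
        total_hat = sum(yi / p for yi, p in zip(y, pi))
        return total_hat / total_valid_records

    def ratio_mean(y, m, pi):
        """Ratio estimate of the mean: HT total of the measure divided by the
        HT total of the valid-record counts."""
        numerator = sum(yi / p for yi, p in zip(y, pi))
        denominator = sum(mi / p for mi, p in zip(m, pi))
        return numerator / denominator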

9.6 Effect of Parameters

To explore the properties of our any- algorithms, we varied several parameters and noted their effect on the overall runtimes for synthetic workloads. The varied parameters were: (i) data size, (ii) number of predicates, (iii) density, (iv) block size, and (v) granularity.

Data Size: We varied the synthetic dataset size from 1 million to 1 billion records, but found that the overall runtimes of our any- algorithms remained roughly the same. Our algorithms return only a fixed number of samples and explicitly avoid reading the entire dataset, so it makes sense that the runtimes stay consistent even as the data size increases.

Number of Predicates: As we increased the number of predicates in a query, we saw that overall runtimes increase as well. Since our predicates were combined using ANDs, an increase in the number of predicates meant a decrease in the number of valid records per block. Therefore, both Density-Optimal and Locality-Optimal needed to fetch more blocks to retrieve the same number of samples, and this caused an increase in the overall runtime.

Density: As we increased the overall density of valid records in the dataset, the runtimes for our any- algorithms decreased. As the overall density increased, the average density per block also increased, so our any- algorithms needed to retrieve fewer blocks to obtain the same number of samples.

Block Size: We varied the block sizes of our datasets across 4KB, 256KB, 1MB, and 2MB. As we decreased the block size, the runtimes for Density-Optimal increased drastically, because smaller block sizes meant more random I/O. However, we did not see any definite correlation as we increased the block size. Although larger block sizes bias the algorithms toward more locality, they also mean that density information is collected at a coarser granularity. We suspect this tradeoff prevented us from seeing any improvement in performance with increased block size.

9.7 Key-Foreign Key Join Performance

Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 2465.67 215.678 67.6398 63.7114 73.7204
0.5% 2632.32 1049.92 307.534 300.023 301.193
1.0% 2761.74 1424.12 616.090 593.258 591.201
Table 3: Query runtimes (in ms) on SSD for join operations on foreign key tables with 10 million rows.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 1266.06 110.63 33.59 32.38 68.16
0.5% 1359.80 548.88 154.34 159.45 153.65
1.0% 1418.73 737.64 308.00 299.78 307.44
Table 4: Query runtimes (in ms) on SSD for join operations on foreign key tables with 5 million rows.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 12415.26 1114.80 417.18 347.93 314.27
0.5% 13019.96 5193.80 2068.75 1708.97 1476.64
1.0% 13651.02 7143.95 4167.87 3478.81 2971.74
Table 5: Query runtimes (in ms) on SSD for join operations on foreign key tables with 50 million rows.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 1930.30 134.84 47.62 46.94 95.57
0.5% 2069.74 628.99 223.42 228.43 225.59
1.0% 2197.38 904.28 453.56 452.34 447.00
Table 6: Query runtimes (in ms) on SSD for join operations on a foreign key table with 5 attributes.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 7223.28 926.99 278.35 217.47 165.21
0.5% 7348.53 4460.22 1422.41 1108.15 886.90
1.0% 7573.63 5758.62 2859.31 2220.21 1770.48
Table 7: Query runtimes (in ms) on SSD for join operations on a foreign key table with 50 attributes.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 2220.28 226.89 62.43 65.48 74.75
0.5% 2389.59 1110.59 312.64 303.34 304.42
1.0% 2509.94 1478.30 625.06 605.06 606.28
Table 8: Query runtimes (in ms) on SSD for join operations on a primary key table with 5 attributes.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 5177.96 245.09 70.56 75.44 85.31
0.5% 5453.41 1172.76 361.24 350.30 351.10
1.0% 5692.40 1618.83 722.81 696.12 698.50
Table 9: Query runtimes (in ms) on SSD for join operations on a primary key table with 50 attributes.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 2528.40 68.33 60.85 58.91 68.16
0.5% 2699.45 326.55 280.79 278.06 279.73
1.0% 2843.45 651.10 558.19 547.48 559.85
Table 10: Query runtimes (in ms) on SSD for join operations on 5 unique join attribute values.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 2445.25 1250.18 2148.35 1845.65 1603.79
0.5% 2597.23 1513.81 2304.90 2008.87 1759.71
1.0% 2757.08 1704.73 2421.01 2116.32 1873.40
Table 11: Query runtimes (in ms) on SSD for join operations on 50 unique join attribute values.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 1527 1527 1284 1284 1284
0.5% 1527 1527 1284 1284 1284
1.0% 1527 1527 1284 1284 1284
Table 12: Number of blocks fetched for join operations on 50 unique join attribute values.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 62154.22 142.40 56.50 54.40 78.98
0.5% 63513.28 664.69 273.63 259.14 252.80
1.0% 64099.30 1308.80 542.67 522.12 510.73
Table 13: Query runtimes (in ms) on SSD for join operations on a foreign key with Zipf distribution parameter of 1.5.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 1673.12 1077.73 1783.76 1706.39 1647.18
0.5% 1704.48 1120.75 1823.00 1742.84 1711.86
1.0% 1742.57 1186.66 1841.09 1734.92 1721.47
Table 14: Query runtimes (in ms) on SSD for join operations on a foreign key with Zipf distribution parameter of 5.
Sampling Rate Shared-Scan Bitmap-Combined Density-Combined (=5) Density-Combined (=10) Density-Combined (=50)
0.1% 1526 1526 1525 1525 1525
0.5% 1527 1527 1525 1525 1525
1.0% 1527 1527 1525 1525 1525
Table 15: Number of blocks fetched for join operations on a foreign key with Zipf distribution parameter of 5.

Summary: Our iterative join any- algorithm has an average speedup of 3× compared to existing baselines not optimized for any-.

We now evaluate the extension of our any- algorithms to key-foreign key joins from Section 7. We compare the performance of our join algorithm against two baselines: (1) Shared-Scan, a single scan of the foreign key table, shared across the different join attribute values, with no indexes; and (2) Bitmap-Combined, a single scan of the foreign key table, shared across the different join attribute values, that uses bitmap indexes to skip to the next valid record that can serve as a sample. In both cases, the algorithms terminate as soon as samples for each join value are found, and a hash join is used to combine the foreign key records with the primary key records, with the hash table constructed on the primary key table. In Bitmap-Combined, the bitmaps for the different join values were first combined using OR; then, once a join attribute value had reached samples, its bitmap was subtracted from the combined bitmap.

For our join any- algorithm, we ran the iterative algorithm presented in Section 7 and used Density-Optimal as our any- algorithm in each iteration. We varied the number of blocks retrieved per iteration () before the updates to the combined densities, and evaluated its impact on the overall runtime. As with Shared-Scan and Bitmap-Combined, we used a hash join with the hash table constructed on the primary key table.
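
To summarize the control flow described above, the sketch below is our own simplified reconstruction of the iterative loop from Section 7; the helper functions combine_density_maps, density_optimal, scan_block_for, and hash_join are illustrative placeholders, not NeedleTail's actual interfaces.

    def density_combined(join_values, k, b, combine_density_maps,
                         density_optimal, scan_block_for, hash_join):
        """Fetch b blocks per iteration until every join attribute value has k
        foreign-key samples, then join them against the primary key table."""
        remaining = set(join_values)
        samples = {v: [] for v in join_values}
        fetched = set()
        while remaining:
            # Re-combine the DensityMaps of the still-unsatisfied join values.
            combined = combine_density_maps(remaining, exclude=fetched)
            blocks = density_optimal(combined, b)   # next b most promising blocks
            if not blocks:
                break                               # no candidate blocks remain
            for block_id in blocks:
                fetched.add(block_id)
                for value, record in scan_block_for(block_id, remaining):
                    if len(samples[value]) < k:
                        samples[value].append(record)
            remaining = {v for v in remaining if len(samples[v]) < k}
        return hash_join(samples)                   # probe the primary-key hash table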

All experiments were run on a synthetic dataset using a SSD. The synthetic dataset had two tables: one for the primary key and one for the foreign key. All attributes in both tables were of int type. Each row of the primary key table had its row index as its unique primary key value, and the foreign key table's foreign key attribute values were generated using a Zipf distribution. Note that this means some foreign keys did not match any primary key. In addition, we varied the following parameters: (1) the number of rows in the foreign key table, (2) the number of attributes in the foreign key table, (3) the number of attributes in the primary key table, (4) the number of unique values for the join attribute in the primary key table (and thereby the number of rows in the primary key table), and (5) the Zipf distribution parameter. Each experimental setup was run 5 times, and the mean runtimes are reported in this paper. The standard deviation of the runtimes was less than 1% for all experiments, so it is not reported.
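
A generator along the following lines reproduces the synthetic join tables described above. This is a sketch using numpy; the attribute-value range and the random seed are our own illustrative choices, while the key structure and Zipf-distributed foreign keys follow the description in the text.

    # Sketch of the synthetic join-data generator.
    import numpy as np

    def generate_join_tables(num_fk_rows, num_fk_attrs, num_pk_attrs,
                             num_join_values, zipf_param, seed=0):
        rng = np.random.default_rng(seed)
        # Primary key table: row i has i as its unique primary key value.
        pk_keys = np.arange(1, num_join_values + 1)
        pk_table = np.column_stack(
            [pk_keys] + [rng.integers(0, 100, num_join_values)
                         for _ in range(num_pk_attrs - 1)])
        # Foreign key table: keys drawn from a Zipf distribution, so some drawn
        # values exceed num_join_values and therefore match no primary key.
        fk_keys = rng.zipf(zipf_param, num_fk_rows)
        fk_table = np.column_stack(
            [fk_keys] + [rng.integers(0, 100, num_fk_rows)
                         for _ in range(num_fk_attrs - 1)])
        return pk_table, fk_table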

Table 3 shows the overall runtimes for different sampling rates with the following parameters: (1) 10 million rows in the foreign key table, (2) 10 foreign key table attributes, (3) 10 primary key table attributes, (4) 10 unique join attribute values, and (5) 2 as the Zipf distribution parameter. The lowest runtimes for each sampling rate are highlighted in bold, and the speedups relative to Bitmap-Combined (the better of the two baselines) are indicated in parentheses. As shown, Density-Combined was the fastest algorithm for each sampling rate, with a 3× speedup over Bitmap-Combined and an order-of-magnitude improvement over Shared-Scan. This was largely because Density-Combined retrieved far fewer blocks than either Bitmap-Combined or Shared-Scan. For a sampling rate of 0.05%, Density-Combined (=10) retrieved only 190 blocks, while Bitmap-Combined retrieved 1259 and Shared-Scan retrieved 1527. Since we used a SSD, this always resulted in a lower runtime. Note that had these experiments been run on a HDD, we would have used Hybrid as our any- algorithm.

We found that varying the number of blocks retrieved per iteration had a rather minimal impact on the overall runtime for values within a certain range (5 to 50 in this case). Outside of this range, it had a larger impact on performance, but the overall runtime was still lower than that of either Bitmap-Combined or Shared-Scan.

With the experimental setup used for Table 3, we varied each of the 5 parameters mentioned before one at a time to see their effect on overall runtime performance. Other than the varied parameter for each experiment, the other parameters were set to be the same as those used in the experiment for Table 3.

(1) Rows in Foreign Key Table. First, we wanted to see whether our iterative join any- algorithm could scale to different dataset sizes. Since the size of the primary key table is fixed by the number of unique join attribute values, we first focused on varying the number of rows in the foreign key table. Tables 4 and 5 show the results for 5 million and 50 million rows in the foreign key table respectively. As we can see, Density-Combined still provided a speedup of 2-3× over Bitmap-Combined, suggesting that our join any- algorithm scales. The reasons for the speedup were the same as for Table 3: Density-Combined retrieved far fewer blocks than Bitmap-Combined.

(2) Number of Attributes in Foreign Key Table. Given the row-oriented layout of our data, a change in the number of attributes in the foreign key table meant a change in the number of records per block for that table. This parameter let us observe how Density-Combined adapts to different numbers of records per block. Furthermore, the number of attributes also affected the size of the dataset, so this experiment served as an additional check on how well Density-Combined scales with size. Tables 6 and 7 show the results for 5 and 50 attributes in the foreign key table respectively. Density-Combined remained faster than either Shared-Scan or Bitmap-Combined, and the speedup became more pronounced as the number of attributes increased. This was because Density-Combined was more selective about the blocks it chose to retrieve: when there were fewer records per block, the choice of blocks had a larger impact on the number of blocks fetched, making Density-Combined better suited to this case than either Bitmap-Combined or Shared-Scan.

(3) Number of Attributes in Primary Key Table. We hypothesized that varying the number of attributes in the primary key table would have minimal effect on the overall runtime; the only impact this parameter should have is on the time it takes to copy the primary key record into the output. Tables 8 and 9 show the results for 5 and 50 attributes in the primary key table respectively. As expected, this parameter had minimal impact on the overall performance of the algorithms, with runtimes extremely similar to those in Table 3. Given the magnitude of the differences, we believe the runtime discrepancies between Tables 3 and 8 were due to experimental noise.

(4) Number of Unique Join Attribute Values. We wanted to see how Density-Combined would perform with different numbers of join values, so we varied the number of unique join attribute values, thereby also increasing the number of rows in the primary key table. Tables 10 and 11 show the results for 5 and 50 unique join attribute values respectively. As expected, Density-Combined outperformed Bitmap-Combined for 5 unique join values. Interestingly, Bitmap-Combined was faster than Density-Combined for 50 unique join values. Upon closer examination, we found that although Bitmap-Combined was faster, Density-Combined still retrieved fewer blocks, as shown in Table 12. However, compared to the other experiments, Density-Combined retrieved a larger number of blocks relative to Bitmap-Combined. Due to the Zipf distribution of the foreign key values, records with a foreign key value greater than 10 were scarce and more spread out among the blocks, so a greater number of blocks had to be fetched to satisfy the join any- query (regardless of the algorithm used). Since we ran all these experiments with Density-Optimal as the underlying any- algorithm, we believe that Density-Combined's lack of locality awareness caused the greater overall runtime.

(5) Zipf Distribution Parameter. The final parameter of interest was the Zipf distribution parameter used to generate the foreign key attribute values. Tables 13 and 14 show the results for Zipf distribution parameters of 1.5 and 5 respectively. Density-Combined outperformed Shared-Scan and Bitmap-Combined as usual for a Zipf parameter of 1.5, but Bitmap-Combined once again outperformed Density-Combined for a Zipf parameter of 5. A larger Zipf parameter concentrates the foreign keys more heavily around the lower values, making the higher values scarcer, so we saw behavior similar to the experiment with 50 unique join attribute values. We can see this in Table 15: Density-Combined still retrieved the fewest blocks, but only 1 or 2 fewer than Bitmap-Combined and Shared-Scan. When retrieving around the same number of blocks, the locality-unaware Density-Combined expectedly performed worse than Bitmap-Combined.

Overall. These experiments suggest that the iterative join any- algorithm is most effective when a smaller number of blocks needs to be fetched. Fortunately, most browsing use cases fit into this category (e.g., a user is unlikely to want 10 samples for each of 50 unique join values), so NeedleTail ends up being a good fit for browsing. Nevertheless, we are working on making Density-Combined more locality-aware so that it can handle the remaining situations as well.

10 Related Work

Prior work related to N