Optimally Leveraging Density and Locality to Support LIMIT Queries
Abstract
Existing database systems are not optimized for queries with a LIMIT clause, operating instead in an all-or-nothing manner. In this paper, we propose a fast LIMIT query evaluation engine, called NeedleTail, aimed at letting analysts browse a small sample of the query results on large datasets as quickly as possible, independent of the overall size of the result set. NeedleTail introduces density maps, a lightweight in-memory indexing structure, and a set of efficient algorithms (with desirable theoretical guarantees) to quickly locate promising blocks, trading off locality and density. In settings where the samples are used to compute aggregates, we extend techniques from survey sampling to mitigate the bias in our samples. Our experimental results demonstrate that NeedleTail returns results 4× faster on HDDs and 9× faster on SSDs on average, while occupying up to 23× less memory than existing techniques.
1 Introduction
Users of databases are frequently interested in retrieving only a small subset of records that satisfy a query, e.g., by specifying a LIMIT clause in a SQL query. Although there has been extensive work on top-k queries that retrieve the largest or smallest results in a result set, or on techniques that provide a random sample of results, there has been relatively little work on so-called any-k queries that simply retrieve k results without any requirements on the ordering or randomness of the results.
Any-k has many applications in exploratory data analysis. When exploring new or unfamiliar datasets, users often issue arbitrary queries, examine a subset or sample (a "screenful") of records, and incrementally refine and reissue their queries based on observations from this sample [59, 29]. To support this form of exploratory browsing, many popular SQL IDEs implicitly limit the number of result records displayed, recognizing that users often do not need to, nor have the time to, see all of the result records. phpMyAdmin, for example, imposes such a limit by default.
LIMIT clauses provide a way to express any-k in traditional databases, and are supported by most modern database systems [52]. Surprisingly, despite their fundamental importance, there has been little work on executing LIMIT queries efficiently. Existing databases support LIMIT/any-k by simply executing the entire query and returning the first k result records as soon as they are ready, via pipelining. This is often not interactive on large datasets, especially if the query involves selective predicates. To address this limitation, in this paper, we develop methods that quickly identify a subset of records for arbitrary any-k queries.
Specifically, we develop indexing and query evaluation techniques to address the any-k problem, embodied in our new data exploration engine, NeedleTail. We now describe these contributions and the underlying challenges in more detail.
(i) Density Maps: A Lightweight Indexing Scheme. Inspired by bitmap indexes, which are effective for arbitrary read-only queries but somewhat expensive to store, we develop a lightweight indexing scheme called density maps. Like bitmaps, we store a density map for each value of each attribute. However, unlike bitmaps, which store a column of bits for each distinct value of an attribute, density maps store an array of densities for each block of records, where the array contains one entry per distinct value of the attribute. This representation allows density maps to be 3-4 orders of magnitude smaller than bitmaps, so we can easily store a density map for every attribute in the dataset, and these can comfortably fit in memory even on large datasets. Note that prior work has identified other lightweight statistics that help rule out blocks that are not relevant for certain queries [62, 43, 38, 15, 24]: we compare against the appropriate ones in our experiments and argue why they are unsuitable for any-k in Section 10.
(ii, iii) Density vs. Locality. In order to use density maps to retrieve records that satisfy query constraints, one approach would be to identify blocks that are likely to be "dense", in that they contain more records that satisfy the conditions, and preferentially retrieve those blocks. However, this density-based approach may lead to excessive random access. Because random accesses, whether in memory, on flash, or on rotating disks, are generally orders of magnitude slower than sequential accesses, this is not desirable. Another approach would be to identify a sequence of consecutive blocks with records that satisfy the conditions, taking advantage of the fact that sequential access is faster than random access, thereby exploiting locality. In this paper, we develop algorithms that are optimal from the perspective of density (called DensityOptimal) and from the perspective of locality (called LocalityOptimal).
Overall, while we would like to optimize for both density and locality, optimizing for one usually comes at the cost of the other, so developing the globally optimal retrieval strategy is nontrivial. To combine the benefits of density and locality, we need a cost model for the storage media that can help us reason about their relative benefits. We develop a simple cost model in this paper, and use it to develop an algorithm that is optimal from the perspective of overall I/O (called IOOptimal). We further extend the density- and locality-optimal algorithms to develop a hybrid algorithm (called TwoPhase) that fuses the benefits of both approaches. We integrated all four of these algorithms, coupled with the indexing structures, into our NeedleTail data exploration engine. On both synthetic and real datasets, we observe that NeedleTail can be several orders of magnitude faster than existing approaches at returning records that satisfy user conditions.
(iv) Aggregate Estimation with Any-k Results. In some cases, instead of just optimizing for any-k retrieval, it may be important to use the retrieved results to estimate some aggregate value. Although NeedleTail can retrieve more samples in the same time, the estimate of the aggregate may be biased, since NeedleTail may preferentially sample from certain blocks. This is especially true if the attribute being aggregated is correlated with the layout of data on disk or in memory. We employ survey sampling [31, 42] techniques to support accurate aggregate estimation while retrieving any-k records. We adapt cluster sampling techniques to reason about block-level sampling, along with unequal probability estimation techniques to correct for the bias. With these changes, NeedleTail is able to achieve error rates similar to pure random sampling, our gold standard, in much less time, while returning multiple orders of magnitude more records for the analyst to browse. Thus, even when computing aggregates, NeedleTail is substantially better than other schemes.
Of course, there are other techniques for improving analytical query response time, including in-memory caching, materialized views, precomputation, and materialized samples. These techniques can also be applied to any-k, and are largely orthogonal to our work. One advantage of our density-map approach over much of this related work is that it does not assume a query workload is available or that queries are predictable, which enables us to support truly exploratory data analysis. Similarly, traditional indexing structures, such as B+ trees, could efficiently answer some any-k queries. However, to support any-k queries with arbitrary predicates, we would need B+ trees on every single attribute or combination of attributes, which is often prohibitive in terms of space. Bitmap indexes are a more space-efficient approach, but even so, storing a bitmap in memory for every single value of every single attribute (tens of values for hundreds of attributes) is infeasible for large datasets, as we show in our experimental analysis.
Contributions and Outline. The chief contribution of this paper is the design and development of NeedleTail, an efficient data exploration engine for both browsing and aggregate estimation that retrieves samples orders of magnitude faster than other approaches. NeedleTail's design includes its density map indexing structure, retrieval algorithms (DensityOptimal, LocalityOptimal, TwoPhase, and IOOptimal) with extensions for complex queries, and statistical techniques to correct for biased sampling in aggregate estimation.
We formalize the browsing problem in Section 2, describe the indexing structures in Section 3, the any-k sampling algorithms in Sections 4 and 5, the statistical debiasing techniques in Section 6, extensions to complex queries in Section 7, and the system architecture in Section 8. We evaluate NeedleTail in Section 9.
2 Problem Formulation
We now formally define the any-k problem. We consider a standard OLAP data exploration setting, where we have a database with a star schema consisting of continuous measure attributes and categorical dimension attributes. For simplicity, we focus on a single database table T, with d dimension attributes and m measure attributes, leading to the schema T(A_1, ..., A_d, M_1, ..., M_m); our techniques generalize beyond this case, as we show in later sections. We use |A_i| to denote the number of distinct values of the dimension attribute A_i.
Consider a selection query Q on T, where the selection condition C is a boolean formula formed out of equality predicates on the dimension attributes. We define the set of records which form the result set of query Q to be the valid records with respect to Q. As a concrete example, consider a data analyst exploring campaign finance data. Suppose they want to find any k individuals who donated to Donald Trump, live in a certain county, and are married. Here, the query Q on T has a selection condition that is a conjunction of three predicates: donated to Trump, lives in a particular county, and is married.
Given the query Q, traditional databases would return all the valid records for Q in T, irrespective of how long that takes. Instead, we define an any-k query Q_k as the query which returns any k valid records out of the set of all valid records for a given query Q. Q_k can be written as follows:

    SELECT * FROM T WHERE C LIMIT k
For now we consider simple selection queries of the above form; we show how to extend our approach to support aggregates in Section 6 and how to support grouping and joins in Section 7.
We formally state the any-k sampling problem for simple selection queries as follows:
Problem 1 (Any-k sampling)
Given an any-k query Q_k, the goal of any-k sampling is to retrieve any k valid records in as little time as possible.
Note that unlike random sampling, any-k does not require the returned records to be randomly selected. Instead, any-k sampling prioritizes query execution time over randomness. We revisit the issue of randomness in Section 6. Next, we develop the indexing structures required to support any-k algorithms.
3 Index Structure
To support the fast retrieval of any samples, we develop a lightweight indexing structure called the DensityMap. DensityMaps share some similarities with bitmaps, so we first briefly describe bitmap indexes. We then discuss how DensityMaps address the shortcomings of bitmap indexes.
3.1 Bitmap Index: Background
Bitmap indexes [46] are commonly used for ad-hoc queries in read-mostly workloads [17, 64, 50, 65]. Typically, the index contains one bitmap for each distinct value of each dimension attribute in a table. Each bitmap is a vector of bits in which the i-th bit is set to 1 if the i-th record takes on the corresponding attribute value, and 0 otherwise. If a query has an equality predicate on only one attribute value, we can simply look at the corresponding bitmap for that attribute value and return the records whose bits are set in the bitmap. For queries that have more than one predicate, or range predicates, we must perform bitwise AND or OR operations on those bitmaps before fetching the valid records. Bitwise operations can be executed rapidly, particularly when bitmaps fit in memory.
Although bitmap indexes have proven to be effective for traditional OLAP-style workloads, these workloads typically consist of queries in which the user expects to receive all valid records that match the filter. Nevertheless, bitmap indexes can be used for any-k sampling. One simple strategy would be to perform the bitwise operations across all predicated bitmaps, scan the resulting bitmap, and return the first k records with matching bits. However, the efficiency of this strategy depends greatly on the layout of the valid records. For example, if all valid records are clustered near the end of the dataset, the system would have to scan the entire bitmap before finding k set bits. Furthermore, returning the first k matching records may be suboptimal if those records are dispersed across the dataset, since retrieving each record would incur a random access. If a different set of k records existed later in the dataset, but with better locality, a preferred strategy might be to return that second set of records instead.
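The scan-first-k bitmap strategy described above can be sketched as follows; this is our own minimal illustration (using Python arbitrary-precision integers as bitsets), not code from any existing bitmap library:

```python
def bitmap_any_k(bitmaps, k):
    """Return positions of the first k records whose bits are set in every bitmap.

    bitmaps: one int per predicate; bit i represents record i.
    """
    combined = bitmaps[0]
    for bm in bitmaps[1:]:
        combined &= bm  # conjunctive query: bitwise AND of all predicate bitmaps
    out = []
    while combined and len(out) < k:
        lowest = combined & -combined        # isolate the lowest set bit
        out.append(lowest.bit_length() - 1)  # its record position
        combined ^= lowest                   # clear it and continue scanning
    return out
```

Note how the returned positions are simply the k lowest-numbered matches, which may be scattered across the dataset, illustrating the locality problem discussed above.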
In addition to these limitations for any-k sampling, bitmap indexes often take up a large amount of space, since we need to store one bitmap per distinct value of each dimension. As the number of attribute values and dimension attributes increases, a greater number of bitmaps is required. Even with various methods to compress bitmaps, such as BBC [12], WAH [67], PLWAH [22], and EWAH [40], density maps consume orders of magnitude less space than bitmap indexes, as we show in Section 9.
3.2 Density Map Index
We now describe how DensityMaps address the shortcomings of bitmap indexes. Our design starts from the observation that modern hard disk drives (HDDs) typically have 4KB minimum storage units called sectors, and systems may only read or write from HDDs in whole sectors. Therefore, it takes the same amount of time to retrieve a whole block of data as it does a single record in that block. DensityMaps take advantage of this fact to reason about the data at the block level, rather than at the record level as bitmaps do. SSDs and RAM behave similarly, accessing whole pages or cache lines of data at a time.
Thus, for each block, a DensityMap stores the frequency of set bits in that block, termed the density, rather than enumerating the set bits. This enables the system to "skip ahead" to dense blocks to retrieve the any-k samples. Further, by storing block-level statistics rather than record-level statistics, DensityMaps greatly reduce the indexing space required compared to bitmaps. In fact, DensityMaps can be thought of as a form of lossy compression. Overall, by storing density information at the block level, we benefit from smaller size and more relevant information tailored to the any-k sampling problem.
Formally, for each attribute A_i, and each attribute value v_j taken on by A_i, we store a DensityMap D_ij, consisting of n entries, one corresponding to each block on disk. We can express D_ij as [d_1, d_2, ..., d_n], where d_b is the percentage of tuples in block b that satisfy the predicate A_i = v_j. Note that while DensityMaps are stored per column, the actual underlying data is assumed to be stored in row-oriented fashion.
Example 1
The table T in Figure 1 is stored over 9 blocks. The density map for the predicate A = v is [0.20, 0.10, ...], indicating that 20 percent of the tuples in block 1 and 10 percent of the tuples in block 2, respectively, have value v for attribute A.
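Index construction of this kind can be sketched as below; the list-of-dicts table layout, fixed block size, and function name are our own illustrative assumptions:

```python
from collections import defaultdict

def build_density_maps(records, block_size):
    """Build per-attribute, per-value density maps over fixed-size blocks.

    records: list of dicts mapping attribute name -> value.
    Returns maps[attr][value] = list of per-block densities.
    """
    n_blocks = (len(records) + block_size - 1) // block_size
    counts = defaultdict(lambda: defaultdict(lambda: [0] * n_blocks))
    sizes = [0] * n_blocks  # actual records per block (last block may be partial)
    for i, rec in enumerate(records):
        b = i // block_size
        sizes[b] += 1
        for attr, val in rec.items():
            counts[attr][val][b] += 1
    return {attr: {val: [c / sizes[b] for b, c in enumerate(blocks)]
                   for val, blocks in vals.items()}
            for attr, vals in counts.items()}
```

With two blocks of ten records where a value appears twice in the first block and once in the second, this yields the [0.2, 0.1] map from Example 1.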
DensityMaps are a very flexible index structure as they can estimate the percentage of valid records for any adhoc query with single or nested selection constraints. For queries with more than one predicate, we can combine multiple DensityMaps together to calculate the estimated percentage of valid records per block, multiplying densities for conjunction and adding them for disjunction. In performing this estimation, we implicitly assume that the DensityMaps are independent, akin to selectivity estimation in query optimization [25]. As in query optimization, this assumption may not always hold, but as we demonstrate in our experiments on real datasets, it still leads to effective results. In particular, DensityMaps drastically reduce the number of disk accesses by skipping blocks whose estimated densities are zero (and thus definitely do not contain valid records). Some readers may be reminded of other statistics used for query optimization, such as histograms [25]. However, unlike histograms which store the overall frequencies of records for entire relations, DensityMaps store this information at a finer blocklevel granularity.
Example 2
In Figure 1, for a given query Q with selection constraints A = v AND B = w, the estimated DensityMap after combining the two corresponding DensityMaps is [0.02, 0.03, ...], indicating (approximately) that block 1 has 2 percent matching records and block 2 has 3 percent matching records for Q.
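The combination rule can be sketched as follows; the per-block densities [0.2, 0.1] and [0.1, 0.3] are illustrative values chosen to reproduce Example 2's 2% and 3% estimates, and the cap at 1.0 for disjunctions is our own safeguard on the additive rule:

```python
import math

def combine_density_maps(dmaps, op="and"):
    """Estimate per-block density of valid records for a multi-predicate query.

    dmaps: list of per-block density lists, one per predicate.
    Under the independence assumption, conjunctions multiply densities
    and disjunctions add them (capped at 1.0 as a safeguard).
    """
    out = []
    for densities in zip(*dmaps):
        if op == "and":
            out.append(math.prod(densities))
        else:  # "or"
            out.append(min(1.0, sum(densities)))
    return out
```

Blocks whose combined estimate is zero definitely contain no valid records and can be skipped outright.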
Thus, compared to bitmaps, DensityMaps are a coarser statistical summary of the valid records in each block for each attribute value. DensityMaps save significant storage cost by keeping information at the block level instead of the record level, making it feasible to maintain DensityMaps in memory. Moreover, coupled with efficient algorithms, which we describe in detail next, DensityMaps can decrease the number of blocks read from disk for any-k sampling and thereby reduce query execution time.
One concern with DensityMaps is that, since we admit into our any-k sample set all records from a block that satisfy the constraints, the samples we retrieve may be biased with respect to the data layout. In Section 6, we describe techniques that correct for possible correlations between the samples and the data layout by applying cluster sampling and unequal probability estimation techniques.
4 Any-K Algorithms: Extremes
We introduce two algorithms which take advantage of our DensityMaps to perform fast any-k sampling. The primary insights behind these algorithms come from the following two observations.
First, a high-density block contains more valid records than a low-density block. Thus, it is more beneficial to retrieve the high-density block, so that, overall, fewer blocks are retrieved.
Observation 1 (Density: Denser is better.)
Under the same circumstances, retrieving a block with high density is preferable to retrieving a low density block.
In an HDD, the time taken to retrieve a block from disk can be split into seek time and transfer time. The seek time is the time it takes to locate the desired block, and the transfer time is the time required to actually transfer the bytes of data in the block from the HDD to the operating system. Blocks that are far apart incur additional seek time, while neighboring blocks typically require only transfer time. Thus, retrieving neighboring blocks is preferred. Similar locality arguments hold (to varying degrees) for SSDs and RAM.
Observation 2 (Locality: Closer is better)
Under the same circumstances, retrieving neighboring blocks is preferable to retrieving blocks which are far apart.
Our basic any-k sampling algorithms take advantage of each of these observations: DensityOptimal optimizes for density, while LocalityOptimal optimizes for locality. These two algorithms are optimal extremes, each favoring just one of density or locality.
On different types of storage media, the two observations can have different amounts of impact. For example, the locality observation may not be as important for in-memory data and solid-state drives (SSDs), since the random I/O performance of these storage media is not as poor as it is on HDDs. For our purposes, we focus on HDDs, the most common type of storage device, but we also evaluate our techniques on SSDs.
To judge which of these two algorithms is better in a given setting, or to combine the benefits of these two algorithms, we require a cost model for storage media, which we present in Section 5.
Table 1 provides a summary of the notation used in the following sections.
Symbol   Meaning

n        Number of blocks
p        Number of predicates
k        Number of samples received
c        An empirical constant: the cost to sequentially access one block
D_j      A DensityMap indicated in the WHERE clause
D_j[i]   The i-th entry of DensityMap D_j
S_j      A sorted DensityMap indicated in the WHERE clause
S_j[i]   The i-th entry of the j-th sorted DensityMap
θ        Threshold
H        Set of block IDs with their aggregated densities
Seen     Set of block IDs seen so far
O        Set of block IDs returned by the algorithm
4.1 DensityOptimal Algorithm
DensityOptimal is based on the threshold algorithm proposed by Fagin et al. [23]. The goal of DensityOptimal is to use our in-memory DensityMap index to retrieve the densest blocks until k valid records are found. The unmodified threshold algorithm of Fagin et al. would attempt to find the x densest blocks for a fixed x. However, in our setting, we do not know the value of x in advance: we only know k, the number of valid tuples required, so we need to set the value of x on the fly.
For fast execution of DensityOptimal, an additional sorted DensityMap data structure is required. For every DensityMap D_j, we sort it in descending order of density to create a sorted DensityMap S_j. Every element S_j[i] has two attributes: S_j[i].id, the block ID, and S_j[i].density, the percentage of tuples in that block which satisfy the corresponding constraint. Here, D_j[1] refers to the first block of the data, while S_j[1] refers to the densest block in the data. Sorted DensityMaps are precomputed at data loading time and stored in memory along with the DensityMaps, so the sorting time does not affect query execution times.
High-Level Intuition. At a high level, the algorithm examines each of the relevant sorted DensityMaps corresponding to the predicates in the query. It traverses these sorted DensityMaps in sorted order, while maintaining a record of the blocks with the highest overall density for the query, i.e., the highest number of valid tuples. The algorithm stops when the maintained blocks contain at least k valid records and it is guaranteed that none of the unexplored blocks has a higher overall density than the maintained ones.
Algorithmic Details. Algorithm 1 provides the full pseudocode. With sorted DensityMaps, it is easy to see how DensityOptimal handles a query with a single predicate: it simply selects the sorted DensityMap S_j corresponding to the predicate and retrieves the first few blocks of S_j until k valid records are found. For multiple predicates, the execution of DensityOptimal is more involved. Depending on how the predicates are combined, the aggregation operator ⊕ could mean product, if the predicates are all combined using ANDs, or sum, if the predicates are all combined using ORs. Each DensityMap D_j represents a predicate from the query, while S_j is its sorted variant. At each iteration, we traverse down the S_j's in lockstep, maintaining a running threshold θ and keeping track, in a set H, of all the block IDs encountered across the sorted DensityMaps, along with their aggregated densities. This threshold represents the minimum aggregate density that a block must have across the predicates before we are sure that it is one of the densest blocks. During iteration i, we consider all blocks in H examined in previous iterations that have not yet been selected for the output. If the one with the highest density has density at least θ, it is added to the output O. Due to the monotonicity of the operator ⊕, θ is an upper bound on the aggregate density of any block not yet seen in this or previous iterations. Thus, DensityOptimal maintains the following invariant: a block is selected for the output iff its density is at least as high as that of every block not yet selected. Overall, DensityOptimal adds blocks to the output in decreasing order of density, and terminates when the number of valid records in the selected output blocks is at least k.
To retrieve the any-k samples, we then load the blocks returned by DensityOptimal into memory and return all valid records in those blocks. If the total number of query results in those blocks is less than k, we re-execute DensityOptimal on the blocks that were not retrieved in previous invocations.
Fetch Optimization. Depending on the order of the blocks returned by DensityOptimal, the system may perform many unnecessary random I/O operations. For example, if DensityOptimal returns a block late in the file, then one early in the file, and then another late block, the system incurs an expensive disk seek for each. Instead, we can sort the blocks by ID before fetching them from disk, thereby minimizing random I/O and overall query execution time.
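A simplified sketch of this threshold-style traversal for conjunctive queries follows; the function and variable names are ours, and the sketch assumes densities combine by product under independence (the full pseudocode is Algorithm 1):

```python
import heapq
import math

def density_optimal(sorted_maps, maps, block_size, k):
    """Return block IDs, densest first, whose expected valid records reach k.

    sorted_maps[j]: list of (block_id, density) pairs, density descending.
    maps[j][block_id]: density of a block for predicate j (random access).
    """
    seen, output, expected = set(), [], 0.0
    candidates = []  # max-heap of (-overall_density, block_id)
    for i in range(len(sorted_maps[0])):
        # Threshold: best possible overall density of any block not yet seen.
        theta = math.prod(sorted_maps[j][i][1] for j in range(len(maps)))
        for j in range(len(maps)):
            b = sorted_maps[j][i][0]
            if b not in seen:
                seen.add(b)
                overall = math.prod(m[b] for m in maps)
                heapq.heappush(candidates, (-overall, b))
        # Emit every seen block guaranteed to beat all unseen blocks.
        while candidates and -candidates[0][0] >= theta:
            overall, b = heapq.heappop(candidates)
            output.append(b)
            expected += -overall * block_size
            if expected >= k:
                return output
    while candidates and expected < k:  # fewer than k expected records overall
        overall, b = heapq.heappop(candidates)
        output.append(b)
        expected += -overall * block_size
    return output
```

In line with the fetch optimization, the returned block IDs would then be sorted before being read from disk.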
Guarantees. We now show that DensityOptimal retrieves the minimum set of blocks when optimizing for density.
Theorem 1 (Density Optimality)
Under the independence assumption, DensityOptimal returns the set of blocks with the highest densities containing at least k valid records.
Since DensityOptimal is a significant modification of the threshold algorithm, the proof of the above theorem does not follow directly from prior work. Proof. The proof is composed of two parts: first, we demonstrate that DensityOptimal adds blocks to O in order of decreasing overall density; second, we demonstrate that DensityOptimal stops only when the number of valid records in O is at least k. The second part is obvious from the pseudocode (line 16), so we focus on the first part, which we prove by induction. Assume that the blocks added to O through the (i-1)-th iteration satisfy the property, and let θ_i denote the threshold of the i-th iteration; note that θ_i ≤ θ_{i-1}. Consider the blocks that are part of H at the end of line 10 in the i-th iteration. These blocks fall into two categories: either they were already part of H in the (i-1)-th iteration, and hence have densities less than θ_{i-1}, or they were added to H in the i-th iteration and, by monotonicity, once again have densities less than θ_{i-1}. Furthermore, any block that has not yet been examined has density less than θ_i. Since all blocks added to O at iteration i-1 or earlier have densities greater than or equal to θ_{i-1}, all the blocks still under contention for adding to O, those in H or those yet to be examined, have densities no higher than those already in O. Now, in iteration i, we add to O all blocks in H whose densities are at least θ_i, in decreasing order. By monotonicity, all of these blocks have higher densities than every block yet to be examined. Thus, any block added to O in iteration i has density no higher than those added previously, and no lower than any block remaining in H or encountered in future iterations.
4.2 LocalityOptimal Algorithm
Our second algorithm, LocalityOptimal, prioritizes locality rather than density, aiming to identify the shortest sequence of blocks that guarantees k valid records. The naive approach would be to consider the sequence formed by every pair of blocks (along with all of the blocks in between), leading to an algorithm that is quadratic in the number of blocks. Instead, LocalityOptimal, described below, is linear in the number of blocks.
High-Level Intuition. LocalityOptimal moves across the sequence of blocks using a sliding window formed by a start and an end pointer, and eventually returns the smallest possible window with k valid records. At each point, LocalityOptimal ensures that the window has at least k valid records within it, by first advancing the end pointer until the constraint is met, and then advancing the start pointer until the constraint is once again violated. It can be shown that this approach considers all minimal sequences of blocks with k valid records; LocalityOptimal then returns the smallest such sequence.
Algorithmic Details. The pseudocode for the algorithm is listed in Algorithm 2. LocalityOptimal operates on an array of values formed by applying the operator ⊕ to the predicate DensityMaps, one block at a time. At the start, both pointers are at the value corresponding to the density of the first block. We move the end pointer to the right until the number of estimated valid records between the two pointers is no less than k; at this point, we have our first candidate sequence containing at least k valid records. We then move the start pointer to the right, checking whether each resulting sequence still contains at least k valid records, and continuing until the constraint is once again violated. Afterwards, we again advance the end pointer. At all times, we maintain the smallest sequence found so far, replacing it whenever we find a smaller one.
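The sliding-window search can be sketched as follows, operating directly on per-block record estimates; the function name and the inclusive (start, end) return convention are our own:

```python
def locality_optimal(est, k):
    """Find the shortest run of consecutive blocks with >= k estimated records.

    est[b]: estimated valid records in block b (the per-block ⊕ values).
    Returns (start, end), inclusive block indices, or None if no run suffices.
    """
    best = None
    start, running = 0, 0.0
    for end in range(len(est)):
        running += est[end]  # advance the end pointer
        # Shrink from the left while the window still holds >= k records.
        while running - est[start] >= k:
            running -= est[start]
            start += 1
        if running >= k and (best is None or end - start < best[1] - best[0]):
            best = (start, end)
    return best
```

Each block index is visited by each pointer at most once, so the search is linear in the number of blocks, as claimed above.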
Guarantees. We now show that LocalityOptimal retrieves the minimum sequence of blocks when optimizing for locality.
Theorem 2 (Locality Optimality)
Under the independence assumption, LocalityOptimal returns the smallest sequence of blocks that contains at least k valid records.
We demonstrate that, for every block b, LocalityOptimal at some point considers the smallest sequence of blocks with k valid records beginning at block b, thereby proving the above theorem. Proof. For b = 1, this is easy to see: the end pointer of LocalityOptimal starts at block 1 and only increases, and the start pointer is not moved until a valid sequence of blocks is found, so by construction LocalityOptimal considers the smallest valid sequence starting at block 1. For the remaining b's, we prove the claim by contradiction. Let the smallest sequence of blocks beginning at block b end at block e, where e ≥ b; we denote this sequence seq(b, e). Now, let e' be the ending block of the smallest sequence starting at block b + 1, denoted seq(b + 1, e'). If e' ≥ e, LocalityOptimal considers the sequence seq(b + 1, e') as it moves the end pointer forward (lines 6-8 in the pseudocode). Assume to the contrary that e' < e; that is, seq(b + 1, e') is the smallest sequence of blocks starting at b + 1 that has k valid records. Then seq(b + 1, e') is a subsequence of seq(b, e'), so seq(b, e') must also have at least k valid records. However, we already declared seq(b, e) to be the smallest sequence of blocks starting at b that has k valid records, while seq(b, e') is smaller, a contradiction. The same argument applies to every b, so LocalityOptimal must consider the smallest sequence of blocks starting at block b for every b.
5 Hybrid Any-K Algorithms
The two desired properties, density and locality, can often be at odds with each other depending on the data layout: dense blocks may be far apart, and a run of neighboring blocks may include many blocks with no valid records. In this section, we first present a cost model to estimate the I/O cost of an any-k algorithm, and then use it to design an any-k algorithm that is I/O-optimal, providing the best balance between density and locality, as well as a hybrid algorithm that selects between DensityOptimal and LocalityOptimal.
5.1 A Simple I/O Cost Model
To set up the I/O cost model for HDDs (hard disk drives), we profile the storage system as described by Ruemmler et al. [47]. We randomly choose various starting blocks and record the time taken to fetch other blocks at varying distances (where distance is measured in number of blocks), from distance 1 onwards. As shown in Figure 2, which uses a linear scale on the x-axis from x=2 until x=10 and a logarithmic scale after that, we observe that with a block size of 256KB, the I/O cost is smallest when performing a sequential I/O operation to fetch the next block (approximately 2ms), and increases with the distance up to a certain maximum distance δ, after which it becomes constant (approximately 12ms). We have overlaid our cost model estimate using a dashed blue line. More formally, for two blocks b_i and b_j with i < j, we model the cost of fetching block b_j after block b_i as:

    cost(b_i, b_j) = c + α·(j − i − 1)   if j − i < δ
    cost(b_i, b_j) = c_max               otherwise

where c is the sequential-access cost and c_max is the full-seek cost. When the distance is less than δ, we use a simple linear fit for the slope α, computed using the Python numpy.polyfit function.
On the other hand, as shown in Figure 2, the I/O cost model for SSDs differs from the one we see for HDDs. Overall, we observe a constant time (approximately 0.6ms) to fetch a block (overlaid as a dashed blue line), independent of the block distance.
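A piecewise model of this shape can be sketched as below. The 2ms, 12ms, and 0.6ms constants are read off Figure 2; the slope and the saturation behavior are illustrative assumptions, not the paper's fitted numpy.polyfit values:

```python
def io_cost(distance, c_seq=2.0, c_max=12.0, slope=0.05):
    """Estimated HDD cost (ms) of fetching a block `distance` blocks away.

    Sequential access (distance 1) costs c_seq; cost grows roughly linearly
    with distance and saturates at the full-seek cost c_max.
    """
    if distance <= 1:
        return c_seq
    return min(c_max, c_seq + slope * (distance - 1))

def ssd_cost(distance, c_const=0.6):
    """SSD cost is roughly constant regardless of block distance."""
    return c_const
```

Such a function is all the retrieval algorithms need: given two block IDs, it prices the jump between them.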
5.2 IOOptimal Algorithm
IOOptimal considers both density and locality to search for the set of blocks that provides the minimum overall I/O cost. Specifically, given our cost model, we can use dynamic programming to find the optimal set of blocks with k estimated valid records.
We define D(i, s) as the minimal cost to retrieve s estimated valid records when block b_i is amongst the blocks fetched. We define OPT(j, s) as the cost to retrieve the optimal set of blocks with s estimated valid records when considering only the first j blocks. Finally, we denote by e_i the estimated number of valid records inside block b_i, derived, as before, using the density-times-records-per-block computation. With this notation, we have:
The intuition is as follows: for each block b_i that has estimated valid records, either the block can be in the final optimal set or not. If we decide to include block b_i, the cost is the minimum cost amongst the following: (i) the smallest I/O cost of having s − e_i samples at block b_j where j < i, plus the cost of jumping from block b_j to b_i (i.e., D(j, s − e_i) + c(b_j, b_i)), or (ii) the optimal cost at block b_{i−1}, plus the random I/O cost of jumping from some block in the first i − 1 blocks to block b_i (i.e., OPT(i − 1, s − e_i) plus the maximum, distance-independent jump cost).
For the second expression, if we exclude block b_i, then the optimal cost is the same as the optimal cost at block b_{i−1}. Consequently, the optimal cost at block b_i is the smaller of these two cases. The full algorithm is shown in Algorithm 3, where the cost of fetching the first block is some constant.
Guarantees. We can show the following property.
Theorem 3 (IOOptimal)
Under the independence assumption and the constructed cost model for disk I/O, IOOptimal gives the blocks with optimal I/O cost for fetching any k valid records.
The proof is listed in full detail in Appendix A.1.
5.3 Hybrid Algorithm
Even though IOOptimal is able to return the optimal I/O cost for fetching any k samples, its much higher computation cost (as we show in our experiments) makes it impractical for large datasets. We propose Hybrid, which simply selects the better of DensityOptimal and LocalityOptimal, using our I/O cost model, when a query is issued. Since Hybrid needs to run both algorithms to determine the set of blocks selected by each, using Hybrid involves a higher upfront computational cost, but as we will see, leads to substantial performance benefits.
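Hybrid's selection step can be sketched as follows: each candidate algorithm's block sequence is priced with the cost model, and the cheaper plan wins. The 12ms first-block cost is an illustrative assumption:

```python
def plan_cost(blocks, cost, first_cost=12.0):
    """Modeled I/O cost of fetching a sorted sequence of block indices."""
    if not blocks:
        return 0.0
    total = first_cost
    for prev, cur in zip(blocks, blocks[1:]):
        total += cost(cur - prev)  # jump cost depends on block distance
    return total

def hybrid(density_plan, locality_plan, cost):
    """Return whichever plan the I/O cost model predicts to be cheaper."""
    if plan_cost(density_plan, cost) <= plan_cost(locality_plan, cost):
        return density_plan
    return locality_plan
```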
6 Aggregate Estimation
So far, our any algorithms retrieve records without any consideration of how representative they are of the entire population of valid records. If these records are used to estimate an aggregate (e.g., a mean), the value may be biased due to possible correlations between the aggregated attribute and the data layout. While this is fine for browsing, it leaves the user unable to make any statistically significant claims about the aggregate value. To address this problem, we make two simple adjustments to extend our any algorithms. First, we introduce a TwoPhase sampling scheme in which we mix a small number of randomly sampled records into our any samples. This random data is added in a fashion such that it does not significantly affect the overall running time, while at the same time allowing us to “correct for” the bias easily. Second, we correct the bias by leveraging techniques from the survey sampling literature. Specifically, we leverage the Horvitz-Thompson [31] and ratio [42] estimators, as described below. Note that such approaches have been employed in other settings in approximate query processing [44, 41, 21, 32, 36], but their application to any sampling is new.
6.1 TwoPhase Sampling
We propose a TwoPhase sampling scheme, in which we collect a large portion of the requested samples using an any algorithm, and collect the rest in a random fashion. We denote by α the proportion of samples we retrieve using the any algorithm, and by 1 − α the proportion we retrieve using random sampling. The user chooses the parameter α upfront based on how much random sampling they wish to add. While a smaller α (i.e., more random sampling) may reduce the number of total samples needed to obtain a statistically significant result, the time taken to retrieve random samples greatly exceeds the time taken to retrieve samples with our any algorithms. Therefore, α needs to be carefully chosen; we experiment with different values of α in Section 9.
More formally, if we let B be the set of blocks which have at least one valid record in them, we can describe TwoPhase sampling as follows: (1) Use an any algorithm to choose the densest blocks from B, and derive αk samples from them. (2) Uniformly at random select blocks from the remaining blocks of B, and derive the remaining (1 − α)k samples from them. Note that αk + (1 − α)k = k.
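The two phases can be sketched as follows. The greedy densest-block choice stands in for whichever any algorithm is used in phase one, and the per-block record estimates and α are as defined above:

```python
import random

def two_phase_blocks(est, k, alpha, seed=0):
    """Pick blocks for TwoPhase sampling.
    est[i]: estimated valid records in block i; blocks with est > 0 form B.
    alpha : fraction of the k samples drawn via the any (dense) phase.
    Returns (dense_blocks, random_blocks)."""
    rng = random.Random(seed)
    candidates = [i for i, e in enumerate(est) if e > 0]
    # Phase 1: take the densest blocks until alpha*k samples are covered
    dense, covered = [], 0
    for i in sorted(candidates, key=lambda i: est[i], reverse=True):
        if covered >= alpha * k:
            break
        dense.append(i)
        covered += est[i]
    # Phase 2: uniformly random blocks from the remainder, until the
    # remaining (1 - alpha)*k samples are expected to be covered
    rest = [i for i in candidates if i not in set(dense)]
    rng.shuffle(rest)
    rand, covered = [], 0
    for i in rest:
        if covered >= (1 - alpha) * k:
            break
        rand.append(i)
        covered += est[i]
    return dense, rand
```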
6.2 Unequal Probability Estimation
Within the TwoPhase sampling scheme, the probability that a block is sampled is not uniform. Therefore, we must use an unequal probability estimator [56] and inversely weight samples based on their selection probabilities. We introduce two such estimators: the Horvitz-Thompson estimator and the ratio estimator.
Requisite Notation
The goal of our two estimators is to estimate the true aggregate sum τ and the true aggregate mean μ of a measure attribute A given a query Q. We use y_i to denote the aggregate sum of A for block b_i, and N for the total number of valid records for query Q. We can estimate N using the DensityMaps.
The estimators inversely weight samples based on their probability of selection. So, we define the inclusion probability π_i as the probability that block b_i is included in the overall sample. For the samples that come from the any blocks, the probability of being chosen is always 1. After these blocks have been selected, a uniformly random subset of m of the M remaining blocks is chosen to produce the random samples; thus the probability that these blocks are chosen is m/M.
We define the joint inclusion probability π_ij as the probability of selecting both blocks b_i and b_j for the overall sample: π_ij = 1 if both are any blocks, π_ij = π_j if b_i is an any block and b_j is a random block, and π_ij = m(m − 1)/(M(M − 1)) if both are random blocks.
Horvitz-Thompson Estimator
Using the Horvitz-Thompson [31] estimator, τ is estimated as:
(1) $\hat{\tau} = \sum_{b_i \in \mathcal{S}} y_i / \pi_i$, where $\mathcal{S}$ is the set of sampled blocks.
As mentioned before, the block sums are inversely weighted by their inclusion probabilities to account for the different probabilities of selecting the blocks in $\mathcal{S}$. Based on $\hat{\tau}$, we can also easily estimate $\hat{\mu}$ by dividing by the size of the population:
(2) $\hat{\mu} = \hat{\tau} / N$
The Horvitz-Thompson estimator guarantees that both $\hat{\tau}$ and $\hat{\mu}$ are unbiased estimates: $E[\hat{\tau}] = \tau$ and $E[\hat{\mu}] = \mu$. A full proof can be found in [31]. In addition, the Horvitz-Thompson estimator gives us a way to calculate the variances of $\hat{\tau}$ and $\hat{\mu}$, which represent the expected bounds of $\hat{\tau}$ and $\hat{\mu}$:
(3) $\widehat{V}(\hat{\tau}) = \sum_{b_i \in \mathcal{S}} \frac{1-\pi_i}{\pi_i^2} y_i^2 + \sum_{b_i \in \mathcal{S}} \sum_{b_j \in \mathcal{S}, j \neq i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \cdot \frac{y_i y_j}{\pi_i \pi_j}$
(4) $\widehat{V}(\hat{\mu}) = \widehat{V}(\hat{\tau}) / N^2$
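A sketch of the Horvitz-Thompson point estimates under TwoPhase sampling. The any-phase blocks have inclusion probability 1, the m random-phase blocks share inclusion probability m/M, and the variance computation is omitted; all names are illustrative:

```python
def horvitz_thompson(dense_sums, rand_sums, n_remaining, N):
    """Estimate the aggregate sum and mean from TwoPhase samples.
    dense_sums : per-block sums y_i from the any-phase blocks (pi = 1)
    rand_sums  : per-block sums y_i from the m randomly chosen blocks
    n_remaining: number M of candidate blocks the random phase drew from
    N          : (estimated) total number of valid records"""
    m = len(rand_sums)
    pi = m / n_remaining  # inclusion probability of each random block
    # inversely weight each block sum by its inclusion probability
    tau_hat = sum(dense_sums) + sum(y / pi for y in rand_sums)
    return tau_hat, tau_hat / N
```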
Ratio Estimator
Although the Horvitz-Thompson estimator is an unbiased estimator, the variances given by Equations 3 and 4 can be quite large if the aggregated variable is not well related to the inclusion probabilities [56]. To reduce the variance, the ratio estimator [42] may be used:
(5) $\hat{\mu}_r = \left(\sum_{b_i \in \mathcal{S}} y_i / \pi_i\right) \big/ \left(\sum_{b_i \in \mathcal{S}} n_i / \pi_i\right)$
(6) $\hat{\tau}_r = N \cdot \hat{\mu}_r$
where $n_i$ is the number of valid records in block $b_i$. Letting $d_i = y_i - \hat{\mu}_r n_i$, the variances of $\hat{\mu}_r$ and $\hat{\tau}_r$ are given by:
(7) $\widehat{V}(\hat{\mu}_r) = \frac{1}{N^2} \left[\sum_{b_i \in \mathcal{S}} \frac{1-\pi_i}{\pi_i^2} d_i^2 + \sum_{b_i \in \mathcal{S}} \sum_{b_j \in \mathcal{S}, j \neq i} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}} \cdot \frac{d_i d_j}{\pi_i \pi_j}\right]$
(8) $\widehat{V}(\hat{\tau}_r) = N^2 \cdot \widehat{V}(\hat{\mu}_r)$
While the ratio estimator is not exactly unbiased, the numerator in Equation 5 is the unbiased Horvitz-Thompson estimate of the sum and the denominator is an unbiased Horvitz-Thompson estimate of the total number of valid records, so the bias tends to be small and decreases with increasing sample size.
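A sketch of the ratio point estimates under TwoPhase sampling. Each fetched block contributes its aggregate sum y_i and its valid-record count n_i, and the m random-phase blocks share inclusion probability m/M; all names are illustrative:

```python
def ratio_estimator(dense, rand, n_remaining, N):
    """dense, rand: lists of (block_sum, block_count) pairs for the
    any-phase blocks (pi = 1) and the random-phase blocks respectively.
    Returns (sum_estimate, mean_estimate)."""
    m = len(rand)
    pi = m / n_remaining
    # HT estimate of the sum divided by HT estimate of the record count
    num = sum(y for y, _ in dense) + sum(y / pi for y, _ in rand)
    den = sum(c for _, c in dense) + sum(c / pi for _, c in rand)
    mu_hat = num / den
    return N * mu_hat, mu_hat
```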
We compare the empirical accuracies of these two estimators in Section 9, and demonstrate how our TwoPhase sampling technique, when employing these estimators, provides accurate estimates of various aggregate values.
7 Grouping and Joins
So far, we have assumed that all our sampling queries have the form dictated by the SELECT query given in Section 2, thus limiting our operations to a single database table, with simple selection predicates and no groupby operators. We now extend the any sampling problem formulation and our algorithms to handle more complex queries that involve grouping and join operations.
7.1 Supporting Grouping
Rather than computing a simple any query, users may want to retrieve k values per group, e.g., to compute an estimate of an aggregate value in each group.
Although a trivial way to do this would be to run a separate any query per group, in this section we discuss an algorithm that can share the computation across groups in the common case when users want the same number of values, k, per group.
Consider an any query over a table T, with S representing the predicate in the WHERE clause. Let G be the grouping attribute with values in {g_1, …, g_v}. The formal goal of this grouped any sampling can be stated as:
Problem 2 (grouped any sampling)
Given a query defined by a predicate S on table T, and a grouping attribute G, the goal of grouped any sampling is to retrieve any k valid records for each group of G in as little time as possible.
Our basic approach is to create a combined density map, which takes into account every group in the groupby operation, and run the any algorithm for all groups at once. This is akin to sharing scans in traditional databases.
In order to run our any algorithms for all groups, we first define the combined density Δ_i of the i-th block as the product of two factors: (1) the density of the i-th block with respect to predicate S, and (2) the sum of the densities in the i-th block for the group values in G which still need to be sampled. The first factor has been discussed previously, while the second factor can be defined as:
(9) $\frac{1}{RPB} \sum_{g \in G} \min\left(d_{i,g} \cdot RPB,\; k - k_g\right)$
where RPB is the number of records per block, $k_g$ is the number of samples already retrieved for group $g$, and $d_{i,g}$ is the density of the $i$-th block for the value $g$. The expression inside the min function estimates the number of expected records in block $b_i$ for each group $g$, but limits the estimate by the number of samples left to be retrieved for that group.
With this combined density estimate Δ_i, we can now construct an iterative any algorithm for grouped sampling operations, similar to the algorithms in Sections 4 and 5. The main structure of the algorithm is as follows: (1) Update all combined densities Δ_i. (2) Run one of the any algorithms to retrieve the δ blocks with the highest combined density. (3) Set the densities of the retrieved blocks to 0. (4) If k samples still have not been retrieved for each group, go back to step (1). Since Δ_i depends on the number of samples already retrieved, it must be updated periodically to ensure the correctness of the combined densities. The parameter δ controls the periodicity of these updates. The problem of setting δ becomes a tradeoff between CPU time and I/O time. Setting δ = 1 updates the densities after every block retrieval; while this more correctly prioritizes blocks and is likely to lead to fewer blocks retrieved overall, there is a high CPU cost in updating the densities after each block retrieval. As δ increases, the CPU cost goes down due to less frequent updates, but the overall I/O cost is likely to go up, since the combined densities are not completely up-to-date for each block retrieved. Although our iterative algorithm is not particularly complex, and globally I/O-optimal solutions may perform better than our locally optimal solution, our algorithm has the advantage of simplicity of implementation and likely lower CPU overhead. We defer consideration of more sophisticated algorithms to future work.
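The iterative loop can be sketched as follows. Here fetch is a stand-in for actually reading a block and returning the group labels of its valid records, group_est holds estimated per-group record counts per block, and delta is the update period; all names are illustrative:

```python
def grouped_any(pred_density, group_est, k, fetch, delta=1):
    """Retrieve k samples per group by repeatedly fetching the blocks
    with the highest combined density.
    pred_density[i]: density of block i w.r.t. the predicate
    group_est[i]   : dict mapping group value -> estimated records in block i
    fetch(i)       : reads block i, returns the group labels of its valid records
    Returns a dict: group value -> number of samples retrieved."""
    groups = {g for ge in group_est for g in ge}
    got = {g: 0 for g in groups}
    seen = set()

    def combined(i):
        # predicate density times the still-needed (capped) group mass
        if i in seen:
            return 0.0
        need = sum(min(group_est[i].get(g, 0.0), k - got[g])
                   for g in groups if got[g] < k)
        return pred_density[i] * need

    while any(got[g] < k for g in groups):
        ranked = sorted(range(len(pred_density)), key=combined, reverse=True)
        batch = [i for i in ranked[:delta] if combined(i) > 0]
        if not batch:
            break  # no unfetched block can contribute anything further
        for i in batch:
            seen.add(i)
            for g in fetch(i):
                got[g] += 1
    return got
```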
Algorithm Details. The full algorithm for the grouping any query is shown in Algorithm 4. In Section 4, the number of samples retrieved was a single value; for the grouping any algorithm, it is now an array of size v, where each entry represents the number of samples retrieved for that group. Every iteration consists of updating the combined density estimates (the priorities of the blocks) based on the number of samples retrieved (setting a block's priority to 0 if it has already been seen), and calling an any algorithm with the updated priorities and the number of blocks desired, δ. The algorithm updates the per-group counts, and only ends once every entry in the array is at least k.
As we show in the next section, the key-foreign key join any problem is essentially equivalent to this grouped any formulation, and we evaluate the performance of our algorithm on these problems in Section 9.7.
Optimal Solution. As mentioned, the grouping any solution uses a heuristic to find the best blocks to retrieve. However, an I/O optimal solution, similar to IOOptimal from Section 5.2, could be derived using dynamic programming with a recursive relationship based on the notion of priority from Equation 9 and the disk model from Section 5.2. Unfortunately, the resulting dynamic programming solution becomes a complex program of even more dimensions than the program from Section 5.2. Since we show in Section 9 that IOOptimal incurs a prohibitively high CPU cost in exchange for its optimal I/O time, we chose not to pursue this avenue.
Multiple Groupings. For multiple group-by attributes, we simply extend the above formula to account for every possible combination of values from the different grouping attributes. For example, if we have two group-by attributes G_1 and G_2, we treat each pair of values (g_1, g_2), with g_1 ∈ G_1 and g_2 ∈ G_2, as its own group in Equation 9.
7.2 Supporting Key-Foreign Key Joins
Consider any sampling on the result of a key-foreign key join between two tables R and S, where attribute A is the primary key in R and the foreign key in S. Similar to grouping, the formal definition of join any sampling can be stated as:
Problem 3 (join any sampling)
Given a query defined by a predicate and a join over tables R and S, on primary key R.A and foreign key S.A, the goal of join any sampling is to retrieve any k valid joined records for each join value of A in as little time as possible.
For example, if we join on a “departments” attribute, k joined samples would be retrieved for each department.
Since we assume A is the primary key of R, and therefore unique, the join any sampling problem reduces to finding any k valid records in table S for each join value of A. However, this is exactly the grouped any sampling problem in which the group values are the values of A. Thus, we can use the algorithm described in the previous section, with the values of A as the grouping values on the foreign key table S.
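Because each join value maps to a single row of the primary-key table, the final join is a simple lookup once the grouped samples are collected. A sketch, assuming the grouped algorithm has already produced k samples per join value and using an in-memory dict as the hash join (all names are illustrative):

```python
def join_any_samples(s_samples_by_a, r_rows_by_a):
    """s_samples_by_a: join value a -> list of sampled rows from S
                       (output of the grouped any algorithm with group = A)
    r_rows_by_a      : join value a -> the unique row of R with primary key a
    Returns: join value a -> list of joined (s_row, r_row) pairs."""
    return {a: [(s_row, r_rows_by_a[a]) for s_row in s_rows]
            for a, s_rows in s_samples_by_a.items()}
```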
In this way, NeedleTail is able to identify the blocks to retrieve that minimize the overall time for joins. We evaluate our join any algorithm in Section 9.7. We leave optimizations for other join variants as future work.
8 System Design
We implemented our DensityMaps, any algorithms, and aggregate estimators in a system called NeedleTail. NeedleTail is developed as a standalone browsing-based data exploration engine, capable of returning individual records as well as estimating aggregates. NeedleTail can be invoked by various front-ends, e.g., SQL IDEs or interfaces such as Tableau or Excel. Figure 3 depicts the overall architecture of NeedleTail. It includes four major components: the any module, the random sampling module, the index, and the disk access module. The any module receives queries from the user and executes our any algorithms from Sections 4, 5, and 7 to return browsing samples as quickly as possible. For aggregate queries, the random sampling module is used in conjunction with the any module to perform the TwoPhase sampling from Section 6. The index contains the DensityMaps and sorted DensityMaps. Finally, the disk access module is in charge of interacting with the buffer pool to retrieve blocks from disk. Since DensityMaps are a lossy compression of the original bitmaps, it is possible that some blocks with no valid records may be returned; these blocks are filtered out by the disk access module.
Our NeedleTail prototype is currently implemented in C++ in about 5000 lines of code. It is capable of reading in row-oriented databases with int, float, and varchar types, and supports Boolean-logic predicates. Although the current implementation is limited to a single machine, we plan to extend NeedleTail to run in a distributed environment in the future. We believe the collective memory space available in a distributed environment will allow us to leverage DensityMaps in even better ways.
9 Performance Evaluation
In this section, we evaluate NeedleTail, focusing on runtime, memory consumption, and accuracy of estimates. We show that our DensityMap-based any algorithms outperform “first-to-k-samples” algorithms that use traditional OLAP indexing structures such as bitmaps or compressed bitmaps, on a variety of synthetic and real datasets. In addition, we empirically demonstrate that our TwoPhase sampling scheme achieves aggregate estimates as accurate as random sampling in a fraction of the time. Then, we demonstrate that our join any algorithms provide substantial speedups for key-foreign key joins. We conclude the section with an exploration of the effects of different parameters on our any algorithms.
9.1 Experimental Settings
We now describe our experimental workload, the evaluated algorithms, and the experimental setup.
Synthetic Workload: We generated 10 clustered synthetic datasets using the data generation model described by Anh and Moffat [11]. Every synthetic dataset has 100 million records, 8 dimension attributes, and 2 measure attributes. For the sake of simplicity, we forced every dimension attribute to be binary (i.e., valid values were either 0 or 1), with the measure attributes sampled from normal distributions, independent of the dimension attributes. For each dimension attribute, we enforced an overall density of 10%; that is, the number of 1’s for any attribute was 10% of the overall number of records. Since we randomly generated the clusters of 1’s in each attribute, we ran queries with equality-based predicates on the first two dimension attributes. Note that this does not always result in a selectivity of 10%, since the records that match the first predicate may not also match the second.
Real Workload: We also used two real datasets.

Airline Dataset [1]: This dataset contained the details of all flights within the USA from 1987–2008, sorted based on time. It consisted of 123 million rows and 11 attributes, with a total size of 11 GB. We ran 5 queries with 1 to 3 predicates on attributes such as origin airport, destination airport, flight carrier, month, and day of week. For our experiments on error (described later), we estimated the average arrival delay, average departure delay, and average elapsed time for flights.

NYC Taxi Dataset [5]: This dataset contained logs for a variety of taxi companies in New York City for the years 2014 and 2015. The dataset as provided was sorted first by year; within each year, it was sorted by the three taxi types and then by time. It consisted of 253 million rows and 11 attributes, with a total size of 21 GB. We ran 5 queries with 1 to 2 predicates on attributes including pickup location, dropoff location, time slots, month, passenger count, vendors, and taxi type. For our experiments on error, we estimated the average fare amount and average distance traveled for the chosen trips.
Algorithms: We evaluated the performance of the four any algorithms presented in Sections 4 and 5: (i) IOOptimal, (ii) DensityOptimal, (iii) LocalityOptimal, and (iv) Hybrid. We compared our algorithms against the following four “first-to-k-samples” baselines. BitmapScan and DiskScan are representative of how current databases implement the LIMIT clause.

BitmapScan: Assuming we have bitmaps for every predicate, we use bitwise AND and OR operations to construct a resultant bitmap corresponding to the valid records. We then retrieve the first k records whose bits are set in this bitmap.

LossyBitmap [62]: LossyBitmap is a variant of bitmap indexes where a bit is set for each block instead of each record. For each attribute value, a set bit for a block indicates that at least one record in that block has that attribute value. During data retrieval, we perform bitwise AND or OR operations on these bitmaps, and then fetch records from the first few blocks which have their bit set. Note that this is equivalent to a DensityMap which rounds every nonzero density up to 1.

EWAH: Identical to BitmapScan, except that the per-predicate bitmaps are EWAH-compressed, and the resultant bitmap is computed over the compressed representations.

DiskScan: Without using any index structures, we continuously scan the data on disk until we retrieve k valid records.
For our experiments on aggregate estimation, we compared our TwoPhase sampling algorithms against the baseline BitmapRandom, which is similar to BitmapScan, except that it selects k random records among all the valid records. We describe our setup for the join any experiments in Section 9.7.
Setup: All experiments were conducted on a 64-bit Linux server with a 3.40GHz Intel Xeon E3-1240 processor (4 cores, 8 hyperthreads) and 8GB of 1600MHz DDR3 main memory. We tested our algorithms with a 7200RPM 1TB HDD and a 350GB SSD. For each experimental setting, we ran 5 trials (30 trials for the random sampling experiments) for each query on each dataset. In every trial, we measured the end-to-end runtime, the CPU time, the I/O time, and the memory consumption. Before each trial, we dropped the operating system page cache and filled it with dummy blocks to ensure the algorithms did not leverage any hidden benefits from the page cache. To minimize experimental variance, we discarded the trials with the maximum and minimum runtimes and report the average of the remaining trials. Finally, after empirically testing a few different block sizes, we found 256KB to be a good default block size for our datasets; the block size does not significantly impact the relative performance of the algorithms.
9.2 Query Execution Time
Summary: On the synthetic datasets on a HDD, our Hybrid any sampling algorithm was on average 13× faster than the baselines. For the real datasets, Hybrid performed at least as well as the baselines for every query, and on average was 4× and 9× faster for queries on HDDs and SSDs respectively.
Synthetic Experiments on a HDD. Figure 4 presents the runtimes for Hybrid, DensityOptimal, LocalityOptimal, and the four baselines for varying sampling rates. (We evaluate IOOptimal later on.) The sampling rate is defined as the ratio of k to the number of valid records. Since the queries can have a wide variety in the number of valid records, we plot the impact of varying the sampling rate rather than k. (Results for varying k are similar.) The bars in the figure represent the average runtimes for five sampling rates over the 10 synthetic datasets. Note that the figure is in log scale.
Regardless of the sampling rate, DensityOptimal, Hybrid, and LocalityOptimal significantly outperformed BitmapScan, LossyBitmap, EWAH, and DiskScan, with speedups of an order of magnitude. For example, for a sampling rate of 1%, DensityOptimal, LocalityOptimal, and Hybrid took 74ms, 45ms, and 58ms on average respectively, while BitmapScan, LossyBitmap, EWAH, and DiskScan took 647ms, 624ms, 662ms, and 630ms on average respectively. Thus, our any algorithms are more effective at identifying the right sequence of blocks that contain valid records than the baselines, which do not optimize for any sampling; the baselines are subject to the vicissitudes of random chance: if there is a large number of valid records early on, they will do well, and not otherwise. This is despite the fact that BitmapScan and EWAH store more fine-grained information than our algorithms, and are therefore able to effectively skip over blocks without valid records.
There was no consistent winner between DensityOptimal and LocalityOptimal across sampling rates and queries, but Hybrid always selected the faster of the two and thus had an average speedup of 13× over the baselines. That said, Hybrid's performance at lower sampling rates (0.1%, 1%) is slightly worse than DensityOptimal and LocalityOptimal, since it has to run both algorithms and pick the better one; but this difference in performance is small, around 10ms. From 5% onwards, Hybrid's performance is clearly better than DensityOptimal and LocalityOptimal, since the increase in computation time is dwarfed by the improvement in I/O time.
Real Data Experiments on a HDD. Figures 5 and 6 show the runtimes of our algorithms over 5 diverse queries for the airline and taxi workloads respectively. For each query and sampling rate, we normalized the runtime of each algorithm by the largest runtime across all algorithms, while also reporting the actual runtime (in ms) taken by Hybrid and the maximum runtime. We omitted DiskScan since it had the worst runtime in the previous experiment, and it similarly performs poorly here. For the real workloads, we noticed that the runtimes of the queries were much more varied, so we report the average runtime for each query separately.
For the airline workload, our any algorithms consistently outperformed the bitmap-based baselines: DensityOptimal had a speedup of up to 8× compared to BitmapScan and EWAH, while LocalityOptimal had a speedup of up to 7×. Across all queries, at a sampling rate of 1%, DensityOptimal and LocalityOptimal were on average 3× and 5× faster than BitmapScan and EWAH, despite having a much smaller memory footprint (Section 9.3). For example, for Q3, which had two predicates on month and origin airport, the block with the highest density already contained the requested 1% of samples by itself. Moreover, since the airline dataset is naturally sorted on time attributes (e.g., year, month), the valid tuples were more likely to be clustered in a small number of blocks. Therefore, compared with LocalityOptimal, DensityOptimal fetched up to 10% fewer blocks, resulting in lower query execution time than LocalityOptimal in all cases. For the small additional cost of estimating the sequence of blocks for both LocalityOptimal and DensityOptimal, Hybrid always selected the faster algorithm in both this and the taxi workload, with an average speedup of 4×. For example, for Q4 with a 1% sampling rate, Hybrid's time is close to DensityOptimal's, and half that of LocalityOptimal's.
We noticed a different (and somewhat surprising) trend for the taxi workload. Here, Hybrid continued to do well, and much better than the worst algorithm in every setting, with an average speedup of 4× compared to the baselines. Similarly, LocalityOptimal performed similarly to or better than the baselines in every experiment. However, on multiple occasions, we found that DensityOptimal was slower than the baselines, and was the worst algorithm, e.g., in Q3 and Q5. Upon closer examination, we found that DensityOptimal did in fact retrieve the fewest blocks for every query. However, the taxi dataset was much larger than the airline dataset, so the blocks were more spread out, and the time to seek from block to block went up significantly. As a result, we found the locality-favoring LocalityOptimal to perform better on a HDD, where seeks were expensive. To further exacerbate the issue, the taxi workload also had a much more uniform distribution of tuples; the tuples that satisfied the query predicates (which were not based on taxi type) were spread fairly uniformly across the dataset. In some sense, this made the dataset “adversarial” for density-based schemes. In other words, it is hard to conclude that either DensityOptimal or LocalityOptimal is better than the other, given that their performance depends on the distribution of valid tuples for a given ad-hoc query; it is therefore safer to use Hybrid to pick between the two.
Real Data Experiments on a SSD. We also ran the same workload on an SSD; SSDs have random I/O performance that is comparable to their sequential I/O performance. The results are depicted in Figures 7 and 8. We omit Hybrid, since on an SSD Hybrid always selects DensityOptimal over LocalityOptimal, because DensityOptimal fetches the smallest number of blocks. Overall, DensityOptimal is much faster than the bitmap-based baselines, with average speedups of 14× and 6× in the airline and taxi workloads respectively. There were two exceptions: Q1 (10%) in airline and Q3 in taxi, where the total numbers of blocks fetched by DensityOptimal, BitmapScan, LossyBitmap, and EWAH were similar. In this uncommon situation, even though DensityOptimal had the lowest I/O time, the CPU cost of checking for valid records in each block was slightly higher, so its runtime was a little higher than BitmapScan and EWAH.
9.3 Memory Consumption
Summary: DensityMaps consumed on average 48× less memory than regular bitmaps and 23× less memory than EWAH-compressed bitmaps.
Dataset  Disk Usage  # Tuples  Cardinality  Bitmap  EWAH  LossyBitmap  DensityMap 
Synthetic  7.5 GB  100M  16  190.73MB  182.74MB  0.06MB  3.73MB 
Taxi  21 GB  253M  64  1936.99MB  663.63MB  0.65MB  41.63MB 
Airline  11GB  123M  805  11852.33MB  744.05MB  3.98MB  254.72MB 
Table 2 reports the amount of memory used by DensityMaps compared to the other three bitmap baselines. We observed that DensityMaps were very lightweight, consuming around 51×, 47×, and 47× less memory than uncompressed bitmaps in the three datasets respectively. Even with EWAH compression, we observed an almost 16× reduction in size for the taxi dataset for DensityMaps relative to EWAH. In the airline dataset, since the selectivity of each attribute value is low, EWAH compressed the bitmaps much better than in the other two datasets. Still, EWAH consumed 3× more memory than DensityMap. Lastly, since LossyBitmap requires only one bit per block while DensityMap stores a 64-bit double per block, LossyBitmap unsurprisingly consumed less memory than DensityMap. However, as we showed in Section 9.2, the smaller memory consumption incurred a large cost in query latency due to the large number of false positives (e.g., Q3 with sampling rate 10% in Figure 5), especially when the number of predicates is large and the predicates exhibit complex correlations. In comparison, the DensityMap-based any algorithms were orders of magnitude faster than the baselines, while still maintaining a modest memory footprint (a small fraction of the original dataset size).
9.4 IOOptimal Performance
Summary: IOOptimal had up to 3.9× faster I/O time than Hybrid and the best I/O performance among all the algorithms described above. However, its large computational cost made it impractical for real datasets.
For the evaluation of IOOptimal, we used a smaller synthetic dataset of 1 million records and a block size of 4KB, and conducted the evaluation on a HDD. We compared its overall end-to-end runtime, CPU time, and I/O time with every other algorithm, and found that it consistently had the best I/O times. However, we found that the computational cost of the dynamic programming in IOOptimal outweighed any benefits from the shorter I/O time. Consequently, we found IOOptimal to be impractical for larger datasets. Figure 9 shows both the overall times and I/O times for IOOptimal and Hybrid for varying sampling rates.
9.5 Time vs Error Analysis
Summary: Compared to random sampling using bitmap indexes, our TwoPhase sampling schemes, which mix samples from our any algorithms with a small percentage of random cluster samples, attained the same error rate in much less time.
Using the TwoPhase sampling techniques in Section 6, we can obtain estimates of aggregate values on data; here, we mix in a small proportion of random samples, and use the DensityOptimal algorithm, since it performed the most consistently well across queries and workloads, for SSDs and HDDs. We compared these results with pure random sampling (BitmapRandom) using bitmaps on a HDD. We used the same set of queries as in Section 9.2. For each query, we varied the sampling rate and measured the runtime and the empirical error of the estimated aggregate with respect to the true average value. Figure 10 depicts the average results for both the Horvitz-Thompson estimator and the ratio estimator, in log scale. We start with the taxi dataset and the ratio estimator. Figure 10(a) shows that if all the sampling schemes are allowed to run for 500ms (commonly regarded as the threshold for interactivity), DensityOptimal, the two TwoPhase variants (with two different random-sampling proportions), and BitmapRandom have average empirical error rates of 29.64%, 4.83%, 3.66%, and 19.64% respectively; the corresponding numbers of samples retrieved are 11102, 7977, 5684, and 35 respectively. Thus, the TwoPhase sampling schemes are able to effectively correct the bias in DensityOptimal, while still retrieving a comparable number of samples. Furthermore, note that BitmapRandom suffers from the same large memory consumption problem as BitmapScan. In contrast, even though DensityOptimal was not the fastest algorithm in the taxi workload, our TwoPhase sampling algorithms cluster sample at the block level and only need access to the much more compressed DensityMaps.
The behavior on the airline workload is somewhat different: here DensityOptimal performs better than the TwoPhase sampling schemes with the ratio estimator for an initial period of about 100ms, after which the TwoPhase sampling schemes perform better than both DensityOptimal and BitmapRandom. We found this behavior repeated across other queries and trials: DensityOptimal sometimes ends up with very low error (as in Figure 10(b)) and sometimes fairly high error (as in Figure 10(a)), but the TwoPhase sampling schemes consistently achieve low error relative to DensityOptimal. This is because DensityOptimal's accuracy is highly dependent on the correlation between the data layout and the attribute of interest, which can sometimes lead to highly biased results. At the same time, the TwoPhase sampling schemes return many more samples and much more accurate estimates than BitmapRandom, effectively supporting browsing and sampling at the same time.
Between the Horvitz-Thompson estimator and the ratio estimator, the ratio estimator generally had higher accuracy. As explained in Section 6, the ratio estimator works well in situations where the aggregate being estimated is not correlated with the block densities. We found this to be the case for both the airline and taxi workloads, so the ratio estimator helped in both.
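To make the two estimators concrete, the following is a minimal sketch of how a mean can be estimated from a block-level cluster sample. The representation of a sampled block as a (block total, valid-record count, inclusion probability) triple is our illustrative choice, not NeedleTail's actual interface.

```python
def ht_estimate(sampled_blocks, N):
    """Horvitz-Thompson estimate of the population mean.
    sampled_blocks: list of (block_total, block_count, inclusion_prob).
    N: total number of valid records in the population."""
    return sum(t / pi for t, _, pi in sampled_blocks) / N

def ratio_estimate(sampled_blocks):
    """Ratio estimate: the HT estimate of the total divided by the
    HT estimate of the number of valid records. It does not need N,
    and tends to be more stable when block totals track block counts."""
    num = sum(t / pi for t, _, pi in sampled_blocks)
    den = sum(n / pi for _, n, pi in sampled_blocks)
    return num / den

# Three sampled blocks: (total of the aggregated attribute,
# number of valid records, inclusion probability of the block).
blocks = [(120.0, 4, 0.5), (30.0, 1, 0.25), (90.0, 3, 0.5)]
# ht_estimate(blocks, N=10) -> 54.0; ratio_estimate(blocks) -> 30.0
```

The ratio estimator divides two HT estimates, so block-level fluctuations that affect both the numerator and denominator partially cancel, which is consistent with the behavior reported above.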
9.6 Effect of Parameters
To explore the properties of our any algorithms, we varied several parameters and measured their effect on overall runtimes for synthetic workloads: (i) data size, (ii) number of predicates, (iii) density, (iv) block size, and (v) granularity.
Data Size: We varied the synthetic dataset size from 1 million to 1 billion records and found that the overall runtimes of our any algorithms remained roughly constant. Our algorithms return only a fixed number of samples and explicitly avoid reading the entire dataset, so runtimes stay consistent even as the data size increases.
Number of Predicates: As we increased the number of predicates in a query, overall runtimes increased as well. Since our predicates were combined using ANDs, an increase in the number of predicates meant a decrease in the number of valid records per block. Therefore, both DensityOptimal and LocalityOptimal needed to fetch more blocks to retrieve the same number of samples, which increased the overall runtime.
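One plausible way to reason about conjunctive predicates over per-predicate density maps (this combination rule is our illustration, not necessarily NeedleTail's) is that the block-wise minimum of the per-predicate counts upper-bounds the density of the conjunction:

```python
def combined_density_upper_bound(density_maps):
    """Given one density map per predicate (count of satisfying
    records per block), the block-wise minimum upper-bounds the
    number of records satisfying the AND of all predicates."""
    return [min(counts) for counts in zip(*density_maps)]

dm_a = [10, 0, 7, 3]   # per-block counts satisfying predicate A
dm_b = [4, 9, 8, 0]    # per-block counts satisfying predicate B
# combined_density_upper_bound([dm_a, dm_b]) -> [4, 0, 7, 0]
```

Because the true conjunctive density can only be at or below this bound, more predicates shrink the expected valid records per block, matching the runtime increase observed above.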
Density: As we increased the overall density of valid records in the dataset, the runtimes of our any algorithms decreased. A higher overall density raises the average density per block, so our any algorithms needed to retrieve fewer blocks to collect the same number of samples.
Block Size: We varied the block sizes of our datasets across 4KB, 256KB, 1MB, and 2MB. As we decreased the block size, the runtimes for DensityOptimal increased drastically, because smaller blocks meant more random I/O. However, we saw no definite trend as we increased the block size. Although larger blocks bias the algorithms toward more locality, they also mean density information is collected at a coarser granularity; we suspect this tradeoff prevented any improvement in performance with increased block size.
9.7 Key-Foreign Key Join Performance
Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 2465.67 | 215.678 | 67.6398 | 63.7114 | 73.7204
0.5% | 2632.32 | 1049.92 | 307.534 | 300.023 | 301.193
1.0% | 2761.74 | 1424.12 | 616.090 | 593.258 | 591.201

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 1266.06 | 110.63 | 33.59 | 32.38 | 68.16
0.5% | 1359.80 | 548.88 | 154.34 | 159.45 | 153.65
1.0% | 1418.73 | 737.64 | 308.00 | 299.78 | 307.44

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 12415.26 | 1114.80 | 417.18 | 347.93 | 314.27
0.5% | 13019.96 | 5193.80 | 2068.75 | 1708.97 | 1476.64
1.0% | 13651.02 | 7143.95 | 4167.87 | 3478.81 | 2971.74

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 1930.30 | 134.84 | 47.62 | 46.94 | 95.57
0.5% | 2069.74 | 628.99 | 223.42 | 228.43 | 225.59
1.0% | 2197.38 | 904.28 | 453.56 | 452.34 | 447.00

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 7223.28 | 926.99 | 278.35 | 217.47 | 165.21
0.5% | 7348.53 | 4460.22 | 1422.41 | 1108.15 | 886.90
1.0% | 7573.63 | 5758.62 | 2859.31 | 2220.21 | 1770.48

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 2220.28 | 226.89 | 62.43 | 65.48 | 74.75
0.5% | 2389.59 | 1110.59 | 312.64 | 303.34 | 304.42
1.0% | 2509.94 | 1478.30 | 625.06 | 605.06 | 606.28

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 5177.96 | 245.09 | 70.56 | 75.44 | 85.31
0.5% | 5453.41 | 1172.76 | 361.24 | 350.30 | 351.10
1.0% | 5692.40 | 1618.83 | 722.81 | 696.12 | 698.50

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 2528.40 | 68.33 | 60.85 | 58.91 | 68.16
0.5% | 2699.45 | 326.55 | 280.79 | 278.06 | 279.73
1.0% | 2843.45 | 651.10 | 558.19 | 547.48 | 559.85

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 2445.25 | 1250.18 | 2148.35 | 1845.65 | 1603.79
0.5% | 2597.23 | 1513.81 | 2304.90 | 2008.87 | 1759.71
1.0% | 2757.08 | 1704.73 | 2421.01 | 2116.32 | 1873.40

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 1527 | 1527 | 1284 | 1284 | 1284
0.5% | 1527 | 1527 | 1284 | 1284 | 1284
1.0% | 1527 | 1527 | 1284 | 1284 | 1284

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 62154.22 | 142.40 | 56.50 | 54.40 | 78.98
0.5% | 63513.28 | 664.69 | 273.63 | 259.14 | 252.80
1.0% | 64099.30 | 1308.80 | 542.67 | 522.12 | 510.73

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 1673.12 | 1077.73 | 1783.76 | 1706.39 | 1647.18
0.5% | 1704.48 | 1120.75 | 1823.00 | 1742.84 | 1711.86
1.0% | 1742.57 | 1186.66 | 1841.09 | 1734.92 | 1721.47

Sampling Rate | SharedScan | BitmapCombined | DensityCombined (5 blocks/iter) | DensityCombined (10 blocks/iter) | DensityCombined (50 blocks/iter)
0.1% | 1526 | 1526 | 1525 | 1525 | 1525
0.5% | 1527 | 1527 | 1525 | 1525 | 1525
1.0% | 1527 | 1527 | 1525 | 1525 | 1525
Summary: Our iterative join any algorithm has an average speedup of 3× compared to existing baselines not optimized for any queries.
We now evaluate the extension of our any algorithms to key-foreign key joins from Section 7. We compare the performance of our join algorithm with two baselines: (1) SharedScan: a single scan of the foreign key table, shared across the different join attribute values, with no indexes; and (2) BitmapCombined: a single scan of the foreign key table, shared across the different join attribute values, using bitmap indexes to skip to the next valid record that can serve as a sample. In both cases, the algorithms terminate as soon as the requested number of samples has been found for every join value, and a hash join is used to combine each foreign key record with its primary key record, with the hash table constructed on the primary key table. In BitmapCombined, the bitmaps for the different join values are first combined using OR; once a join attribute value has accumulated the requested number of samples, its bitmap is subtracted from the combined bitmap.
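The BitmapCombined baseline can be sketched as follows; plain Python sets stand in for compressed bitmaps, and the function names are illustrative rather than the paper's implementation:

```python
def bitmap_combined(bitmaps, values, k):
    """OR the per-value bitmaps, scan forward to each set bit, and
    subtract a value's bitmap once it has k samples.
    bitmaps: {join_value: set of record ids with that value}.
    values: record id -> join value (the foreign key column).
    Returns up to k sampled record ids per join value."""
    combined = set().union(*bitmaps.values())
    samples = {v: [] for v in bitmaps}
    for rid in sorted(combined):      # simulate a forward scan
        if rid not in combined:       # bit cleared by a subtraction
            continue
        v = values[rid]
        samples[v].append(rid)
        if len(samples[v]) == k:
            combined -= bitmaps[v]    # this value is satisfied
    return samples
```

Subtracting a satisfied value's bitmap shrinks the combined bitmap as the scan proceeds, letting later skips jump over records that can no longer contribute samples.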
For our join any algorithm, we ran the iterative algorithm presented in Section 7 and used DensityOptimal as the any algorithm in each iteration. We varied the number of blocks retrieved per iteration before the combined densities are updated, and evaluated its impact on the overall runtime. As with SharedScan and BitmapCombined, we used a hash join with the hash table constructed on the primary key table.
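The iterative loop can be sketched roughly as follows. This is a simplified reconstruction under our own assumptions: per-join-value density maps, a combined density equal to the sum over still-unsatisfied values, a greedy densest-block choice standing in for DensityOptimal, and a hypothetical `fetch_block` callback returning (record id, join value) pairs.

```python
def density_combined(density_maps, fetch_block, k, beta):
    """density_maps: {join_value: per-block valid-record counts}.
    Each iteration fetches the beta densest unfetched blocks under
    the combined density, then drops satisfied join values before
    the next iteration."""
    n_blocks = len(next(iter(density_maps.values())))
    needed = set(density_maps)                  # unsatisfied values
    samples = {v: [] for v in density_maps}
    fetched = set()
    while needed:
        def density(b):                         # combined density
            return sum(density_maps[v][b] for v in needed)
        candidates = [b for b in range(n_blocks)
                      if b not in fetched and density(b) > 0]
        if not candidates:
            break
        for b in sorted(candidates, key=density, reverse=True)[:beta]:
            fetched.add(b)
            for rid, v in fetch_block(b):
                if v in needed and len(samples[v]) < k:
                    samples[v].append(rid)
        needed = {v for v in needed if len(samples[v]) < k}
    return samples
```

Updating the combined densities only after each batch of beta blocks is what the varied per-iteration parameter controls: larger batches amortize the update cost but may fetch blocks that a fresher combined density would have skipped.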
All experiments were run on a synthetic dataset using an SSD. The synthetic dataset had two tables: a primary key table and a foreign key table, with all attributes of type int. The i-th row of the primary key table had i as its unique primary key value, and the foreign key table's foreign key attribute values were generated using a Zipf distribution. Note that this means some foreign keys did not match any primary key. In addition, we varied the following parameters: (1) the number of rows in the foreign key table, (2) the number of attributes in the foreign key table, (3) the number of attributes in the primary key table, (4) the number of unique values for the join attribute in the primary key table (and thereby the number of rows in the primary key table), and (5) the Zipf distribution parameter. Each experimental setup was run 5 times, and the mean runtimes are reported. The standard deviation across runs was less than 1% in every experiment, so we do not report it.
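A minimal sketch of a generator matching this description follows. The choice of drawing foreign keys from a domain twice the size of the primary key domain (so that some keys dangle) is our assumption; the paper only states that some foreign keys match no primary key.

```python
import random

def zipf_sample(s, n_values, rng):
    """Draw one value in 1..n_values with P(v) proportional to v**-s."""
    weights = [v ** -s for v in range((1), n_values + 1)]
    return rng.choices(range(1, n_values + 1), weights=weights)[0]

def make_tables(n_fk_rows, n_pk_rows, s, seed=42):
    """Build the two synthetic tables (key columns only): the i-th
    primary-key row has key i; foreign keys follow a Zipf law over a
    domain larger than the primary keys, so some keys dangle."""
    rng = random.Random(seed)
    pk_table = list(range(1, n_pk_rows + 1))
    fk_table = [zipf_sample(s, 2 * n_pk_rows, rng)
                for _ in range(n_fk_rows)]
    return pk_table, fk_table
```

With a Zipf parameter of 2, low key values dominate the foreign key column, which is what makes the high-valued join keys scarce in the experiments below.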
Table 3 shows the overall runtimes for different sampling rates with the following parameters: (1) 10 million rows in the foreign key table, (2) 10 foreign key table attributes, (3) 10 primary key table attributes, (4) 10 unique join attribute values, and (5) 2 as the Zipf distribution parameter. The lowest runtime for each sampling rate is highlighted in bold, and the speedup relative to BitmapCombined (the better of the two baselines) is indicated in parentheses. As shown, DensityCombined was the fastest algorithm at every sampling rate, with a 3× speedup over BitmapCombined and an order-of-magnitude improvement over SharedScan. This was largely because DensityCombined retrieved far fewer blocks than either BitmapCombined or SharedScan. For a sampling rate of 0.05%, DensityCombined (10 blocks/iter) retrieved only 190 blocks, while BitmapCombined retrieved 1259 and SharedScan retrieved 1527. Since we used an SSD, fetching fewer blocks always translated into a lower runtime. Had these experiments been run on an HDD, we would have used Hybrid as our any algorithm instead.
We found that varying the number of blocks retrieved per iteration had a rather minimal impact on the overall runtime within a certain range (5 to 50 in this case). Outside of this range, the setting had a larger impact on performance, but the overall runtime was still lower than either BitmapCombined or SharedScan.
With the experimental setup used for Table 3, we varied each of the 5 parameters mentioned before one at a time to see their effect on overall runtime performance. Other than the varied parameter for each experiment, the other parameters were set to be the same as those used in the experiment for Table 3.
(1) Rows in Foreign Key Table. First, we wanted to see whether our iterative join any algorithm could scale to different dataset sizes. Since the size of the primary key table is fixed by the number of unique join attribute values, we focused on varying the number of rows in the foreign key table. Tables 4 and 5 show the results for 5 and 50 million rows in the foreign key table, respectively. DensityCombined still provided a 2-3× speedup over BitmapCombined, suggesting that our join any algorithm can scale. The reason for the speedup was the same as for Table 3: DensityCombined retrieved far fewer blocks than BitmapCombined.
(2) Number of Attributes in Foreign Key Table. Given the row-oriented layout of our data, a change in the number of attributes in the foreign key table meant a change in the number of records per block for that table. This parameter let us observe how DensityCombined adapts to different numbers of records per block; since the number of attributes also affects the size of the dataset, this experiment served as an additional check on how well DensityCombined scales with size. Tables 6 and 7 show the results for 5 and 50 attributes in the foreign key table, respectively. DensityCombined remained faster than both SharedScan and BitmapCombined, and the speedup became more pronounced as the number of attributes increased. This was due to DensityCombined being more selective about the blocks it retrieved: when there were fewer records per block, the choice of blocks had a larger impact on the number of blocks fetched, making DensityCombined better suited to this case than either BitmapCombined or SharedScan.
(3) Number of Attributes in Primary Key Table. We hypothesized that varying the number of attributes in the primary key table would have minimal effect on the overall runtime; its only impact should be on the time taken to copy primary key records to the output. Tables 8 and 9 show the results for 5 and 50 attributes in the primary key table, respectively. As expected, this parameter had minimal impact on the overall performance of the algorithms, with runtimes extremely similar to Table 3. Given the small magnitude of the differences, we believe the runtime discrepancies between Tables 3 and 8 were due to experimental noise.
(4) Number of Unique Join Attribute Values. We wanted to see how DensityCombined performs with different numbers of join values, so we varied the number of unique join attribute values, thereby also changing the number of rows in the primary key table. Tables 10 and 11 show the results for 5 and 50 unique join attribute values, respectively. As expected, DensityCombined outperformed BitmapCombined for 5 unique join values. Interestingly, BitmapCombined was faster than DensityCombined for 50 unique join values. Upon closer examination, we found that DensityCombined was still retrieving fewer blocks, as shown in Table 12; however, compared to the other experiments, DensityCombined retrieved a larger number of blocks relative to BitmapCombined. Due to the Zipf distribution of the foreign key values, records with a foreign key value greater than 10 were scarce and more spread out among the blocks, so a greater number of blocks had to be retrieved to satisfy the join any query (regardless of the any algorithm used). Since we ran all these experiments with DensityOptimal, we believe DensityCombined's lack of locality awareness caused the greater overall runtime.
(5) Zipf Distribution Parameter. The final variable of interest was the Zipf distribution parameter used to generate the foreign key values. Tables 13 and 14 show the results for Zipf parameters of 1.5 and 5, respectively. DensityCombined outperformed SharedScan and BitmapCombined as usual for a Zipf parameter of 1.5, but BitmapCombined once again outperformed DensityCombined for a Zipf parameter of 5. A larger Zipf parameter concentrates the foreign keys more heavily around the lower values, making the higher values scarcer; thus, we saw behavior similar to the experiment with 50 unique join attribute values. Table 15 shows that DensityCombined still retrieved the fewest blocks, but only 1 or 2 fewer than BitmapCombined and SharedScan. When retrieving roughly the same number of blocks, the locality-unaware DensityCombined expectedly performed worse than BitmapCombined.
Overall. These experiments suggest that the iterative join any algorithm is most effective when a smaller number of blocks needs to be fetched. Fortunately, most browsing scenarios fit this category (e.g., a user is unlikely to want 10 samples for each of 50 unique join values), so NeedleTail is a good fit for the browsing use case. Nevertheless, we are working on making DensityCombined locality-aware so that it can handle the remaining situations.
10 Related Work
Prior work related to N