IDEBench: A Benchmark for Interactive Data Exploration
Existing benchmarks for analytical database systems such as TPC-DS and TPC-H are designed for static reporting scenarios. The main metric of these benchmarks is the performance of running individual SQL queries over a synthetic database. In this paper, we argue that such benchmarks are not suitable for evaluating database workloads originating from interactive data exploration (IDE) systems where most queries are ad-hoc, not based on predefined reports, and built incrementally.
As a main contribution, we present a novel benchmark called IDEBench that can be used to evaluate the performance of database systems for IDE workloads. As opposed to traditional benchmarks for analytical database systems, our goal is to provide more meaningful workloads and datasets that can be used to benchmark IDE query engines, with a particular focus on metrics that capture the trade-off between query performance and quality of the result. As a second contribution, this paper evaluates and discusses the performance results of selected IDE query engines using our benchmark. The study includes two commercial systems, as well as two research prototypes (IDEA, approXimateDB/XDB), and one traditional analytical database system (MonetDB).
IDEBench: A Benchmark for Interactive Data Exploration
Motivation: There is an ever growing need for systems which allow data scientist of varying skills levels to interactively and often visually explore large structured data sets. Unfortunately, traditional analytical database systems, such as MonetDB[?] or SAP HANA [?], usually do not provide the main property these systems require: sub-second response times for ad-hoc queries created by and for a visual interface [?, ?, ?]. As a result many existing interactive data exploration (IDE) systems started to create their own specialized query execution engines. For example, Tableau has its own SQL engine [?], Vizdom [?] has the Interactive Data Exploration Accelerator IDEA [?] as a seperate execution engine, and imMens [?] uses a special cube-based representation and query engine. In addition, there has also been attempts to build approximate query engines independent of the interface, such as SnappyData [?] or DICE [?, ?], with the goal to support interactive data exploration. All those systems have in common that they aim to provide better support for IDE workloads often by taking advantage of the fact that (1) users use visual tools to incrementally create more and more complex queries and that (2) good approximate answers returned in seconds are better than precise answers in minutes.
However, those systems widely differ in the techniques they use and the trade-offs they make. For example, IDEA uses specialized data structures and on-demand creation of stratified samples, whereas DICE does speculative pre-computation based on previous interactions. Furthermore, different engines implement different execution models (blocking, approximate, progressive), and might or might not re-use previously computed results [?, ?, ?, ?]. Additionally, IDE systems need to make a trade-off between the amount of pre-processing the system requires (i.e., how long does a user have to wait before they can explore a new data set) versus the quality of the first approximate answers. Even worse, different systems might have different trade-offs in regard to time vs. quality; some systems might aim to provide a first answer in 500ms, the general interactivity threshold [?], other systems will only return a result when the quality of the answer reaches some threshold, whereas others aim for progressive results in which the quality of an answer improves as longer the user is willing to wait.
This variety of goals and trade-offs make it extremely hard to determine which system is better for a particular data exploration task. For example, how much advantage do pre-computed stratified samples provide. How much overhead do approximate query processing techniques introduce? When would a traditional system like MonetDB not simply outperform an approximate engine without having even approximate answers. Or even more mundane, which of two approximate query engines is better.
Unfortunately, traditional analytical benchmarks can barely be used to answer any of these questions.
For instance, both TPC-H [?] and TPC-DS [?] use the time per query as the main metric, thereby not taking into consideration that results might be approximate.
Di Tria et al. [?] developed an extension for TPC-H that takes the accuracy of approximates result into account, but does not factor in confidence interval in their metrics.
More importantly, these benchmarks assume a fixed upfront known workload of complex queries, whereas in IDE queries are ad-hoc and built incrementally.
Finally, most existing analytical benchmarks assume that only one query at the time runs, whereas visual interfaces often create several queries, which run simultaneous.
Contributions: The first contribution of this paper is a novel benchmark called IDEBench that can be used to evaluate the performance of database systems for IDE workloads under realistic conditions in a standardized, automated, and re-producible way. Unlike the static workloads of traditional benchmarks, IDEBench measures the performance of IDE systems over the course of workloads that are closer to real data exploration workflows.
One of the key challenges in defining an IDE benchmark is to decide what a typical workflow constitutes. We therefore build upon previous user studies [?, ?] and derived three common IDE browsing patterns ranging from independent browsing, where users investigate the distribution of attributes of a dataset and specify arbitrary filters, to more targeted scenarios, where users want to answer a specific question. Yet, our goal is NOT to simulate users, which is arguably impossible. As Hawking puts it, “Intelligence is the ability to adapt to change” and we are far away from being able to simulate users, which can react intelligently to varying quality of query results, build or loose trust in a system based on answers, or come up with new conclusions or exploration paths based on particular insights. Similarly, our goal is not to benchmark the effectiveness of user interfaces or other visual components, such as visual recommendations [?].
Rather our goal is to provide meaningful abstractions, which provides a first step towards benchmarking IDE query engines with a particular focus on metrics that capture the trade-off between query performance and quality of the result. Furthermore, we created a highly customizable benchmarking framework, which allows research groups to change different benchmarking settings to their envisioned user scenario while still enabling a high degree of reproducibility and comparability.
The second contribution of this paper is a comprehensive study of running the benchmark on different database engines. The study includes two IDE engines as well as two research prototypes (IDEA, approXimateDB/XDB) and one traditional analytical database system (MonetDB). The benchmark code is available for download at http://idebench.github.io.
Outline: The remainder of the paper is organized as follows: In Section IDEBench: A Benchmark for Interactive Data Exploration, we first reiterate the current landscape of IDE systems and their typical workload characteristics. We then discuss shortcomings of existing IDE evaluation approaches and derive a set of requirements for an IDE benchmark in Section IDEBench: A Benchmark for Interactive Data Exploration. Section IDEBench: A Benchmark for Interactive Data Exploration presents the design of IDEBench and discusses the workload and data generator, as well as the metrics and reporting requirements. In Section \thefigure we present the results of running the benchmark on five database systems; two commercial systems, two research prototypes (IDEA, approXimateDB/XDB), and one traditional analytical database system (MonetDB). Finally, we discuss the most important findings of the benchmark results and discuss future work in Section IDEBench: A Benchmark for Interactive Data Exploration.
Based on findings in recent user studies [?, ?] we first describe a use case that anecdotally exemplifies common visual IDE frontends and then summarize typical workload characteristics that emerge for database systems that are triggered by these frontends. We then provide an overview of IDE systems which are often used in practice, and discuss why existing evaluation approaches are not suitable to compare these systems.
Imagine Jean: a research staff member at a major hospital. She wants to get an overview of the hospital’s patient population and their health problems. To do so, she looks at electronic health records from the past 20 years. Jean starts out by examining demographic information of patients and, for example, finds that patients ages are normally distributed. She then continues to look for interesting patterns in admission times and dates. Jean creates a query that shows the number of new admits per hour of the day. The result reveals that most admits are during business hours, but there is an interesting bump from 7 to 10pm. She filters down to admits coming from the emergency center and notices that most of the admits between 7 and 10pm are coming from there. Is this trend identical on all days of the week? She refines her query to only show the admits on weekends and sees that the previous evening bump now shifted towards 10 to 12pm. Who are these patients? Jean filters her previous age query by patients admitted on weekends between 10 and 12pm. She finds that patients ranging from 20 to 35 are over represented in this subset when compared to the overall age distribution. Now Jean wants to see which health problems are common among this sub-population. She finds that head traumas are fairly frequent and decides to check with the administration if the hospital’s duty rota accommodates for this by making sure a trauma specialist is on call during weekend nights.
Based on this exemplary use case, we now discuss the most important workload characteristics that emerge for database systems.
IDE workloads are dominated by OLAP-style aggregation queries and typically follow the “Visual Information Seeking Mantra” [?] of “Overview first, zoom and filter, then details-on-demand”. Since the result of queries are typically visualized by an IDE frontend, most queries group the data by one or many attributes and apply aggregate functions to each group such as AVG, or SUM.
When dealing with large datasets, however, visualization systems commonly bin the data by some definition in order to compute aggregate results instead of just simply grouping on a set of attributes. Binning can be found for a wide-range of visualizations such as histograms and their siblings, pie charts, choropleth maps, or bubble charts. Binned plots are omnipresent and featured in virtually any visualization software. Figure IDEBench: A Benchmark for Interactive Data Exploration illustrates this with screenshots of three different IDE frontends: imMens, Tableau and Vizdom. Defining the binning behavior of an IDE frontend and thus for a benchmark, which includes the dimensionality of a bin (e.g., 1D for a histogram, 2D for a binned scatter plot), and the bin boundaries is not trivial. When the distribution of data for a visualization is unknown, there are generally two methods how bin boundaries can be defined: 1) they can be specified by either choosing a pre-defined number of bins, which, for quantitative values requires a computation of the current minimum and maximum value of each bin. 2) they could be defined by choosing a interval based on a fixed bin width and a reference value.
Incremental Query Building and Think Time
As illustrated in the use case, IDE frontends are often used in an iterative process where queries are constructed ad-hoc and are refined incrementally. For example, users typically start looking at all data and then narrow down the search to more interesting details.
Furthermore, visualizations in IDE frontends are often linked together. Linking refers to setting the data source of a visualization (target) to the data source of another visualization (source). When data of a source visualization is either filtered or selected, either the source and the target, or just the target visualization are forced to update. This is illustrated in Figure IDEBench: A Benchmark for Interactive Data Exploration where changing the selection in one visualization updates all other visualizations. Linking on the query level often represents a join, since attributes of different tables might be connected that way.
Finally, user interactions are typically separated by a think-time, during which users analyze results and decide on what to do next.
Multiple Concurrent Queries
Many existing visual tools for IDE e.g. [?, ?, ?] provide user interfaces to informally create and layout visualizations of different subsets of the data, apply (cross-) filtering, and to perform “linking, brushing and zooming” operations [?, ?]. Typically, this allows users to look at different facets of a data set at the same time.
Consider Figure IDEBench: A Benchmark for Interactive Data Exploration: each of the depicted applications [?, ?, ?] displays linked visualizations that users can utilize to simultaneously brush or filter other visualizations. Such links are either implicitly created by the application or database schema, or explicitly defined by the user.
In more abstract terms, dash-boards built by users using an IDE frontend can be seen as dependency graphs of visualization and filter objects. Changing properties of either object may require all dependent visualization to update, which on the database-level leads to multiple concurrent queries per interaction.
Several commercial and academic database systems aim to support interactive data exploration workloads. In the following we summarize this landscape through three categories and provide examples for each.
Analytical Database Systems This category represents classic database systems that efficiently execute aggregate queries to completion and then return full result. This includes column-stores and main-memory systems such as MonetDB [?], SAP HANA [?], Hyper [?] as well as database management systems that are designed for online analytical processing (OLAP) type workloads [?]. These systems cannot guarantee interactive response times on large data sets. However, since they are often used as backends for many visual data analytics tools such as Tableau, we include them in our experiments as a baseline.
Approximate Database Systems This category of systems also targets aggregate queries. Contrary to classic analytical database systems these tools either use offline or online sampling techniques to return an approximate answer without scanning the entire dataset. That way, these systems often can better guarantee interactive response times even on large data sets. Examples in this category include AQUA [?], Snappy Data, BlinkDB [?], and ApproXimate DB (XDB) [?]. Individual systems support different ways of how users interact with them; some require users to set a desired result quality per query or a fixed time constraint.
Specialized Engines for IDE Systems in this category represent specialized engines for IDE that often come with their own custom user interface. A prominent commercial example is the backend of Tableau [?] and its research predecessor Polaris [?], which uses a specialized engine for visual data analysis workloads. imMens [?] is another example that includes a specialized engine heavily leverages pre-computation over the possible query space to enable interactivity. IDEA [?] is a backend used for a pen-and-touch interface. It uses online sampling techniques to progressively compute query results and push them to the user interface on request. DICE [?] is a backend system optimized for exploratory cube analysis and leverages interaction delays (i.e., “think-times”) and a user interaction model to predict future queries.
Recent position papers [?, ?] have advocated for an IDE benchmark. In this section we discuss the scope of our work and the requirements we believe to be crucial for a benchmark for IDE backend systems.
Ideally workloads are close to the database workload generated by the behavior of real users in a variety of different data exploration scenarios. While we acknowledge the breadth and richness of different IDE tasks and user interfaces (UI), this work does not attempt to model exact human behavior for its workloads.
Instead we follow a more pragmatic, UI-agnostic approach: we limit the scope of IDEBench to common exploration patterns which can be translated to workloads on the database backend. As previously discussed, we observed these patterns in a range of user studies (e.g., [?, ?]), and empirically found them to be supported by many other modern IDE systems (see Figure IDEBench: A Benchmark for Interactive Data Exploration). To that end, IDEBench focuses on aggregate queries on large datasets where queries are built and refined incrementally (separated by a think-time), result are (cross)-filtered between tables, and applying filters on linked visualizations can result in multiple concurrent queries.
Furthermore, in its current version IDEBench targets data warehouse star schemas that are often used in analytical scenarios in both de-normalized and normalized form. In future, we plan to extend the benchmark to more rich database schemata as well.
In the following we discuss the workload, the metrics, and the data that a new benchmark such as IDEBench should fulfill.
As discussed before, we believe that a new benchmark should provide a workload that could result from plausible user behaviours (see Section IDEBench: A Benchmark for Interactive Data Exploration) where actions to build and modify a visualizations result in queries to an IDE backend. As discussed before, we therefore require that the workload not only runs individual queries but resemble workload characteristics discussed in Section IDEBench: A Benchmark for Interactive Data Exploration that can be triggered by user interacting with an visual IDE frontend. This includes actions to create new visualizations which triggers new queries, actions that add filters to incrementally select a sub-populations, link visualizations where a single action can force multiple visualizations to update. Furthermore, when queries in the benchmark are executed, there must be delays between queries triggered by consecutive user interactions, and the queries by the simulation must be aggregation queries with different parameterizations, i.e., different binning strategies and aggregate functions. Lastly, it is crucial for workloads to be applicable to the wide range of database backends, as discussed in Section IDEBench: A Benchmark for Interactive Data Exploration.
Metrics used to evaluate the performance of an IDE system should capture their generative power for insights, and must be applicable to IDE systems which return exact or progressive/approximate results. We believe that this can be reduced to two aspects of IDE backends. (1) Speed: a recent study [?] shows that higher query latencies negatively affects users and their ability to derive insights from data. (2) Quality of the results: while approximate and progressive systems are able to maintain low latencies, intermediate results may vary from the ground-truth, or could be returned with low confidence (large confidence intervals). Because there is a trade-off between speed and quality, we believe that the metrics for an IDE benchmark should reflect the quality of the results for a given interactivity/time requirement (quality after seconds). Important here is that the metrics should also be applicable to classic analytical database systems, where the query results are always exact but the query execution may be slow since results are only returned upon query completion.
An IDE benchmark should support different datasets with different scale factors. As some IDE systems may perform better on certain data distributions than others, it is important that the attributes in the dataset exhibit different types of distributions, and contain random as well as correlated data. Furthermore, systems like approXimateDB [?] implement online joins while classic systems such as MonetDB [?] typically use a blocking join such as a radix hash-join. To be able to measure the effect of such joining techniques, an IDE benchmark must support different schema complexities, i.e., normalized as well as de-normalized star schemas.
We believe that in order for an IDE benchmark to be adapted by the community, it is crucial that workloads and datasets can be customized to the use case of an IDE system. A benchmark should therefore facilitate the ability to create workloads based on modifiable configurations, and to scale any seed datasets to an arbitrary size while preserving the original distributions.
In this section we explain how we designed IDEBench along the requirements for workloads, data and metrics, and describe how the results of the final metrics are computed and reported.
IDEBench comprises three main components: 1) A data generator that scales any seed dataset to an arbitrary size. 2) A workload generator that create sequences of interactions, which we henceforth refer to as “workflows”. 3) An benchmark driver, which runs/simulates workflows, delegates interactions to system drivers, and generates reports. 4) Different system adapters, which are custom proxies between the database systems under test and the interpreter which runs a workload. The drivers are responsible to translate the benchmark workload into queries supported by a system, and returns computed results back to the interpreter.
The default configuration of IDEBench uses a real-world data set containing U.S. domestic flights [?] (see Figure IDEBench: A Benchmark for Interactive Data Exploration).
We use this dataset instead of existing data generators from TPC-H or TPC-DS since it contains real-world data and distributions. This is important because the underlying distributions can affect quality of results, especially in the case of database systems that use approximate query processing techniques. In the default configuration, the benchmark use this dataset to test the systems. Alternatively, users can use any other dataset to customize the benchmark.
In order to scale the default data set (but also custom data sets), IDEBench comes with a data generator uses a seed datasets to create a new dataset of arbitrary size. The generator tries to maintain distributions in the data and relationships between attributes when scaling. It also supports the transformation of data into a more normalized form (e.g., one fact and multiple dimension tables) based on a specification given by the user. In the default configuration, the benchmark uses three data sizes (S, M, and L) for each data set to test the runtime for increasing data sets. We elaborate more on the concrete data set sizes in the evaluation section.
The data scaling procedure to scale a data set works as follows: From the seed dataset we first create a random sample. We then compute the covariance matrix and perform the Cholesky decomposition on . To create a new tuple, we first generate a vector of random normal variables and induce correlation by computing . We then transform to uniform distribution and finally use the CDF from our sample to transform the uniform variables to a correlated tuple. Optionally, as a last step the data generator then vertically partitions the data into multiple tables (normalization) based on a user-given schema specification.
Unlike the static workloads of traditional benchmarks, IDEBench aims to measure the performance of database workloads that result from interactive data exploration frontends over the course of entire user-centered workflows (see Section IDEBench: A Benchmark for Interactive Data Exploration). In these workloads it is common that queries are built and refined incrementally, executed with think-times between queries rather than being processed back-to-back, and that oftentimes multiple queries need to be processed simultaneously in order to update multiple linked visualizations.
In order to reflect this in our benchmark, IDEBench workloads simulate user interactions that are typical interactions in IDE frontends. We have implemented a workflow generator that allows users of IDEBench to create custom workflows of for any of the four types below. Workflows are sequences of common interactions that resemble IDE interaction patterns of real users (see Figure IDEBench: A Benchmark for Interactive Data Exploration. The workflow generator models workflows as Markov Chains with pre-defined (and customizable) probability distributions for each of the workflow types to sample a sequence of interactions and filter/selection criteria. We base the different types of workloads and probability distributions of various interactions on observations made by analyzing the logs and videos of past user studies [?, ?].
Independent Browsing (Figure IDEBench: A Benchmark for Interactive Data Explorationa), where users explore a dataset by creating visualizations of different dimensions of the data and by applying filters that only affect a single visualization at the time. This is often applied by users to get a first quick overview of the data.
Sequential Linking (Figure IDEBench: A Benchmark for Interactive Data Explorationb), where users create multiple visualizations that are (logically) sequentially linked. Filters and selections on these visualizations trigger multiple concurrent queries to update all affected visualizations. This type of workflow is often used for targeted explorations where users drill down in the data to verify one concrete hypothesis.
1:N Linking (Figure IDEBench: A Benchmark for Interactive Data Explorationc), where filtering/selecting on one visualization triggers other queries for directly linked visualizations. This type of workflow is often used to see how different subsets of the data matching certain criteria affect visualizations of other dimensions of the data.
N:1 Linking (Figure IDEBench: A Benchmark for Interactive Data Explorationd), where applying a filter to any one of visualization affects a single directly linked visualization. This is often used to incrementally build filter expressions involving multiple dimensions.
These patterns determine the workload for the database system regarding the fact of how many queries are triggered simultaneously. For example, for the independent browsing type one user interaction results on in one query since only one visualization needs to be updated. In contrast, for the 1:N linking, one user interaction (e.g., changing the filter) on a visualization early in the sequence can trigger many queries since multiple visualization potentially need to be updated.
Each generated workflow comprises a sequence of interactions performed by users: Creating a visualization i.e., formulating and executing query (interaction 1, 3, 4 in Figure IDEBench: A Benchmark for Interactive Data Exploration), filtering/selecting (interaction 2 and 6), linking visualizations (interaction 5), and discarding a visualization (not shown). Alternatively, customized workflows can additionally be built manually to match a specific usage scenario. However, they are not part of the default configuration of IDEBench. Once generated, they can be inspected with an interactive viewer.
The core of IDEBench is an benchmark driver, a simple command line application (written in Python) configured to load and simulate workflows by forwarding interactions to system adapters (see Section IDEBench: A Benchmark for Interactive Data Exploration), and to evaluate the results that are fed back into the driver. When running a workflow the driver keeps track of a visualization graph (similar to IDEBench: A Benchmark for Interactive Data Exploration), notifying an adapter about which interactions have been executed. These interactions are specified in a JSON-based format (see Figure IDEBench: A Benchmark for Interactive Data Exploration). The benchmark driver automatically translates queries to SQL, or alternatively, lets the system driver translate queries into a language compatible with the system being evaluated.
To be evaluated by our benchmark a system needs to implement a driver interface that acts as proxy between the benchmark and the system under test. The benchmark driver delegates interactions from the workflows to the system adapter. The system adapter takes these interactions and translates them into queries executable by the system under evaluation (e.g., SQL in case of a classical analytical database system). However, any other query language or proprietary API can be used by implementing an additional adapter.
In the following, we show the most important parameters that our benchmark uses to test different configurations.
|Time Requirement (TR)||The maximum execution duration for a query.|
|Dataset and Size||The dataset to run the benchmark on and the number of tuples to up- or downsample dataset to.|
|Think Time||The delay between two consecutive interactions.|
|Using Joins||Whether a normalized or de-normalized schema is used.|
|Confidence Level||The confidence level at which an AQP returns margins of error.|
IDEBench provides default configurations for these parameters but we also allow users of our benchmark to vary the parameters such that they can test a system with settings that match the requirements of their application. However, in order to enable comparability of benchmark results the default configurations should be used. The details of the default configurations are listed in the evaluation section.
Various researchers have identified the interaction response time as a crucial factor in interactive data exploration systems [?, ?, ?]. Liu et. al. [?], for instance, showed that even response times of about ms could lead to poor user performance. Studies of other tasks argue for even lower thresholds [?, ?]. In general, it is important that the speed of the systems aligns with the users’ speed of interaction. Acceptable response times may vary depending on the domain and task. For example [?, ?, ?, ?] claim that responses within one second allow users to stay focused, while response times over ten seconds exceed the average attention span. In our experiments, we used time requirements of s, s, s, s and s.
Dataset and Size This parameter represents the dataset to be used in the benchmark as well as its size in de-normalized form (the size of the fact table). Normalization of the data is applied as a post processing step after generation.
Think Time User studies have shown that there are significant delays (think time) between two consecutive interactions [?], which systems can leverage to run speculative queries. For the benchmark we recommend values between 3 and 10 seconds.
Using Joins While many systems only support queries on data in de-normalized form, e.g. [?, ?], IDEBench can also be run to measure query performance on more normalized datasets (e.g., represented as a star schema) .
Confidence Level Many approximate query processing (AQP) systems are capable of returning confidence interval for approximate results, and let users define the desired confidence level. IDEBench uses % as default.
As discussed before, the metrics of our benchmark should reflect the interactivity and the quality of the results throughout all operations of the executed workflows. For each executed query we evaluate the following metrics and aggregate them into a final report (see next section):
Time Requirement Violated Time Requirement (TR) Violated is a boolean value indicating whether or not a query violated the time requirement specified in the settings (see Section 1). TR is violated if time TR after initiating the query, no result is present or can be fetched. In practice this means, that for batch-processing and APQ systems, TR is violated if the run-time of a query is greater than TR, and no intermediate result is present. For progressive system, TR is violated if time TR after initiating a query no result can be fetched. IDEBench measures a boolean rather than the actual duration as time violation in order to guarantee constant run-times for any workflow; queries whose run-time exceed TR are cancelled.
|Time Requirement (TR) Violated||Boolean whether a query violated the time requirement Time Requirement (TR).|
|Missing Bins||The ratio of the number of bins bins for which no result has been delivered and the total number of bins in the ground-truth.|
|Mean Relative Error||The mean relative error of all bins returned in the result (see definition below).|
|Cosine Distance||A measure of how much the “shape” of a result resembles the ground-truth.|
|Mean Margin of Error||The mean of all relative margins of error for all bins.|
|Out of Margin||The number of approximate results that were outside of the return confidence interval.|
|Bias||The sum of all returned values in a result divided by the sum of all true results for the bins returned.|
Missing Bins/Groups Missing Bins/Groups is the ratio of all bins for which no result has been delivered, and all bins in the ground-truth. It is a measure of completeness for an aggregate query result, irrespective of the number of tuples processed by a system.
Mean Relative Error To measure the error between a result of an aggregate query and its ground-truth we compute the relative error; i.e., the ratio between the difference on the estimated result and the actual result .
We use the mean relative error due to its popularity in existing literature and ease of interpretation. However, it is important to note that the relative error is not defined for any , which is especially problematic for aggregate functions such as AVG, MIN/MAX, or SUM, if the expected value is zero. A possible future alternative is the Symmetric Mean Absolute Percentage Error, which is defined as:
While less intuitive, SMAPE is defined for , unless , in which case the error is . SMAPE is also bounded at 0 and 1, which may simplify interpretation across different experiments.
Cosine Distance In some cases users may be more interested in the relative difference of aggregated results, i.e. the distribution of values, rather than the true values. We measure the cosine distance to test how much the “shape” of a result deviates from its ground-truth. For instance, it captures if a system is able to provide a good estimate of the relative frequency distribution of the data, even if the relative errors are high. To make sure both result vectors and are of equal length, we set the value at each missing bin to zero.
Mean Margin of Error Approximate and progressive query system typically provide confidence intervals with their query results. To get a sense of how tight these intervals are, i.e. how likely the returned result was just a good guess, we compute the mean and standard deviation of all relative margins of error.
Out of Margin Out of Margin is a sanity check to test whether the system returns results roughly at the confidence level specified in the settings. We measure the number of how many of the per-bin results exceeded the returned margins of error.
Bias Indicates whether a system tends to over or under-estimate aggregated values. This metric becomes especially important if other error metrics are employed, as some (e.g., SMAPE) penalize under/over-estimation unequally.
Upon completion of running the benchmark, IDEBench generates two reports: 1) An aggregated summary report listing how frequently the time requirement was violated, how many bins are missing on average, and the distribution of mean relative errors for all queries which did not violated the time requirement. Figure IDEBench: A Benchmark for Interactive Data Exploration shows an example of such a summary report. 2) A detailed report listing all settings and metrics on a per query basis.
Ideally, a user wanting to explore a new dataset can effortlessly plug in the data. Therefore, users of IDEBench are required to report on all actions needed to be taken to prepare for a benchmark run (called “data preparation time” in our report); i.e., the time from connecting to a new data source to being actually able to start running the workload. This includes steps and time taken to copy the dataset into the system to create sample tables/views offline, perform pre-processing, warm-up queries, etc.
To demonstrate the applicability of IDEBench across various systems types, we executed the benchmark on the following systems: (1) MonetDB: A state-of-the-art open-source analytical column-store DBMS, which uses a blocking query execution model that requires users to wait until an exact query result is computed. Thus, upon initiating a query, the run-time of the query is unknown. (2) approXimateDB/XDB: A PostgreSQL-based DBMS that supports online aggregation using the wander join algorithm [?]. It allows for a maximum run-time to be set when initiating a query. It additionally supports a “report interval”, so that intermediate results can be retrieved at fixed time intervals. XDB has some limitations in terms of query support, which we describe in detail in Section IDEBench: A Benchmark for Interactive Data Exploration. (3) IDEA: A system that supports online aggregation and has a fully progressive computation model where, after initiating a query, results can be polled at any point in time. (4) System X: A commercial in-memory AQP system that operates on stratified sample tables (offline sampling). The run time of queries cannot be set explicitly, but must be specified by means of setting the size of samples tables, i.e. the sampling rate. (5) System Y: A commercial specialized engine for IDE, which provides an in-memory optimization layer on top of a number of DBMS systems.
In the remainder of this section, we first describe the general setup of all our default configurations (Section IDEBench: A Benchmark for Interactive Data Exploration), report on the data preparation time of each of the systems described above (Section IDEBench: A Benchmark for Interactive Data Exploration) and present the overall benchmark performance (Section IDEBench: A Benchmark for Interactive Data Exploration) as well as experiments with benchmark settings (Section IDEBench: A Benchmark for Interactive Data Exploration to Section IDEBench: A Benchmark for Interactive Data Exploration).
Default Configurations In the following, we discuss the default configurations that our benchmark defines. In the default configuration, IDEBench uses the flight dataset (see Section IDEBench: A Benchmark for Interactive Data Exploration) with S=100 million, M=500 million, and L=1 billion tuples in the de-normalized form (i.e., only one large table with all attributes). Moreover, the default configuration runs workflows for each of the workflow types described in Section IDEBench: A Benchmark for Interactive Data Exploration, as well as “mixed” workflows which exhibit usage patterns from all four workflow types. As parameters in the default configuration, we use five different time requirements 0.5s, 1s, 3s, 5s and 10s, with a confidence level set to 95%. While most recommended latencies are in the range of 0-1s, we also included greater ones in the default configuration to get a better understanding of how fast results converge. Finally, we use ten different think-times ranging from 1s to 10s (see Figure IDEBench: A Benchmark for Interactive Data Exploration) in our default configuration. We empirically found these to be good estimates by analyzing video logs of a previous user study [?].
Setup We ran the benchmark on MonetDB, approXimateDB, IDEA and System X. In order to stress-test the systems, we only report for the think-time of in all experiments except experiment 3 (see Section IDEBench: A Benchmark for Interactive Data Exploration), in which we analyzed the effect of varying the think-time. We did not run the full benchmark on System Y since it does not have a publicly available API. For System Y we executed selected workflows manually over its user interface in a separate experiment (see Section IDEBench: A Benchmark for Interactive Data Exploration). All experiments were conducted on a computer with two Intel E5-2660 CPUs (2.2GHz, 10 cores, 25MB cache) and 256GB RAM. We use the default configuration for all systems, and abstain from optimizations for any of the system parameters.
In our main experiment (Figure IDEBench: A Benchmark for Interactive Data Exploration and IDEBench: A Benchmark for Interactive Data Exploration) we analyzed how the four systems behave with respect to different time requirements (see Figure IDEBench: A Benchmark for Interactive Data Exploration). We show the results the mixed workload, i.e., 10 workflows with a mix of the four exploration behavior described in Section IDEBench: A Benchmark for Interactive Data Exploration, data size of 500M, a de-normalized schema, and a confidence level to 95%.
Data Preparation Time In MonetDB data stored in a CSV file can be loaded into the database through an SQL interface, which takes 19 minutes for 500M records. There is no pre-processing time upon starting the server. approXimateDB works identically, but takes 130min (time split between adding the data and adding a primary key). This system also provides support for an additional SQL statement to pre-load relations and indexes into the database buffer in main memory. For our experiments we did not make use of this statement. IDEA expects data in a single CSV file and does not need any pre-processing. On start-up, IDEA by default loads a fixed amount of tuples into main memory, which takes 3min. In System X data stored in a CSV file can be loaded into the database through a SQL interface. In order to be able to execute approximate queries, stratified sample tables have to be created offline. We used a sample size of 1% of the data size. System X further requires upon restart of the systems that each connection must execute a warm-up query. For 500M records we measured a data preperation time of 27min.
Speed and Quality Metrics As expected for an exact execution model, MonetDB’s TR (Time Requirement) violations decrease roughly linearly with the defined TR, and so does the proportion of missing bins (see also Figure IDEBench: A Benchmark for Interactive Data Explorationa). approXimateDB, on the other hand, violates the time requirement consistently around 66% for any TR. This is due to the fact that while approXimateDB supports online aggregation for COUNT and SUM, it does not provide online support for AVG nor for multiple aggregates in a single query. Thus, queries that cannot be executed online typically fail for a small TR. We therefore set up approXimateDB so that any query that cannot be executed online will fall back to a regular Postgres query. With System X more than 50% of all queries violate TR=0.5s. Interestingly, though, for TR=1s only 5% are violated, and for TR=3s all query results are returned on time. The percentage where TR is violated is therefore a good indicator of how large one should set the sample size, if speed is more important than result quality. IDEA on other hand does not violate any TR, with the exception of 1% of all queries for TR=0.5. The authors confirmed that this is due a slightly higher overhead for the first query after a restart of the system. IDEA also starts off with significantly less missing bins (37%) for TR=0.5 than any other system, but achieves similar values to System X for TR¿=1s. Furthermore, IDEA manages to perform better than other systems in terms of mean relative error of all return results. The median of all mean relative is constantly much lower than approXimateDB’s, and marginally lower than the one of System X. More interestingly, approXimateDB’s area above the curve is much higher than the one of IDEA and System X, indicating that high mean relative errors occur more frequently. A similar conclusion can be drawn by looking at the end of the curve. approXimateDB’s curve ends, below 50%, indicating that more than 50% of all mean relative error are greater than 100%.
Figure IDEBench: A Benchmark for Interactive Data Explorationb and IDEBench: A Benchmark for Interactive Data Explorationc show how the median of the mean relative margins, and the cosine distance develops with increasing time requirements. approXimateDB has significantly higher relative margins than both IDEA and System X. However, while System X’s median is close to 120% for TR=0.5s and drops to slightly above 20% for TR=1s, IDEA’s median remains constant around zero for all TRs. Figure IDEBench: A Benchmark for Interactive Data Explorationd compares how the proportion of missing bins differs based on which system and workflow type is used. As none of the systems we used in the evaluation use speculative execution by default, there are only few significant differences. For instance, MonetDB has fewer missing bins on average for independent browser and N:1 workflows, which may be attributed to the fact that any interaction of these workflows only trigger a single query.
In a third experiment, we compare the performance of MonetDB and approXimateDB using a normalized and de-normalized schema. We exclude IDEA as it does not support joins. Similarly, System X, only works on de-normalized data. We used our data generator to create two datasets of 100 million and 500 million tuples and normalized the data so that the fact table holds foreign keys to two dimension tables (airports and carriers). Interestingly, as can be seen in Figure IDEBench: A Benchmark for Interactive Data Exploratione, both MonetDB and approXimateDB perform slightly better in terms of time requirement violations with a normalized schema. We can attribute this to the fact that the overall size of all tables is reduced since splitting data into fact and dimension tables reduced the overall database size. Another interesting observation is that MonetDB’s proportion of TR violations grows with the size of the normalized dataset. Conversely, approXimateDB is able to keep it roughly at the same level, due to its online join support for aggregate queries.
In this experiment, we evaluated the impact of increasing think times between interactions (see Figure IDEBench: A Benchmark for Interactive Data Explorationf). We used an experimental extension of IDEA that speculatively executes queries when two visualizations are linked. For the setup we used a fixed data size of 500M tuples, a time requirement of 3s, and created a custom workflow comprising following four interactions. 1) data for a 2D count histogram (100 bins) of arrival delays vs. departure delays is requested. 2) data for a 1D count histogram (25 bins) of carriers is requested. 3) a link between the to visualizations is established, setting the 1D histogram as source and the 2D histogram as target 4) a single carrier is selected in the 1D histogram, forcing the 2D histogram to update
IDEA keeps track of a visualization graph, and executes queries for every possible single bin selection in the source visualization. If upon the next interaction one of the bins in the 1D histogram is selected, IDEA can return a potentially better estimate of the results, as the query has had more processing time. Figure IDEBench: A Benchmark for Interactive Data Explorationf shows the proportion of missing bins across ten think times (1s - 10s).
We also used the benchmark results to analyze each system for other effects, e.g., differences in performance in regards to bin widths/number of bins, binning types (1D vs .2D, nominal vs. quantitative ranges), as well as how systems respond to interactions that lead to multiple concurrently executed queries. We analyzed all queries listed in the detailed reports of all systems, but found no evidence that any of the factors above have a significant impact on the performance of any of the metrics. By far the most crucial factor in terms of query performance, seems to be the specificity of filter/selection predicates.
In the last experiment we replicated a selected subset of our workflows in a commercial IDE System Y and used MonetDB as a backend. We used a fixed data size of 500M and simulated three variants of the 1:N workflow type. In particular, we were interested to see if System X uses an intermediate layer that pre-fetches/computes results, similar to the experimental extension of IDEA (see Section IDEBench: A Benchmark for Interactive Data Exploration). However, we did not find this to be the case. System Y renders and updates the visualizations in the workload roughly at the same speed as when one uses MonetDB directly, with an added delay of about 1-2s per query. This is likely to be the rendering overhead to draw the visualizations.
Main Findings By implementing IDEBench for four systems we have shown that the performance in terms of data preperation time, TR violations as well as quality of the results can vary significantly. We saw that progressive and AQP systems like IDEA and System X were able to keep time violations at a minimum while maintaining low error rates with increasing data sizes and time requirements. This is in stark contrast to classical analytical databases represented by MonetDB where time violations increase for larger datasets and time requirements. We saw that approXimateDB can only execute a subset of the queries in our workload online. It has to revert to executing other queries in a blocking fashion, which in turn leads to significantly more TR violations. For AQP systems where sample tables need to be created offline (e.g., System X), quality of result metrics such as the relative error and missing bins remain constant across different time requirements. Thus, such a system would have to scan the full table or to create additional sample tables of different sizes in order to achieve a higher result quality. This in turn would increase the data preparation time. Furthermore, our results indicate at which point the use of an AQP that implements online sampling over offline sampling is beneficial; stratified sampling is able to provide results similar to online systems. However, there is significant overhead for offline sample-based approaches. Determining a “good” sample size to find a good trade-off between speed and quality of the results is time-consuming, and sample tables created offline cannot be fully optimized because in IDE the workload is unknown ahead of time.
Future Work The core idea of IDEBench is to provide a benchmark that simulates typical user behavior for basic tasks in IDE such as such as executing aggregate queries, filtering result sets, etc. [?, ?]. The richer the tasks the harder they become to benchmark. For the future we therefore envision an extensible benchmark design that defines different task specific core-sets where each core-set aims to analyze a different functional aspects; e.g., one core-set (as defined in this paper) only tests simple analytical operations, whereas another one tests more complex model building tasks. Which core-sets are used to evaluate a system depends on the supported functionality of that system.
Specifically, we envision for our benchmark four core-sets: Core-Set I focuses on Interactive Data Exploration as covered by the benchmark implementation presented in this paper. However, it excludes interactive model building, which is part of Core-Set II. For the future version of IDEBench, we believe that Core-Set I and II should be coupled as it seems unreasonable to assume that one would build a model without having the possibility to inspect the data set before. Core-Set III should then concerned with benchmarking recommendation engines which are often used to complement IDE systems by steering users to interesting parts of a new data set. Examples systems are SeeDB [?] or Data Polygamy [?]. Finally, for Core-Set IV we suggest to extend the benchmark to other data sets that allows us compare the interactive data cleaning capabilities covered by systems such as DataWrangler [?], Trifacta [?] or Paxata [?]).
We believe that there will be variety of systems that can only cover the functionality of some the core-sets above. We therefore envision that each core-set can be tested individually. The higher the core-set number the harder it is to define a benchmark since the sheer complexity of supported operations is increasing and their comparability becomes more difficult.
Finally, a last important aspect is to make the benchmarking code publicly available on the web [?] and include more recent benchmarking results on the already tested systems as well as results on other systems. Moreover, we plan to allow other research groups as well as industry to upload other data sets and user-defined workflows in the format that they can be included our framework to cover interesting aspects of other domains.
In this paper, we presented a new benchmark IDEBench for evaluating systems for interactive data exploration (IDE). IDEBench defines a new set of metrics and a workload generator to simulate different exploration behaviours of users as well as a data generator to better address the challenges of IDE workloads. Based on this new benchmark, we conducted an evaluation study that covered five different systems (approximateDB, IDEA, MonetDB as well as two commercial systems) of three different categories (traditional analytical database systems, approximate query processing engines, as well as specialized engines for IDE) that are used today in order to execute IDE workloads.
Table IDEBench: A Benchmark for Interactive Data Exploration shows an example of a detailed report generated by IDEBench. Each row describes a query and its evaluation results. id is a query identifier. interaction_id is a reference to the interaction a query is associated with; it is the index to an interaction in a workflow specification. viz_name is a reference to the visualization specification in a workflow. driver is the name of the driver used to run the benchmark. think_time, time_req, data_size refer to the benchmark settings (see Section 1). workflow is a the name of the workflow a query is part of. start_time and end_time are UNIX time stamps of when a query was initiated and when it returns/got cancelled. bins_dims indicate the number of binning dimensions in the visualization specification. binning_type indicate whether a nominal and/or quantitative bin range was used in the visualization specification. bins_ofm is a count of how many bins in the result of a query exceed the margin of error. bin_in_gt shows how many bins are in the ground-truth for a query. agg_type shows which aggregate function was used in the query specification. The remaining columns are the results for the metrics described in Section IDEBench: A Benchmark for Interactive Data Exploration. Note that rel_error_avg and margin_avg are the mean relative error/margin of error across all bins in a query result. The summary report (see Section IDEBench: A Benchmark for Interactive Data Exploration) computes the means of all metrics in the detailed report, aggregated on a per workflow-type basis.
- 1 S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The aqua approximate query answering system. In ACM SIGMOD, pages 574–576, 1999.
- 2 S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. Blinkdb: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 29–42. ACM, 2013.
- 3 L. Battle, R. Chang, J. Heer, and M. Stonebraker. Position statement: The case for a visualization performance benchmark. IEEE Internet Computing, 13(3):48–55, 2009.
- 4 T. Beigbeder, R. Coughlan, C. Lusher, J. Plunkett, E. Agu, and M. Claypool. The effects of loss and latency on user performance in unreal tournament 2003®. In Proceedings of 3rd ACM SIGCOMM workshop on Network and system support for games, pages 144–151. ACM, 2004.
- 5 J. Brutlag. Speed matters for google web search. https://services.google.com/fh/files/blogs/google_delayexp.pdf, 2009.
- 6 S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In ACM SIGCHI, pages 181–186. ACM, 1991.
- 7 S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. ACM Sigmod record, 26(1):65–74, 1997.
- 8 F. Chirigati, H. Doraiswamy, T. Damoulas, and J. Freire. Data polygamy: the many-many relationships among urban spatio-temporal data sets. In ACM SIGMOD, pages 1–15. ACM, 2016.
- 9 A. Crotty, A. Galakatos, K. Dursun, T. Kraska, C. Binnig, U. Çetintemel, and S. Zdonik. An architecture for compiling udf-centric workflows. PVLDB, 8(12):1466–1477, 2015.
- 10 A. Crotty, A. Galakatos, E. Zgraggen, C. Binnig, and T. Kraska. Vizdom: Interactive analytics through pen and touch - video. http://emanuelzgraggen.com/assets/video/vizdom_v1.0.mp4. Accessed: 2018-04-01.
- 11 A. Crotty, A. Galakatos, E. Zgraggen, C. Binnig, and T. Kraska. Vizdom: interactive analytics through pen and touch. PVLDB, 8:2024–2027, 2015.
- 12 A. Crotty, A. Galakatos, E. Zgraggen, C. Binnig, and T. Kraska. The case for interactive data exploration accelerators (IDEAs). In HILDA@SIGMOD, page 11. ACM, 2016.
- 13 F. Di Tria, E. Lefons, and F. Tangorra. Benchmark for evaluating approximate query processing on data streams. In Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), 2017 IEEE 26th International Conference on, pages 191–196. IEEE, 2017.
- 14 M. El-Hindi, Z. Zhao, C. Binnig, and T. Kraska. Vistrees: fast indexes for interactive data exploration. In ACM SIGMOD, page 5, 2016.
- 15 F. Färber, N. May, W. Lehner, P. Große, I. Müller, H. Rauhe, and J. Dees. The SAP HANA database – an architecture overview. IEEE Data Eng. Bull., 35(1):28–33, 2012.
- 16 A. Galakatos, A. Crotty, E. Zgraggen, C. Binnig, and T. Kraska. Revisiting reuse for approximate query processing. PVLDB, 10(10):1142–1153, 2017.
- 17 P. Hanrahan. Analytic database technologies for a new kind of user: the data enthusiast. In ACM SIGMOD, pages 577–578. ACM, 2012.
- 18 J. Heer and B. Shneiderman. Interactive dynamics for visual analysis. Queue, 10:30, 2012.
- 19 Omitted due to double blind requirement.
- 20 P. Jayachandran, K. Tunga, N. Kamat, and A. Nandi. Combining user interaction, speculative query execution and sampling in the dice system. PVLDB, 7:1697–1700, 2014.
- 21 N. Kamat, P. Jayachandran, K. Tunga, and A. Nandi. Distributed and interactive cube exploration. In ICDE, pages 472–483. IEEE, 2014.
- 22 S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In ACM SIGCHI, pages 3363–3372. ACM, 2011.
- 23 M.-T. Ke, S. Fujimoto, and T. Imai. Seedb: a simple and morphology-preserving optical clearing agent for neuronal circuit reconstruction. Nature neuroscience, 16:1154–1161, 2013.
- 24 D. A. Keim. Information visualization and visual data mining. IEEE Trans. Vis. Comput. Graph., 8(1):1–8, 2002.
- 25 A. Kemper and T. Neumann. Hyper: A hybrid oltp&olap main memory database system based on virtual memory snapshots. In ICDE, pages 195–206. IEEE, 2011.
- 26 F. Li, B. Wu, K. Yi, and Z. Zhao. Wander join: Online aggregation via random walks. In ACM SIGMOD, pages 615–629. ACM, 2016.
- 27 Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE transactions on visualization and computer graphics, 20:2122–2131, 2014.
- 28 Z. Liu, B. Jiang, and J. Heer. immens: Real-time visual querying of big data. In Computer Graphics Forum, volume 32, pages 421–430. Wiley Online Library, 2013.
- 29 Monetdb. http://www.monetdb.org. Accessed: 2017-11-02.
- 30 J. Nielsen. Powers of 10: Time scales in user experience. Retrieved January, 5:2015, 2009.
- 31 B. of Transportation Statistics. Bureau of transportation statistics. http://www.transtats.bts.gov, 2017. Accessed: 2017-10-21.
- 32 Paxata. http://www.paxata.com. Accessed: 2017-11-02.
- 33 S. C. Seow. Designing and engineering time: The psychology of time perception in software. Addison-Wesley Professional, 2008.
- 34 B. Shneiderman. Response time and display rate in human performance with computers. ACM Computing Surveys (CSUR), 16(3):265–285, 1984.
- 35 B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages, 1996. Proceedings., IEEE Symposium on, pages 336–343. IEEE, 1996.
- 36 Snappy data. https://www.snappydata.io/. Accessed: 2017-11-02.
- 37 H. H. G. W. Steffen Oeltze, Helmut Doleisch. Interactive visual analysis of scientific data. In Presentation at IEEE VisWeek 2012, 2012.
- 38 C. Stolte, D. Tang, and P. Hanrahan. Polaris: A system for query, analysis, and visualization of multidimensional relational databases. IEEE Trans. Vis. Comput. Graph., 8(1):52–65, 2002.
- 39 Tableau. http://www.tableau.com. Accessed: 2017-11-02.
- 40 Tableau. Tableau learn. https://www.tableau.com/learn/tutorials/on-demand/building-dashboard. Accessed: 2018-04-01.
- 41 TPC-DS. http://www.tpc.org/tpcds/, 2016. Accessed: 2017-11-02.
- 42 TPC-H. http://www.tpc.org/tpch/, 2016. Accessed: 2017-11-02.
- 43 Trifacta. http://www.trifacta.com. Accessed: 2017-11-02.
- 44 E. Zgraggen, A. Galakatos, A. Crotty, J.-D. Fekete, and T. Kraska. How progressive visualizations affect exploratory analysis. IEEE transactions on visualization and computer graphics, 23(8):1977–1987, 2017.
- 45 E. Zgraggen, R. Zeleznik, and S. M. Drucker. Panoramicdata: Data analysis through pen & touch. IEEE transactions on visualization and computer graphics, 20(12):2112–2121, 2014.