Adaptive Learning of Aggregate Analytics under Dynamic Workloads
Large organizations have seamlessly incorporated data-driven decision making in their operations. However, as data volumes increase, expensive big data infrastructures are called to the rescue. In this setting, analytics tasks become very costly in terms of query response time, resource consumption, and money in cloud deployments, especially when base data are stored across geographically distributed data centers. Therefore, we introduce an adaptive Machine Learning mechanism which is light-weight, stored client-side, can estimate the answers of a variety of aggregate queries, and can avoid the big data backend. The estimations are performed in milliseconds and are inexpensive and accurate, as the mechanism learns from past analytical-query patterns. However, as analytic queries are ad hoc and analysts’ interests change over time, we develop solutions that can swiftly and accurately detect such changes and adapt to new query patterns. The capabilities of our approach are demonstrated using extensive evaluation with real and synthetic datasets.
With the rapid explosion of data volumes and the adoption of data-driven decision making, organizations have been struggling to process data efficiently. Because of that, a surge of companies is turning to popular cloud providers that have created large-scale systems capable of storing and processing large data quantities. However, the problem remains: multiple analytics queries issued by multiple analysts (Figure 1) often overburden data clusters and are costly.
Data analysts should be able to extract information without significant delays so as not to violate the interactivity constraint set around 500ms (Liu and Heer, ). Anything over that limit can negatively affect analysts’ experience and productivity. This constraint is particularly important in the context of exploratory analysis (Idreos et al., 2015). Such analyses are an invariable step in the process of understanding data and creating solutions to support business decisions. Furthermore, aggregate analytics are becoming increasingly geo-distributed; such analytics are time consuming, and nearly impossible when data have to remain federated without the possibility of transferring them to central locations (Vulimiri et al., 2015). The same applies to sensitive data that can only be accessed via aggregate queries with no data samples allowed.
Vision: Figure 1 depicts our vision for an aggregate analytics learning & prediction system that is light-weight, stored on an Analyst’s Device (AD) and adaptive to dynamic query workloads. This allows the exploratory process to be executed locally at ADs, providing predictions to aggregate queries and thus not overburdening the Cloud/Central System (CS). Prediction-based aggregate analytics is expected to save computational and communication resources, which could be devoted to cases where accurate answers to aggregate queries are demanded. From the CS’s perspective, our system acts as a pseudo-caching mechanism that reduces communication overhead and computational load when necessary, thus allowing other tasks/processes to run.
Our system offers a learning-based, prediction-driven way of performing aggregate analytics in ADs accessing no data. It neither requires data transmission from CS to ADs nor from ADs to CS. What makes such a system possible is the exploitation of previously executed queries and their answers sitting in log files. We adopt Machine Learning (ML) regression models that learn to associate past executed queries with their answers, and can in turn locally predict the answers of new queries. Subsequent aggregate queries are answered in milliseconds, thus, fulfilling the interactivity constraint.
Furthermore, our framework can directly adapt to analysts’ (dynamic) query workloads by monitoring the analysts’ query patterns and adjusting the model parameters. Figure 1 shows the ML models and the mechanisms developed for detecting and adapting to changes in query patterns. These are discussed in Sections 3 and 4.
Challenges & Contribution: A large number of analysts exist within an organization with diverse analytics interests; thus, their query patterns are expected to differ, accessing different parts of the whole data-space. We are challenged to support model training over vastly different patterns, which may change drastically or expand in dynamic environments. Moreover, the models have to be up-to-date w.r.t. pattern changes, which requires early detection of query pattern changes and efficient adaptation. Given these challenges, our contributions are:
A novel query-driven mechanism and query representation that associates queries with their respective answers and can be used by ML models.
A local change detection mechanism for detecting changes in query patterns based on our prediction-error approximation.
A reciprocity-based adaptation mechanism in light of novel query patterns, which efficiently engages the CS to validate the prediction-to-adaptation states transition and guarantees system convergence.
Comprehensive assessment of the system performance and sensitivity analysis using real and synthetic data and query workloads.
2. Fundamentals of Query-Driven Learning
The fundamentals of the query-driven mechanism for analytics are: (1) transforming analytic queries into a real-valued vectorial space, (2) quantizing the vectorial space to extract query patterns and (3) training local regression models for predicting query answers using past issued queries. Principally, we learn to associate the result of a query using the derived query patterns and linking these patterns with local regression models. Given an unseen query, we project it to the closest query pattern we have learned and then predict its corresponding result without executing the query over the data in a DC/CS.
Definition 2.1.
A dataset is a set of row data vectors with real attributes .
Analytics queries are issued over a -dimensional data space and bear two key characteristics: First, they define a subspace of interest, using various predicates on attribute (dimension) values. Second, they perform aggregate functions over said data subspaces (to derive key statistics over the subspace of interest). We adopt a general vectorial representation for modeling a query over any type of data storage/processing system.
Predicates over attributes define a data subspace over formed by a sequence of logical conjunctions using (in)equality constraints (). A range-predicate restricts an attribute to be within range [, ]: . We model a range query over through conjunctions of predicates, i.e., represented as a vector in .
Definition 2.2.
A (range) row query vector is defined as corresponding to the range query . The distance between two queries and is defined as the norm (Euclidean distance): .
This representation is flexible enough to accommodate a wide variety of queries. As the dimensionality of the query vector is proportional to the data vector, queries with predicates bounding the values of different attributes can be used by the same ML algorithm. This means that only a number of values are set w.r.t the number of predicates for a given query. In addition, we make no assumptions as to the back-end system as what is being parsed are the filters in a query. This allows the mechanism to be deployed in parallel to multiple analytic systems.
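The vectorial representation and the query distance can be illustrated with a minimal sketch. It assumes a flat layout of one lower and one upper bound per attribute, and it assumes that attributes without a predicate fall back to their full domain; the text does not specify how unset values are filled. The names `query_vector`, `query_distance`, `columns` and `domain` are hypothetical.

```python
import math

def query_vector(predicates, columns, domain):
    """Encode range predicates as a flat [lo_1, hi_1, ..., lo_d, hi_d] vector.

    Assumption: an attribute with no predicate is given its full domain range.
    """
    vec = []
    for col in columns:
        lo, hi = predicates.get(col, domain[col])
        vec.extend([float(lo), float(hi)])
    return vec

def query_distance(q1, q2):
    """Euclidean (L2) distance between two query vectors."""
    return math.dist(q1, q2)

# Hypothetical two-attribute data space.
columns = ["x", "y"]
domain = {"x": (0.0, 100.0), "y": (0.0, 50.0)}

q_a = query_vector({"x": (10, 20)}, columns, domain)
q_b = query_vector({"x": (13, 24), "y": (0, 50)}, columns, domain)
print(q_a)
print(query_distance(q_a, q_b))
```

Because only the filter bounds are encoded, the same sketch applies regardless of which back-end system would execute the query.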
In query-driven learning, we learn to associate a query with its corresponding aggregate result (a scalar ). This is achieved using a set of query-result pairs obtained after executing queries over dataset . The goal is to develop an ML model based on to minimize the expected prediction error between actual and predicted , and predict the result of any unseen query without executing it over .
2.1. Query Space Clustering
Recent research analyzing analytics workloads from various domains has shown that queries within analytics workloads share patterns and their results are similar having various degrees of overlap (Wasay et al., 2017).
Based on this evidence, we mine query logs (the set) and discover clusters of queries (in the vectorial -dimensional space), having similar predicate parameters.
This partitioning is fundamental to get accurate ML models for predictive analytics, as we then associate different ML predictive models with different clusters.
In this way, learning different data sub-sets has been shown to be more efficient in terms of explainability / model-fitting and predictability than having one global ML model learn everything (Masoudnia and Ebrahimpour, 2014); this approach is also known as ensemble learning (Friedman et al., 2001).
To put this in context, consider a discrete time domain , where at each time instance an analyst issues a query . The query is executed and an answer is obtained, forming the pair .
The issued queries are stored in a growing set . Given this set, we incrementally extract knowledge from the query vectors and then train local ML models that predict the associated outputs given new, unseen queries. This is achieved by partitioning the vectors on-line into disjoint clusters that represent the query patterns of the analysts (fundamentally, the queries within each cluster are much more similar to each other than to the queries in other clusters). The distance between queries quantifies how close the predicate parameters are in the vectorial space. Close queries and are grouped together into
2.2. Query-Answer Predictive Model
Each aggregate result from the pair represents the answer produced by the CS. Essentially, is produced by an unknown function which we wish to learn. Such function produces query-answers w.r.t an unknown distribution . Our aim is to approximate the true functions for aggregate functions (descriptive statistics) e.g., count, average, max, sum etc. Regression algorithms are trained using query-answer pairs from to minimize the expected prediction error between actual , from the true function , and predicted from an approximated function , i.e., . After having partitioned the query space into clusters , we therein train local ML models, that associate queries belonging to cluster with their outputs . Each ML model is trained from query-response pairs from those queries which belong to such that is the closest representative to those queries. The originally trained ML models in DC/CS are then sent to ADs to be used for predicting answers. Given a query only the most representative model is used for prediction, corresponding to the closest :
where if ; 0 otherwise.
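The routing step described above (only the model of the closest representative answers a query) can be sketched as follows. This is an illustration under stated assumptions, not the system's implementation: the representatives are given rather than learned, and each local model is a stand-in that returns the mean answer of its cluster instead of a trained regressor. The names `closest`, `fit_local_models` and `predict` are hypothetical.

```python
import math
from statistics import mean

def closest(query, representatives):
    """Index of the representative nearest to the query (L2 distance)."""
    return min(range(len(representatives)),
               key=lambda k: math.dist(query, representatives[k]))

def fit_local_models(pairs, representatives):
    """Group (query, answer) pairs by closest representative and fit one
    stand-in model per cluster (here: the mean answer of that cluster)."""
    buckets = {k: [] for k in range(len(representatives))}
    for q, y in pairs:
        buckets[closest(q, representatives)].append(y)
    return {k: (mean(ys) if ys else None) for k, ys in buckets.items()}

def predict(query, representatives, models):
    """Answer a query using only the model of its closest representative."""
    return models[closest(query, representatives)]

# Hypothetical representatives and query log (query vector, aggregate answer).
reps = [[0.0, 10.0], [50.0, 60.0]]
log = [([1.0, 9.0], 100.0), ([0.0, 12.0], 120.0), ([49.0, 61.0], 900.0)]

models = fit_local_models(log, reps)
print(predict([2.0, 11.0], reps, models))
```

Replacing the per-cluster mean with any regressor that exposes fit and predict steps keeps the routing logic unchanged.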
3. Query Pattern Change Detection
Suppose that all trained ML models are delivered to the analysts from CS, indicating that the mechanism enters its prediction mode. That is, for each incoming query, it predicts the answer and delivers it back to the analysts without query execution. If we assumed a stationary query pattern distribution, in which queries and analysts’ interests do not change, then this would suffice. However, this is not realistic as it is highly likely that analysts’ interests change over time (e.g., during exploratory analytics tasks, which are considered as ad-hoc processes (Idreos et al., 2015)). So, dynamic workloads may render the models obsolete, as they were trained using past query patterns following distributions which may now be different. Accommodating such dynamics is becoming increasingly important as ML is widely adopted in software in production (Sculley et al., 2015). Specifically, when referring to analysts’ interests, we refer to analysts who are tasked with informing different business decision processes. If those tasks change, the data subspaces to be analyzed become different, which results in changed query patterns. If models cannot be adaptive, expected prediction errors can become arbitrarily high. When changes to , it is highly likely that any previous approximation would produce high-error answers, unless . We capture such dynamics as concept drift (Tsymbal, 2004; Ditzler et al., 2015) – many methods have been developed for adjusting when this arises (Gepperth and Hammer, 2016; Ditzler et al., 2015).
We introduce a Change Detection Mechanism (CDM) and an Adaptation Mechanism (ADM) (shown in Figure 1) to address this concern, which raises a number of challenges: (1) How to detect a query pattern change; we need to enable triggers that alert the mechanism, while in prediction mode, in case of a concept drift; (2) What kind of action should we take in case that happens, i.e., what strategy to follow for updating the ML models; (3) How should we notify users, analysts, and applications about such change(s), or even whom to notify; shall we transmit an update to all users or just the affected ones? We explore these challenges and describe the decisions we take in tackling them in the remainder.
3.1. Change Detection Mechanism
So far, we have trained different local ML models to predict answers involving only the -th model that best represents a new incoming query through the representative . This requires individually monitoring whether the query representatives, used for prediction via their respective models, are still representative in long-term predictions or whether the analysts’ query patterns have changed. For this, we need to introduce a CDM that triggers when the original query representative has significantly diverged from the estimated one.
Our approach can be best understood by first assuming that the CDM maintains an on-line average of the prediction error such that : . This is done for each query representative across different users. Should the expected error escalate significantly, then this may signal that a query pattern has shifted around the ‘region’ represented by the representative . But, recall that during the prediction mode, the actual is unknown since our goal is to predict accurate answers but without executing the query itself. Hence, we develop an approximation mechanism for change detection, not requiring query executions over CS/DC.
Once we have trained the individual ML models and calculated their expected prediction accuracy (using an independent test sample drawn from the original set of queries) we obtain the Expected Prediction Error (EPE), which will be constant across all possible queries associated with a particular query representative defined as: . Using the EPE, we wish to find a fine-grained estimate of the true prediction error rather than just assuming this is constant for each and every unseen query.
To do this, we have analyzed the error behaviour under changing query patterns.
Our findings reveal an interesting fact: the Euclidean distance of a random query from its closest query representative is strongly correlated with the associated prediction error .
where the natural-log operator acts as a penalizing/discount factor for queries given their distance from the closest representative . The second term within the natural-log operator, is the minimum distance between the query representative and the associated queries . We subtract the minimum distance from so that the scale of the numbers will not affect the computation of the error.
We base our novel CDM on (2), using the series of error approximations for monitoring concept drifts in query patterns during prediction mode without executing the queries.
Consider the incoming unseen (random) queries arriving in a sequence during prediction mode. They are answered by specific local ML models , generating a series of distance-based error estimations , . The CDM monitors this series and, based on a specific threshold, signals the existence of concept drift, i.e., checks whether the probability distribution of the queries has changed. Based on the series of error estimations, we learn two query distributions: (1) the expected query distribution, which is represented by the query representatives and (2) the novel query distribution, which cannot be represented by the current query representatives. The expected distribution is estimated given a training period from values corresponding to queries with closest representative . The novel distribution is estimated from values corresponding to error values derived from the rival representatives of queries with closest and . Based on this, we estimate the distribution of the error values generated from representatives which were not the closest to the queries, thus, approximating novel error values. Both distributions were approximated by fitting the distribution with scale and shape .
Given a value, we calculate the likelihood ratio and the cumulative sum of up to time , Based on the sequential ratio monitoring for a progressive concept drift in distribution (Grigg et al., 2003) from to , a decision function is introduced for signaling a potential concept drift expressed as:
The decision function in (3) indicates the current cumulative sum of ratios minus its current minimum value. This denotes that the change time estimate is the time following the current minimum of the cumulative sum, i.e., Therefore, given that , the decision function in (3) is rewritten in a recursive form: with setting, by convention, and . Hence, a concept drift of query patterns projected over the query representatives space is detected at time : The parameter is usually set to the standard deviation of . The process is shown in Figure 2: the cumulative sum of ratios exceeds the threshold as soon as queries are issued from an unknown distribution, as the error estimates become steadily larger and are not just random fluctuations in errors. It is worth noting that the change in query distribution is based on fusing the distance between the queries and their closest representatives scaled with the expected prediction error. We refer to this as an indication of degradation in the performance of the model. Once a change has been detected, the CDM signals the ADM that new query patterns have been detected, and the ADM transits from prediction mode to buffering mode, as shown in Figure 3. In turn, the ADM signals the Prediction Component (containing the and ) to be put in BUFFERING mode, since the prediction component can no longer provide reliable answers for all queries. However, the AD can still leverage the complete system to ask queries following the already known distributions, with only queries following the new shifted distribution being executed at the CS. By entering BUFFERING mode, our ADM starts to adjust for the new query patterns under the AD until converging. At that point it signals the Prediction Component to switch back to PREDICTION mode, resuming normal operation.
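The recursive cumulative-sum test described above can be sketched in a few lines. This is a generic CUSUM detector over log-likelihood ratios, offered as an illustration under assumptions rather than the exact decision function of the paper: both error distributions are assumed to be Gamma distributions parameterized by shape and scale, and the parameter values and the threshold below are made up for the example. The names `gamma_logpdf` and `detect_drift` are hypothetical.

```python
import math

def gamma_logpdf(x, shape, scale):
    """Log-density of a Gamma(shape, scale) distribution at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))

def detect_drift(errors, expected, novel, threshold):
    """Run a recursive CUSUM over log-likelihood ratios of error estimates.

    Returns the index of the first estimate at which the statistic exceeds
    the threshold, or None if no drift is signalled.
    """
    g = 0.0  # cumulative sum, reset to zero whenever it would go negative
    for t, u in enumerate(errors):
        ratio = gamma_logpdf(u, *novel) - gamma_logpdf(u, *expected)
        g = max(0.0, g + ratio)
        if g > threshold:
            return t
    return None

# Assumed (shape, scale) pairs: small errors are expected, large ones are novel.
expected = (2.0, 0.5)
novel = (2.0, 2.0)

# Ten estimates from the known pattern followed by five drifted ones.
stream = [1.0] * 10 + [4.0] * 5
print(detect_drift(stream, expected, novel, threshold=5.0))
```

In this toy stream the statistic stays at zero while estimates match the expected distribution and crosses the threshold on the second drifted estimate, so isolated fluctuations do not trigger a signal while a sustained increase does.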
4. Model Adaptation
In this section, we explain the fundamentals of the ADM along with unintended results we can exploit. Once a local model transits to buffering mode, it is deemed too unreliable to accurately predict the answers to incoming queries. Therefore, during this phase, queries should be executed and their actual answers returned to analysts while also being used for adapting the model and representative . In the beginning of buffering mode, and are sent to CS, which will tune/adapt their parameters using the actual execution of queries.
To reduce the expected number of queries executed in CS during buffering mode, we introduce a query execution selectivity mechanism based on the current estimated error in (2). Specifically, there would still be some queries issued by the AD that could be locally answered by current models cached in AD during that phase. Therefore, the AD still monitors incoming queries and discriminates between two types: (1) the ones that can be locally answered by models in and (2) the ones that cannot be answered, since these queries are not well represented by the cached query representatives. The latter queries are then forwarded to CS for execution. The selectivity mechanism relies on the following rule: an incoming query at an AD in buffering mode is answered locally if the representative of the new query patterns, denoted by , is not its closest representative, i.e., and . If the query is closest to the not-yet-converged novel representative, then it is forwarded to CS for execution. However, since the novel representative has not converged, we also consider the distance from its rival (second closest) converged representative as a backup. The rival representative can provide assistance and answer the query locally instead of forwarding it to the CS if it is close enough to include in the range around its variance .
Theorem 4.1.
Given a random query whose distance from its rival (second closest) representative is greater than , the upper bound of the forwarding probability for query execution is .
Let the query be projected to its closest representative , which is not yet converged, and let its second closest be the converged . The representative corresponds to the mean vector of those queries belonging in the cluster . For the query to be forwarded to the CS for execution, should not be the mean vector for the incoming query . This is indicated if the distance is greater than a proportion of the norm of the variance of the cluster by a factor . Hence, the query is sent from the AD to the CS if this distance is greater than , which is stochastically bounded by the factor based on Chebyshev’s inequality . ∎
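The selective forwarding rule can be sketched as a small routing function. The sketch assumes a single novel representative, at least two representatives in total, and a precomputed scalar spread per converged cluster (for instance the norm of its variance vector); `route_query`, `spreads` and `gamma` are hypothetical names. For reference, Chebyshev's inequality states that a random variable lies at least gamma standard deviations from its mean with probability at most 1/gamma^2, which is the kind of bound the proof relies on.

```python
import math

def route_query(query, representatives, novel_idx, spreads, gamma):
    """Decide how a query is handled while the AD is in buffering mode.

    Returns ("local", k) if converged representative k answers the query,
    or ("forward", None) if the query must be executed at the CS.
    """
    order = sorted(range(len(representatives)),
                   key=lambda k: math.dist(query, representatives[k]))
    closest, rival = order[0], order[1]
    if closest != novel_idx:
        # The query follows an already known pattern.
        return ("local", closest)
    # Closest to the unconverged novel representative: fall back to the
    # rival only if the query lies within gamma spreads of it.
    if math.dist(query, representatives[rival]) <= gamma * spreads[rival]:
        return ("local", rival)
    return ("forward", None)

# Hypothetical setup: two converged representatives and one novel one.
reps = [[0.0, 0.0], [10.0, 10.0], [4.0, 4.0]]
spreads = {0: 2.0, 1: 2.0}  # assumed scalar spread per converged cluster

print(route_query([1.0, 1.0], reps, novel_idx=2, spreads=spreads, gamma=2.0))
print(route_query([2.5, 2.5], reps, novel_idx=2, spreads=spreads, gamma=2.0))
print(route_query([3.0, 3.0], reps, novel_idx=2, spreads=spreads, gamma=2.0))
```

A larger `gamma` keeps more queries local at the cost of answering them with a model trained on a different pattern.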
4.1. Taking advantage of Affiliates
In the ADM, we take advantage of the tuning process taking place in the CS. We exploit queries coming from the original AD which triggered the CDM and other queries coming from different ADs also in buffering mode due to some other independently triggered CDMs. We call these potential ADs affiliates, belonging to set , since their executed queries and actual answers are used for tuning the stale models. Let ADs be connected to CS and assume that each AD system referring to model enters its buffering mode independently of the others with entry probability . Then, the probability that an AD (being in buffering mode) meets at least one affiliate in CS is . The expected number of affiliates is then approximated by under the assumption that the entry probabilities are almost the same . This expectation will be used for studying the knowledge expansion in terms of novel query patterns being delivered to an AD via our reciprocity adaptation mechanism.
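As a worked illustration of the affiliate probabilities, the following sketch computes both quantities directly from the independence assumption stated above. It takes the entry probabilities of the other ADs as input and makes no claim about the closed-form expressions used in the analysis; the function names are hypothetical.

```python
import math

def prob_at_least_one_affiliate(entry_probs):
    """Probability that at least one other AD is in buffering mode,
    assuming the ADs enter buffering mode independently."""
    return 1.0 - math.prod(1.0 - p for p in entry_probs)

def expected_affiliates(entry_probs):
    """Expected number of other ADs in buffering mode (sum of probabilities)."""
    return sum(entry_probs)

# Two other ADs, each entering buffering mode with probability 0.5.
others = [0.5, 0.5]
print(prob_at_least_one_affiliate(others))
print(expected_affiliates(others))
```

When all entry probabilities are nearly equal, the expectation reduces to the number of other ADs multiplied by that common probability.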
4.2. Model Adaptation & Reciprocity
In the CS, when a query is selectively forwarded from AD, the process of model adaptation proceeds as follows: for adapting to new query patterns, we rely on the principle of explicit partitioning (Tsymbal, 2004; Ditzler et al., 2015), as a natural extension of our strategy using an ensemble of local ML models. To adjust to new query patterns, we train a new model using executed queries and their answers in CS. This is the optimal strategy for expanding the current as other methods might lead to catastrophic forgetting (Gepperth and Hammer, 2016). Indicatively, such methods adopt strategies to adapt the current model by adjusting to new patterns whilst forgetting the old ones. In our context, this is not applicable since analysts have the flexibility to issue queries either conforming to the old patterns or to the new ones, depending on the analytics process.
The adaptation process is performed with parameters: the query prototypes and their associated ML models as shown in Algorithm 1. Recall that the analyst’s device has cached models and the DC/CS adapts the received parameters by learning the new underlying query patterns and based on these trains the new ML model. Let the queries series coming from the AD to CS based on selective forwarding. This means that most likely a query conforms to new query patterns thus sent to CS for execution. Once query is executed and its actual answer is obtained, it is then considered as a new (initial) representative for . The pairs are then used to incrementally update and then buffered in , which will be the training set for (lines 4-13). The adaptation of to follow the new query pattern is achieved by Stochastic Gradient Descent (SGD) (Bottou, 2012), which is widely used in statistical learning for training in an on-line manner considering one training example (query-answer) at a time. We focus on the convergence of the query distribution by moving the new query representative towards the estimated median of the queries in and not the corresponding centroid. This is introduced so that the new representative converges to a robust statistic, free of outliers and more reliable than the centroid (mean vector). The convergence to the median denotes with high reliability convergence to the distribution, which is what we desire for model convergence. In this case, we provide the adaptation rule of the new query representative to converge to the median of the forwarded queries, as provided in Theorem 4.2.
Theorem 4.2.
The novel representative converges to the median vector of the queries executed in the DC w.r.t. update rule , ; is the signum function.
Each dimension of the median vector of the queries in a sub-space satisfies: . Suppose that the new representative has reached equilibrium, i.e., holds with probability 1. By taking the expectations of both sides of the update rule and focusing on each dimension , we obtain that: . Since is constant, then , which denotes that converges to the median of , . ∎
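The sign-based update rule of Theorem 4.2 can be illustrated with a short simulation. The sketch applies one signum step per dimension for every forwarded query and uses a learning rate of 1/t, which is an assumed decaying schedule and not necessarily the one used by the system. The names `median_update` and `track_median` are hypothetical.

```python
def sign(x):
    """Signum function: -1, 0 or 1."""
    return (x > 0) - (x < 0)

def median_update(w, q, eta):
    """One step of w <- w + eta * sgn(q - w), applied per dimension."""
    return [wi + eta * sign(qi - wi) for wi, qi in zip(w, q)]

def track_median(queries, w0):
    """Move a representative towards the per-dimension median of a stream."""
    w = list(w0)
    for t, q in enumerate(queries, start=1):
        eta = 1.0 / t  # assumed decaying learning rate
        w = median_update(w, q, eta)
    return w

# Toy stream: the first dimension has median 2 and mean 21.2 because of
# one outlier; the second dimension is constant at 5.
cycle = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [100.0, 5.0]]
stream = cycle * 1000

w = track_median(stream, w0=[0.0, 0.0])
print(w)
```

The representative settles near the median of each dimension rather than the mean, so a single outlying query does not pull it away from the bulk of the pattern.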
Given the -th incoming query issued by analysts in the -th affiliate AD, the CS assesses the selectivity forwarding criterion: . If it holds, these affiliate queries are exploited for expanding the query patterns (lines 14-18). That is, CS buffers the pairs in affiliate set which will be used later for training new ML models enriching the predictability variety of . In this case, we obtain the affiliate new representative generated by query patterns coming from the affiliate AD . Similarly to the new , affiliate is incrementally adapted through SGD with the aim to converge to the corresponding median of the affiliate queries in . For the new and the possibly affiliate , the median convergence rule involves the learning rate:
which decreases as more queries are appended to and ; the higher the number of affiliate query representatives, the faster the convergence to the median. This demonstrates the exploitation of affiliates to the adaptation of .
The convergence of the representatives is checked by the subsequent adjustments in positions that makes. If that change is lower than a threshold then convergence has been achieved. After the convergence of the query representative and affiliates (if any), the CS trains the new models and , using the and , respectively. The new ML models and new representatives are then delivered to AD (lines 20-26). Evidently, set is now expanded with one more representative and on average affiliate representatives along with their regression models. The adapted and updated is expected to have query representatives and ML models at the end of the buffering phase.
4.3. Convergence to an Offline Mode
When the system transits from the buffering to prediction mode, the enhancement of and gradually decreases the probability to enter the buffering mode in the future, in light of learning the query patterns not only derived from the analysts interacting with the CS/DC, but also the patterns from other analysts in other ADs. This indicates that the gradually expanding sets reflect the analysts’ way of exploring and analyzing data among data centers. Because of this expansion, the transition probability from prediction to buffering mode gradually decreases saving computational and communication resources at the network and CS. The expected ratio of new models in an AD transiting from the -th buffering mode with representatives to the -th prediction mode is: , with . Such ratio increases to unity after certain prediction-buffering transitions denoting that all query sub-spaces are known:
An AD model then learns all possible query sub-spaces via its analysts and affiliate models with rate . The entry probability to buffering mode decreases with the same rate, thus, reducing the CS execution overhead and communication load by transiting the AD to ’offline’ mode. This is the advantage of the query-driven analytics over dynamic workloads with the expected query execution rate in CS being bounded:
Theorem 4.3.
The expected query execution rate in the CS is bounded by .
Consider that at a certain time instance, there are affiliate ADs which enter their buffering mode with entry probability . Then, the probability that at least one affiliate of an AD exists in the CS in buffering mode is . Given that a query is forwarded with upper probability for those ADs in buffering mode, the expected number of queries executed in the CS from an AD and its affiliates within any arbitrary time interval is . ∎
5. Evaluation Results
The main questions we are striving to answer in our evaluation are the following:
How accurate are the given predictions for a variety of aggregate queries?
Is there a single ML model that can be used for this purpose?
What are the effects of predicates and data dimensionality (number of columns) on estimating the results of aggregate queries?
How light-weight and efficient are the models, and can they be stored on ADs for efficient execution?
How effective are the CDM/ADM mechanisms, and what is the effect of continuously learning and adapting to new queries?
5.1. Implementation & Experimental Environment
To implement our algorithms we used scikit-learn, XGBoost (Chen and Guestrin, 2016) and an implementation of the Growing-Networks algorithm (Marsland et al., 2002). We performed our experiments on a desktop machine with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz and 16GB RAM. For the real datasets, the GrowingNetworks algorithm was used for clustering, mainly because it does not require selecting a pre-defined number of clusters and because of its ability to grow naturally (as required by our adaptability mechanisms). XGBoost was used as the supervised learning algorithm because of its superior accuracy compared to the other algorithms we tested, as shown in our evaluation.
Real datasets: We use the Crimes dataset from (6) and the Sensors dataset from (15). The Crimes dataset contains and the Sensors dataset data vectors. We created synthetic query workloads over these datasets as real query workloads do not exist for this purpose as also attested by (Wasay et al., 2017). For Crimes, we generated predicates restricting the spatial dimension and for Sensors the temporal dimension as essentially this is what analysts would be doing in exploration tasks. For the predicates in the spatial dimension we used multiple multivariate-normal distributions to simulate the existence of multiple users. For the temporal dimension, we used a uniform distribution. We then recorded the results of the descriptive statistics COUNT, MEAN, SUM over different attributes in the datasets to sufficiently make sure the workload is randomized.
Synthetic datasets: We also generated synthetic datasets and workloads to stress test our system. We generated a varying number of predicates and attributes to see how it would affect a state-of-the-art model chosen by an initial study comparing different models under different aggregates. This helped us understand the implications of our chosen representation and identified under what conditions the accuracy deteriorates.
We measured the prediction accuracy of our system using both synthetic and real datasets. We examine the median relative error, unless stated otherwise. We first examine the predictability of different descriptive statistics using a variety of supervised ML models. The experiments were conducted using our synthetic workloads to test accuracy across a varying number of attributes/columns and predicates. The models are trained using of the total queries () and tested against . Where: Ridge is for Regularized Linear Regression model, SGD is a linear regression model trained on-line using SGD, SVR is for Support Vector Regression with RBF kernel, XGB is the XGBoost model. None of the models was hyper-tuned to provide the purest accuracy as we desired to test whether different statistics could be better estimated by different models, thus indicating the need to choose optimal models at training. Figure 5 shows the results of this experiment. The main takeaways are as follows. First, query-driven learning consistently produces low relative errors across all statistics. Second, it is largely insensitive to underlying ML models; given our proposed representation, all models are able to predict the given statistics with small error, well below . Third, there is some high variation across the reported error as all of the workloads with varying predicates and columns were used. Finally, all statistics can be adequately predicted by a single algorithm, that being XGB. The latter represents an advantage for this work, as the system can be optimized for storing such models and can be designed around this single class of model instead of trying to accommodate a variety of them, each with its own restrictions.
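For concreteness, the median relative error reported throughout can be computed as in the sketch below. Skipping queries whose true answer is zero is an assumption made here to avoid division by zero; the text does not state how such cases are handled. The function name is hypothetical.

```python
from statistics import median

def median_relative_error(actual, predicted):
    """Median of |y - y_hat| / |y| over all queries with a non-zero answer."""
    errors = [abs(y - y_hat) / abs(y)
              for y, y_hat in zip(actual, predicted)
              if y != 0]  # assumption: zero-valued answers are skipped
    return median(errors)

# Toy example with three query answers and their predictions.
actual = [100.0, 200.0, 50.0]
predicted = [110.0, 180.0, 50.0]
print(median_relative_error(actual, predicted))
```

Using the median rather than the mean keeps the metric from being dominated by a few queries with very small true answers.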
Using the most accurate model derived from our experiments (XGB), we evaluated the prediction accuracy over different statistics under different conditions, as shown in Figure 6. As expected, increasing the number of predicates and the number of columns has a negative impact on accuracy. The representation used is high dimensional, as the query vector size is , where is the number of columns in a dataset. Hence, for columns used, Figure 6(right), the model is trained using . This is no trivial task and stress tests our system’s capabilities. In addition, the number of predicates used is the number of elements which restrict the sub-space, and these elements are sparsely populated. Again, this can make the fitting process harder. However, even under these conditions, the increase in error is no more than linear w.r.t. increasing the number of predicates and columns. Statistics such as MAX are less impacted by this change, as their error seems to be invariant from the beginning. Nonetheless, the results are reassuring: the proposed approach delivers very low errors, even when a large number of predicates and columns are used in queries. To put things in perspective, the median number of columns selected in a query is around (Kandula et al., 2016) in typical workloads, which our proposed workload also follows.
We experiment with real datasets to demonstrate the applicability of our system under real conditions. As evident from Figure 7, our system provides estimations for descriptive statistics over different types of real datasets with relative error below . (A relative error below 10% is the target of modern state-of-the-art approximate-answer production systems (Kandula et al., 2016).) We also tested these datasets using VerdictDB (Park et al., 2018), a state-of-the-art system in Approximate Query Processing. The errors obtained varied from with sampling ratio of . These results are comparable to ours and show that our system can be reliably used in parallel to such engines when local access is needed and resource consumption at CS is to be minimized. After training the system with more than ca. 2000 queries, the relative error starts approaching its minimum value rather swiftly for both datasets (named Crimes and Sensors). This demonstrates the capability of the proposed learning approach to offer high-accuracy estimates of analytical query answers with only a fairly small number of training queries. Note that typical industrial-strength in-production big data analytics clusters used for approximating answers to such analytical queries receive several million queries per day (Kandula et al., 2016). Therefore, one can expect that a system employing our approach would receive a few thousand training queries in just a few tens of seconds.
5.3. Performance & Storage
We examine the performance and storage requirements of our system. This is important as our solution has to be light-weight, both in terms of storage overhead for ADs and in terms of transferring models efficiently through the network. We examine all the above-mentioned models to identify the most efficient ones in training and prediction. A synthetic workload with predicates and columns is used for training all the models. For Prediction Time (PT), Table 1 reports the expected prediction time and standard deviation of each model. As expected, SVR has the worst performance. The central takeaway here is that PTs are negligible, much less than a millisecond, thus guaranteeing efficient statistics estimation irrespective of the adopted model. Even though there are multiple models trained, to account for varying query patterns, the time complexity associated is with usually being small. We show that our learning models, based on mining query logs and learning from query-answer pairs, can go a long way in amplifying the capabilities of analytic system stacks: they can act as a ’caching’ mechanism without actually storing the results of past queries, instead using models to estimate answers for new queries.
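A minimal sketch of why prediction stays cheap with several local models follows. It assumes that each incoming query is routed to the local model whose query-space representative is closest, found by a linear scan; the function names, the Euclidean distance, and the toy models are hypothetical and serve only as illustration.

```python
import math

def nearest_model_index(query, representatives):
    """Linear scan: return the index of the representative closest to the query."""
    best_idx, best_dist = -1, math.inf
    for idx, rep in enumerate(representatives):
        dist = math.dist(query, rep)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def estimate_answer(query, representatives, models):
    """Route the query to its nearest local model and return that model's estimate."""
    return models[nearest_model_index(query, representatives)](query)

# Hypothetical setup: two query sub-spaces, each with a toy local model.
representatives = [(0.0, 0.0), (10.0, 10.0)]
models = [lambda q: 1.0, lambda q: 2.0]
answer = estimate_answer((9.0, 8.0), representatives, models)
```

Under this assumption the routing cost grows linearly with the number of local models, so it remains small as long as the number of models is small.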
To measure the training time of individual models, we varied the number of training samples and examined the expected model Training Time (TT), reported in Table 1. We used training samples varying in size w.r.t. . For SVR, we stop recording after training samples, as the algorithm is no longer efficient and should be avoided. Although XGB appears to perform the worst, we note that its TT is no more than seconds without using its multi-threading capabilities.
We also examine the model Size in KB, shown in Table 1, and observe that it is a negligible cost, fulfilling our initial requirement that the system be light-weight enough to reside in the AD’s memory. The results are for one individual model; therefore the resulting storage cost is times the initial one. This could be a significant overhead, especially for SVR as . However, the cost incurred is not prohibitive, as the benefits of decreased latency, offloading queries otherwise issued to the cluster, and no extra monetary cost are far greater.
Table 1: Size (KB), Training Time TT (s), and Prediction Time PT (ms) of each model.
To examine our CDM and ADM under concept drift, we devised the following experiment: a model that has already learned a particular distribution of queries is deployed to answer queries. At a particular point in time, the query patterns might shift, as shown in Figure 8, which shows two different query distributions. Figure 8(left) shows the distributions of the query answers . Their respective query patterns are shown in Figure 8(right).
Parameters and are responsible for the ADM and CDM with the impact of shown in Figure 10(left). As we increase we allow for an early exit from buffering mode. An early exit means that fewer queries have been processed, so the examples are potentially insufficient for accurately learning the distribution. This is witnessed in Figure 10(left), where the relative error increases, and hence the accuracy decreases, as we increase .
Figure 10(right) shows the diminishing probability of entering the buffering mode, building upon our discussion of slowly converging the system into an Offline mode. As more queries are processed across varying query spaces, our system incrementally learns the whole data space. At a certain point, all query subspaces will be known along with their representatives. Thus, the probability of entering the buffering mode due to a potentially unknown query distribution reduces to zero almost surely. We provide an experiment in which there is a predefined fixed number of Query Spaces (QS) , . Queries are generated randomly in a sequence from one QS to another, each time learning . Thus, given this fixed number of QSs and set , the probability of entering buffering mode can be approximated by , where denotes the number of known QS so far. Relating this to Figure 10, we observe that the probability reduces in a step-wise manner, tending to zero when . For a relatively high value, the rate of convergence to the offline mode becomes faster but with a higher error, as witnessed by the previous experiment. As for parameter , a low value indicates smaller tolerance when estimating errors and vice versa. This might force the system to adapt when not needed. Thus, it is appropriate to hyper-tune the parameter according to the domain; hyper-tuning is part of our on-going research. However, we have found the proposed heuristic of to work well empirically.
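The step-wise decay described above can be reproduced with a small simulation. This is a sketch under stated assumptions: queries pick one of K query spaces uniformly at random, an unknown query space is learned as soon as it is first encountered, and the chance that the next query falls in a still-unknown space is taken to be (K - k)/K for k known spaces. That expression is our illustrative assumption and not necessarily the exact approximation used in the experiment; all names are hypothetical.

```python
import random

def buffering_probability_trace(num_spaces, num_queries, seed=0):
    """Record, before each query, the assumed probability of entering buffering mode."""
    rng = random.Random(seed)
    known = set()   # query spaces learned so far
    trace = []
    for _ in range(num_queries):
        # Assumed probability: fraction of query spaces that are still unknown.
        trace.append((num_spaces - len(known)) / num_spaces)
        qs = rng.randrange(num_spaces)  # the query space the next query falls in
        if qs not in known:
            known.add(qs)               # buffering mode entered; the space is learned
    return trace

trace = buffering_probability_trace(num_spaces=5, num_queries=200)
```

The trace starts at 1, never increases, and drops by 1/K each time a new query space is learned, which mirrors the step-wise shape discussed for Figure 10(right).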
6. Related Work
Our work is related to prior work in analytical-query processing and in applied ML research communities and to prior work focusing on the benefits of the query-driven approach in analytical query processing and tuning (Anagnostopoulos and Triantafillou, 2015, 2017; Ma et al., 2018; Sattler et al., 2003). Analytical queries nowadays are executed over underlying systems that provide either exact answers (Melnik et al., 2010; Thusoo et al., 2009) or approximate answers (Park et al., 2018; Agarwal et al., 2013; Park et al., 2017; Kandula et al., 2016; Hellerstein et al., 1997) working over large big data clusters in DCs/CS requiring several orders of magnitude longer query response times. The contributions in this work are largely complementary to all this work. Specifically, during the training phase and in the adapting/buffering phase the system proposed here can be supported either by an exact or an approximate query processing engine. In addition, what makes our solution different is that it can be stored locally on an analyst’s device as it has low storage overhead and also requires no communication to the cluster.
Query-driven models are widely deployed both for aggregate estimation (Anagnostopoulos and Triantafillou, 2017, 2015) and for hyper-tuning database systems (Van Aken et al., 2017). Unlike (Anagnostopoulos and Triantafillou, 2017, 2015), our focus is on a wide variety of aggregate operators and not just COUNT for selectivity estimation. Furthermore, we address the crucial problem of detecting query pattern changes and adapting to them, which (to our knowledge) has not been addressed in this context before. Hence, our framework can be leveraged by all query-driven implementations in cases of dynamic, non-stationary workloads.
Moreover, concept drift adaptation is well understood (Tsymbal, 2004; Gepperth and Hammer, 2016; Ditzler et al., 2015; Elwell and Polikar, 2011), mostly dealing with classification tasks, where classifiers adapt to new classes. We adapt concept drift to query-driven analytical processing, relying on explicit partitioning (Gepperth and Hammer, 2016), which avoids destructive forgetting, given that the accuracy for the previously learned query patterns will not degrade. It is also favorable given our initial off-line design, which already uses partitioning for clustering the query patterns and learning local models in given sub-spaces. Our work contributes the monitoring and detection of real-time query pattern changes based on approximating the prediction error. This differs from previous concept drift methods, which measure the actual error; evidently, this is not applicable in our case. Finally, we propose a novel reciprocity-driven adaptation mechanism, which decides when a new model should be trained by engaging the knowledge derived from other, possibly changing, models in the CS.
In this work we contribute a novel framework for adapting trained models under concept drift. We focus on models used for estimating analytical query answers efficiently and accurately; however, we note that the framework is applicable in other domains as well. The contributions centre on a novel suite of ML models, which mine past and new queries and incrementally build models over quantized query-spaces using a vectorial representation. The described mechanisms (ADM and CDM) are able to adapt under changing analytical workloads while maintaining high accuracy of estimations. As shown by our evaluation (using real and synthetic datasets), the proposed approach enjoys high accuracy (well below 10% relative error) across all aggregate operators, with low response times (well below a millisecond) and low footprint and training-time overheads. The contributed adaptability mechanism is able to detect changes using estimated errors and swiftly adapt. Furthermore, as more queries are processed, our system has the potential to reach global convergence, as no more query patterns remain undiscovered. This can significantly reduce unnecessary communication to cloud providers and thus reduce network load and monetary costs.
8.1. Real Datasets & Workloads
Crimes: To construct the workload over the Crimes dataset, we initially sampled 10K points and obtained the mean and standard deviation for the attributes X_Coordinate and Y_Coordinate. Using the obtained statistics as parameters for a multivariate normal distribution, we generated a number of cluster points (). Centered on each of those cluster points, we constructed multivariate normal distributions with a fraction () of the original standard deviation. From these we generated a number of points () around each cluster point. Based on those points, we constructed queries covering a random range across both the X/Y coordinates. For each point, a fraction of the original std was used to construct ranges : , where x_std is the original standard deviation; the same approach was followed for the y_range. The queries filtered the complete dataset based on (where pt is the point) :
On the filtered dataset generated by each query, with varying cardinality, we extracted basic statistics like COUNT, AVG, and SUM on attributes for which they make sense (Beat - avg, Arrest - sum).
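The Crimes workload construction can be sketched as follows. This is a minimal illustration under assumptions: the range around each point is taken as a random share of a fraction of the original standard deviation, the filter keeps rows within pt ± range on both coordinates, and the toy rows, field names, and parameter values are all hypothetical.

```python
import random
from statistics import mean

def make_range_query(pt, x_std, y_std, fraction, rng):
    """Build a box around pt; half-widths are a random share of fraction * std (assumed form)."""
    x_range = rng.random() * fraction * x_std
    y_range = rng.random() * fraction * y_std
    return (pt[0] - x_range, pt[0] + x_range, pt[1] - y_range, pt[1] + y_range)

def run_query(rows, box):
    """Filter rows inside the box and compute COUNT, AVG(beat), SUM(arrest)."""
    x_lo, x_hi, y_lo, y_hi = box
    hits = [r for r in rows if x_lo <= r["x"] <= x_hi and y_lo <= r["y"] <= y_hi]
    return {
        "count": len(hits),
        "avg_beat": mean(r["beat"] for r in hits) if hits else None,
        "sum_arrest": sum(r["arrest"] for r in hits),
    }

rng = random.Random(42)
# Toy stand-in for the dataset: coordinates standardized to mean 0, std 1.
rows = [{"x": rng.gauss(0, 1), "y": rng.gauss(0, 1),
         "beat": rng.randint(100, 2500), "arrest": rng.randint(0, 1)}
        for _ in range(1000)]
cluster_pt = (rng.gauss(0, 1), rng.gauss(0, 1))          # cluster point
query_pt = (rng.gauss(cluster_pt[0], 0.1),               # point near the cluster,
            rng.gauss(cluster_pt[1], 0.1))               # using a fraction of the std
box = make_range_query(query_pt, x_std=1.0, y_std=1.0, fraction=0.5, rng=rng)
stats = run_query(rows, box)
```

Each (query, statistic) pair produced this way would form one training example for the query-driven models.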
Sensors: For Sensors, we obtained the mean and standard deviation of the temporal dimension after encoding and normalizing it. The min/max statistics were also obtained. We then generated a number of queries () with range equal to a fraction of the complete distance between the max and min (). The center of each query was randomly generated from a normal distribution with parameters equal to the obtained mean and half the original standard deviation. The same process as in Crimes was used to filter the complete dataset and extract statistics on attributes for which they make sense (Temperature - avg, Light - sum).
8.2. Synthetic Data Generation
Specifically, we generated datasets with varying dimensionality to simulate data with large number of columns. Each column contains numbers generated from a uniform distribution , therefore each point is where . We then generated a number of query workloads with a varying number of selected columns to be restricted by predicates (Increasing predicates even further resulted in no tuples returned) according to the dimensionality of each of the synthetic datasets. For instance, for a dataset with the resulting workloads were two, with the selected columns being . Each query point for those two datasets is then .
We also describe the query generation process in Algorithm 2. We first set the number of queries and the number of predicates and columns in our datasets , . The retrieved range size is drawn from a Normal distribution whose mean is a fraction of the complete range, defined by a selectivity ratio , to ensure that enough tuples are returned, since fewer tuples are returned as the number of predicates increases. We then loop over the given parameters and generate queries using the provided algorithms, where the function returns a random vector of size over the given range, and generates a random bitmask for the selection of columns taking part in the query. Then the lower bound and upper bound for the given predicates are set, and we execute the query over the restricted space given by the bounds over the selected columns by using dataset .
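The query generation process described above can be sketched as follows. This is not Algorithm 2 itself: the uniform range [0, 1), the standard deviation of the range size (10% of its mean), and all function and parameter names are assumptions made for illustration.

```python
import random

def random_bitmask(num_columns, num_predicates, rng):
    """Bitmask with exactly num_predicates ones, marking the restricted columns."""
    chosen = set(rng.sample(range(num_columns), num_predicates))
    return [1 if i in chosen else 0 for i in range(num_columns)]

def generate_query(num_columns, num_predicates, selectivity, rng, low=0.0, high=1.0):
    """Return (lower, upper, mask) for one random range query."""
    center = [rng.uniform(low, high) for _ in range(num_columns)]  # random vector
    mask = random_bitmask(num_columns, num_predicates, rng)
    mean_range = selectivity * (high - low)      # mean range is a fraction of the full range
    width = abs(rng.gauss(mean_range, 0.1 * mean_range))  # assumed std: 10% of the mean
    lower = [c - width / 2 for c in center]
    upper = [c + width / 2 for c in center]
    return lower, upper, mask

def execute_query(dataset, lower, upper, mask):
    """Keep rows that satisfy the bounds on every column selected by the mask."""
    return [row for row in dataset
            if all(lo <= v <= hi
                   for v, lo, hi, m in zip(row, lower, upper, mask) if m)]

rng = random.Random(7)
dataset = [[rng.uniform(0.0, 1.0) for _ in range(5)] for _ in range(2000)]
lower, upper, mask = generate_query(num_columns=5, num_predicates=2,
                                    selectivity=0.5, rng=rng)
result = execute_query(dataset, lower, upper, mask)
```

Only the columns flagged in the bitmask restrict the result, so adding predicates shrinks the returned subset, which is why the range size is tied to a selectivity ratio.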
- The number of clusters is automatically identified by the clustering algorithm used. (12)
- A Pearson’s correlation of 0.3 was obtained on a real dataset.
- This is associated with the vigilance parameter in Adaptive Resonance Theory, which deals with the stability-plasticity dilemma.
- Only two dimensions shown for visualization purposes
- (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, pp. 29–42. Cited by: §6.
- (2015) Learning set cardinality in distance nearest neighbours. In Data Mining (ICDM), 2015 IEEE International Conference on, pp. 691–696. Cited by: §6, §6.
- (2017) Query-driven learning for predictive analytics of data subspace cardinality. ACM Transactions on Knowledge Discovery from Data (TKDD) 11 (4), pp. 47. Cited by: §6, §6.
- (2012) Stochastic gradient descent tricks. In Neural networks: Tricks of the trade, pp. 421–436. Cited by: §4.2.
- (2016) Xgboost: a scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794. Cited by: §5.1.
- (2018) Crimes - 2001 to present. Note: URL: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2 Accessed: 2018-08-10 Cited by: §5.1.
- (2015) Learning in nonstationary environments: a survey. IEEE Computational Intelligence Magazine 10 (4), pp. 12–25. Cited by: §3, §4.2, §6.
- (2011) Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks 22 (10), pp. 1517–1531. Cited by: §6.
- (2001) The elements of statistical learning. Vol. 1, Springer series in statistics New York, NY, USA:. Cited by: §2.1.
- (2016) Incremental learning algorithms and applications. In European Symposium on Artificial Neural Networks (ESANN), Cited by: §3, §4.2, §6.
- (2003) Use of risk-adjusted CUSUM and RSPRT charts for monitoring in medical contexts. Statistical methods in medical research 12 (2), pp. 147–170. Cited by: §3.1.
- (2018) Growing networks. Note: URL: https://github.com/Skeftical/GrowingNetworks Accessed: 2018-08-10 Cited by: footnote 1.
- (1997) Online aggregation. In Acm Sigmod Record, Vol. 26, pp. 171–182. Cited by: §6.
- (2015) Overview of data exploration techniques. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 277–281. Cited by: §1, §3.
- (2018) Intel lab data. Note: URL: http://db.csail.mit.edu/labdata/labdata.html Accessed: 2018-08-10 Cited by: §5.1.
- (2016) Quickr: lazily approximating complex adhoc queries in bigdata clusters. In Proceedings of the 2016 International Conference on Management of Data, pp. 631–646. Cited by: §5.2, §5.2, §6.
- The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization & Computer Graphics (1), pp. 1–1. Cited by: §1.
- (2018) Query-based workload forecasting for self-driving database management systems. In Proceedings of the 2018 International Conference on Management of Data, pp. 631–645. Cited by: §6.
- (2002) A self-organising network that grows when required. Neural networks 15 (8-9), pp. 1041–1058. Cited by: §5.1.
- (2014) Mixture of experts: a literature survey. Artificial Intelligence Review 42 (2), pp. 275–293. Cited by: §2.1.
- (2010) Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment 3 (1-2), pp. 330–339. Cited by: §6.
- (2018) VerdictDB: universalizing approximate query processing. In Proceedings of the 2018 International Conference on Management of Data, pp. 1461–1476. Cited by: §5.2, §6.
- (2017) Database learning: toward a database that becomes smarter every time. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 587–602. Cited by: §6.
- (2003) Quiet: continuous query-driven index tuning. In Proceedings of the 29th international conference on Very large data bases-Volume 29, pp. 1129–1132. Cited by: §6.
- (2015) Hidden technical debt in machine learning systems. In Advances in neural information processing systems, pp. 2503–2511. Cited by: §3.
- (2009) Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2 (2), pp. 1626–1629. Cited by: §6.
- (2004) The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin 106 (2). Cited by: §3, §4.2, §6.
- (2017) Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1009–1024. Cited by: §6.
- (2015) Wanalytics: geo-distributed analytics for a data intensive world. In Proceedings of the 2015 ACM SIGMOD international conference on management of data, pp. 1087–1092. Cited by: §1.
- (2017) Data canopy: accelerating exploratory statistical analysis. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 557–572. Cited by: §2.1, §5.1.