Discovering Aggregate Exceptions over Hidden Web Databases

Saad Bin Suhaim, Weimo Liu, Nan Zhang
The George Washington University
{ssuhaim, wliu, nzhang10}@gwu.edu
Abstract

Nowadays, many web databases “hidden” behind restrictive search interfaces (e.g., Amazon, eBay) contain rich and valuable information that is of significant interest to various third parties. Recent studies have demonstrated the possibility of estimating/tracking certain aggregate queries over dynamic hidden web databases. Nonetheless, tracking all possible aggregate query answers to report interesting findings (i.e., exceptions), while still adhering to the stringent query-count limitations enforced by many hidden web database providers, is very challenging. In this paper, we develop a novel technique for tracking and discovering exceptions (in terms of sudden changes of aggregates) over dynamic hidden web databases. Extensive real-world experiments demonstrate the superiority of our proposed algorithms over baseline solutions.

I Introduction

In recent years, decision support and data mining systems have become very important for analyzing very large databases. Yet these systems have limited functionality and lack some important features. One important missing feature is the ability to detect and report interesting findings (e.g., exceptions). Moreover, these systems cannot be applied directly to hidden web databases, even though such databases contain rich and valuable content that interests various third parties.

In this paper, we develop a novel technique for discovering exceptions of aggregate queries, e.g., AVG and SUM, over hidden web databases.

Change Detection: Monitoring and reporting changes in databases is very important for decision support and data mining systems. These systems benefit from executing aggregate queries (e.g., SUM, AVG) over very large databases to detect and report interesting findings, which can often be very expensive and resource intensive. Many systems pre-compute aggregates to improve response time. However, they do not calculate all aggregates (this is the view selection problem) and rely on a predetermined set of queries. An example of such systems is online analytical processing (OLAP).

Hidden Web Databases: Hidden web databases reside behind restrictive search interfaces. These databases are not accessible through traditional search engines; the only way to access them is to submit desired values for one or more attributes (forming a conjunctive search query) and receive a small number of matching tuples (bounded by a constant, e.g., 50 or 100). This type of database is very common on the Internet; examples include Amazon.com, Yahoo! Autos, eBay.com, etc.

Problem Motivation: Hidden web databases contain rich content that interests various third parties, such as government agencies and the academic and commercial sectors. These parties could benefit from the ability to monitor a wide variety of aggregate queries and report interesting findings, since aggregates are the most common type of query in decision support systems. For example, during Black Friday/Cyber Monday of 2014, Amazon announced a huge discount on its new unlocked Amazon Fire smartphone, from $499 to $199 (a $250 price drop). This announcement resulted in a sharp drop in the average price of the same smartphone on other websites (e.g., the average price of all Amazon Fire smartphones on eBay dropped from $310 to $258). More generally, when the AVG price of a certain product listed in a sales database rises rapidly, this may indicate an increase in demand for the product. Similarly, when the AVG salary offered in a job database for positions requiring a certain skill rises quickly, this may indicate an expansion of the corresponding market.

Challenges: Discovering exceptions of aggregate queries on hidden web databases has two critical challenges: i) the challenge of hidden web databases in general; and ii) the challenge of dynamic aggregate estimation.

  1. Challenge of hidden web databases: Most real-world hidden web databases do not directly support aggregate queries through their web interfaces. The only way to answer such aggregates is by combining multiple search queries. The problem is that most real-world web databases limit the number of search queries one can issue per IP address or per developer key (e.g., eBay limits API calls to 5,000 per day). Prior work [8] discussed how to estimate aggregate queries under such limits for static databases (i.e., assuming the database does not change over time). This is an unrealistic assumption, since in reality most real-world web databases are frequently updated. An improved technique was introduced in [3] to handle the query limits over dynamic databases. However, that solution only considers a single aggregate at a time. To discover exceptions over a dynamic database, we need to monitor a large group of aggregates, which makes the limitation problem even more challenging.

  2. Challenge of dynamic aggregate estimation: Prior work (e.g., [3]) introduced techniques to overcome (i) up to a certain level and estimate aggregates over dynamic hidden web databases. The existing algorithm applies a random walk technique and splits the available query budget between reissuing (i.e., updating) previous drill-downs and initiating new drill-downs, to track and estimate various types of aggregates. This technique wastes many queries by producing independent samples, such that each sample is used to estimate only one candidate once (i.e., no information is retained/reused for other aggregate queries).

In contrast, applying traditional decision support systems to this problem requires full access to the database itself, something that is rarely available for hidden web databases. Even if we somehow had full access, constantly finding exceptions of aggregates would require formidable resources. Previous studies [10, 11] proposed using precomputed samples of the data (e.g., uniform random samples) to reduce response time while giving acceptable answers. Yet selecting inappropriate samples for arbitrary aggregates leads to large estimation errors. Therefore, selecting the right samples to give the right estimations and find exceptions, while minimizing the query cost, becomes a complicated but important problem.

II Problem Definition

The problem we consider in this paper is how to track and discover exceptions of aggregates over dynamic hidden web databases.

II-A Dynamic Hidden Databases

In this paper, we restrict our discussion to categorical data. Consider a database D with n tuples and d attributes a_1, …, a_d, each of which has a discrete domain. The domain of a_i is denoted by Dom(a_i) for each i in [1, d], where |Dom(a_i)| represents the cardinality of Dom(a_i), i.e., the number of possible values of a_i (domain values are always known). We assume no duplicate tuple exists in D, and each tuple t can be represented by a d-dimensional vector.

We assume a prototypical interface where users can query the database by specifying values for a subset of attributes. Thus a search query q is of the form:

SELECT * FROM D WHERE a_{i1} = v_{i1} AND … AND a_{ip} = v_{ip},

where each v_{ij} is a value from Dom(a_{ij}), and Res(q) is the result of q returned by the interface.

Let Sel(q) be the set of tuples in D that satisfy q. As is common with most web interfaces, we shall assume that the query interface is restricted to return only up to k tuples, where k is a predetermined small constant (such as 100 or 500). Thus, Sel(q) will be entirely returned iff |Sel(q)| ≤ k. If the query is too broad (i.e., |Sel(q)| > k), only k of the tuples in Sel(q) will be selected according to a ranking function and returned as the query result. Note that repeating the same query may not retrieve new tuples, i.e., the same top-k tuples may always be returned. Along with the tuples, we also assume that the interface returns the total number of tuples satisfying q, denoted by COUNT(q) = |Sel(q)|, which determines the status of q: we say q is an overflow when COUNT(q) > k, i.e., not all tuples satisfying q can be returned. At the other extreme, when q is too specific such that COUNT(q) = 0, we have an underflow, i.e., no tuples are returned. If 0 < COUNT(q) ≤ k, we have a valid query.
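To make this interface model concrete, the following is a minimal Python sketch of a restrictive top-k interface and the overflow/underflow/valid classification above. The names (HiddenDB, QueryResult, the dict-based predicates) are our own illustration, not any real site's API, and the unknown ranking function is simulated by simply truncating the match list.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    count: int    # COUNT(q): total number of matching tuples, reported by the interface
    tuples: list  # at most k returned tuples
    status: str   # "overflow", "underflow", or "valid"

class HiddenDB:
    def __init__(self, tuples, k=100):
        self.tuples = tuples  # each tuple is a dict: attribute -> value
        self.k = k            # the interface returns at most k tuples

    def query(self, predicates):
        """Conjunctive query: predicates maps attribute -> required value."""
        matches = [t for t in self.tuples
                   if all(t.get(a) == v for a, v in predicates.items())]
        n = len(matches)
        status = ("underflow" if n == 0 else
                  "overflow" if n > self.k else "valid")
        # An overflowing query returns only k tuples; truncation stands in
        # for the unknown ranking function.
        return QueryResult(count=n, tuples=matches[:self.k], status=status)
```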

For the purposes of this paper, we assume that a restrictive interface does not allow users to scroll through the complete answer of an overflowing query. Instead, users must pose new queries by reformulating some of the search conditions. This is a reasonable assumption, since many real-world interfaces (e.g., Google) limit page turns (to 100 at the time this paper was written).

II-B Aggregate Query

An aggregate query for a target attribute a_t with selection condition S takes the form:

SELECT AGGR(a_t) FROM D WHERE S,

where AGGR is an aggregate function such as COUNT, SUM, AVG, MAX, or MIN. The selection condition S is a conjunction of predicates of the form a_i = v_i, where v_i is a value from Dom(a_i).

For example, assume we have two attributes a_1 and a_2, both boolean. S could be the conjunction a_1 = v_1 AND a_2 = v_2, and we denote the result of this aggregate query by AGGR(S). Now suppose you want to know the average price of the HTC cellphones with a five-inch screen. The aggregate query takes the form:

SELECT AVG(price) FROM D WHERE brand = HTC AND screen = 5-inch

II-C Exception

Consider an aggregate query and its values observed at fixed time intervals t_1, t_2, …, t_i. We define an aggregate query as an exception by three factors. First, an aggregate that is always over only a few tuples is unnecessary to consider. For example, if only one tuple satisfies the selection condition, any change of that tuple is itself a change of the aggregate. Such an aggregate is not as meaningful as one over many tuples, so we set a count threshold τ_c to filter out small aggregates: if COUNT(S) < τ_c, we do not treat the aggregate as a potential exception. Second, if the current aggregate value is far from what it was before, we consider it a potential exception. To measure how “far away” the value is, we compute the distribution of the aggregate based on historical data. This gives an interval around the historical average to which most past values belong; if the current value falls outside this interval, it is a low-probability event, and we continue to treat it as a potential exception. To define a low-probability event, we set a probability threshold τ_p, such that a low-probability event has a chance of occurrence of at most τ_p (or at least 1 − τ_p on the other tail). Third, the change percentage (increment/decrement) of the last day compared with previous days must be big enough: we set a change threshold τ_d, such that an aggregate qualifies only when its last-day change percentage is at least τ_d. Once an aggregate meets all three factors, we treat it as an exception.

We define an aggregate as an exception at t_i when it satisfies all conditions below:

(1) COUNT(S) ≥ τ_c;

(2) F(v_i) ≤ τ_p or F(v_i) ≥ 1 − τ_p;

(3) |Δ| ≥ τ_d;

where v_i is the aggregate value at t_i, Δ is its change percentage, F is the cumulative distribution function of a random variable X ~ N(μ, σ²), μ is the average of the historical values, and σ² is their variance. Here, we assume that the aggregate value follows a normal distribution.

An example follows: assume we want to monitor the average price of all cellphones under given thresholds τ_c, τ_p, and τ_d. Suppose the average price of the Motorola cellphones with a 4-inch screen drops suddenly: based on the historical data, the current AVG(price) drops by τ_d or more and falls in the left tail of the normal distribution, with F(v_i) ≤ τ_p. Thus, “SELECT AVG(price) WHERE brand = Motorola AND screen = 4-inch” is returned as an exception.
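As an illustration, a sketch of this three-condition test in Python follows. The names tau_count, tau_prob, and tau_change are our notation for τ_c, τ_p, and τ_d, and the normality assumption mirrors the definition above.

```python
from statistics import NormalDist, mean, pvariance

def is_exception(count, history, current, tau_count, tau_prob, tau_change):
    """history: aggregate values at t_1..t_{i-1}; current: value at t_i."""
    # (1) enough tuples behind the aggregate
    if count < tau_count:
        return False
    # (2) current value is a low-probability event under N(mu, sigma^2)
    mu, var = mean(history), pvariance(history)
    if var == 0:
        return False  # degenerate history: no distribution to test against
    cdf = NormalDist(mu, var ** 0.5).cdf(current)
    if not (cdf <= tau_prob or cdf >= 1 - tau_prob):
        return False
    # (3) last-day change percentage is large enough
    change = abs(current - history[-1]) / abs(history[-1])
    return change >= tau_change
```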

III Solution

A straightforward approach is to crawl the entire database D and then find the exceptions. The benefit of this solution is that all exceptions can be found. However, it is an unreasonable solution for this type of problem, since we would need to crawl D not only once but at every time interval t_i. Moreover, because of the query limitations enforced on hidden web databases, we cannot enumerate all the selection conditions to check whether each of them is an exception. A more reasonable approach is to take a sample to estimate the aggregates. There is deep prior research on aggregate estimation with a selection condition. But in this problem, estimating each selection condition separately wastes a lot of queries, because a sample need not serve only one selection condition: if we already know that a tuple satisfies S_1 AND S_2, it is a sample for both S_1 and S_2. For example, if we already have the full list of the Motorola cellphones with 4-inch screens, these cellphones can serve as a sample to estimate the average price of both Motorola cellphones and 4-inch-screen cellphones. Hence, we take a sample of the dataset first and then make different estimations based on it.

We now introduce two concepts that are essential to our solution: the query-pool and Apriori.

Query-Pool: A query-pool QP is introduced to guide the process of identifying queries. A query q is a candidate when COUNT(q) ≥ τ_s, i.e., the number of tuples matching q is at least the support threshold τ_s. Our proposed algorithms use QP as part of their solutions to identify candidates. Once all candidates are identified and placed in QP, we answer the aggregate queries and find exceptions.

Apriori Algorithm: The Apriori algorithm [1] is a famous and influential algorithm for mining frequent itemsets for boolean association rules. One of its functionalities is to determine the frequent itemsets, i.e., the sets of items that have minimum support. Apriori extends frequent subsets by one item at a time in a “bottom-up” approach (a step known as candidate generation) and tests the groups of candidates against the data. Similarly, the query candidate generation in our query-pool uses the same approach: predicates are extended one attribute at a time until no further successful extensions are found. Recall that, as mentioned under Query-Pool, a query q is a candidate when COUNT(q) ≥ τ_s. For example, assume we have two boolean attributes a_1 and a_2. We have eight queries (a_1 = 0), (a_1 = 1), (a_2 = 0), …, (a_1 = 1 AND a_2 = 1). A query q is a candidate when COUNT(q) ≥ τ_s.
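Below is a sketch of this Apriori-style candidate generation over the HiddenDB interface above. The attribute-ordering trick used to avoid generating the same conjunction twice is a standard Apriori device, not something the paper specifies.

```python
# Level-wise ("bottom up") generation of the query-pool: extend candidate
# conjunctions one attribute at a time, keeping only those whose COUNT
# meets the support threshold.

def build_query_pool(db, domains, support):
    """domains: dict attribute -> list of possible values."""
    attrs = sorted(domains)
    # Level 1: single-predicate candidates.
    level = []
    for attr in attrs:
        for v in domains[attr]:
            if db.query({attr: v}).count >= support:
                level.append({attr: v})
    pool = list(level)
    while level:
        next_level = []
        for cand in level:
            # Only extend with attributes after the last one used, so each
            # conjunction is generated exactly once.
            last = max(attrs.index(a) for a in cand)
            for attr in attrs[last + 1:]:
                for v in domains[attr]:
                    ext = {**cand, attr: v}
                    if db.query(ext).count >= support:
                        next_level.append(ext)
        pool.extend(next_level)
        level = next_level
    return pool
```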

III-A Baseline Algorithm

The Baseline algorithm consists of two phases. The first is generating the query-pool using the Apriori approach discussed in Section III. Once this phase is done, we have all candidates in QP. The second phase is to find all exceptions from the query-pool. This can be done using the random walk approach for sampling hidden web databases proposed in [2]. The idea of the random walk is centered on drill-downs over a query tree. The root of the query tree is SELECT * FROM D, and the tree organizes queries from broad (top) to specific (bottom). At each level, the walk appends a random conjunctive constraint to the selection condition until a valid query is reached.

To picture the random walk process, fix a specific ordering of all (binary) attributes, e.g., a_1, …, a_d. The random walk starts from the root by issuing the query q = SELECT * FROM D. Each time q overflows, we expand it by adding the next attribute with a randomly assigned predicate, one at a time. For example, when q overflows for the first time, we expand it by assigning a randomly selected predicate to a_1 (either a_1 = 0 or a_1 = 1). This process eventually leads to either a valid or an underflowing query. If q is valid, we randomly choose one of the returned tuples. Otherwise, we restart the random walk process.

Consider the query tree example of a database with three attributes and three tuples, organized as a complete binary tree with 4 levels. The i-th level represents attribute a_i and the leaves represent possible tuples. Each node at level i has two outgoing edges, labelled 0 and 1 respectively. The leaves at level d + 1 represent all possible assignments of values to the attributes. Each assignment is unique: no two leaf nodes share the same path. Not all leaves necessarily represent existing tuples, as some leaves may be empty, which is very common in a real-world database. When applying the random walk, we start from the top node of the tree (i.e., the first attribute) and randomly pick a value, 0 or 1 (i.e., an edge), with equal probability. Every time we assign an attribute, we check whether the query overflows. If it does, we pick a random value for the next attribute, in this case a_2, and check the status again. We repeat this, assigning attributes in order, until we reach a valid or underflowing query. If the query underflows, we restart the process from the beginning; otherwise we have a valid query returning m tuples, and we pick one of them with probability 1/m. Note that if the walk stopped after assigning j attributes, the access probability of the picked tuple t is p(t) = 1/(2^j · m). The tuple is then accepted with probability C/p(t), where C is a scale factor that boosts the selection probabilities to make the algorithm efficient. For a categorical dataset, the only difference is that we have |Dom(a_i)| choices at the i-th level, and the access probability is p(t) = 1/(m · ∏_i |Dom(a_i)|) over the assigned levels.
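A compact sketch of one such drill-down under the boolean interface model above; the bias-correcting acceptance step (accept t with probability C/p(t)) is left to the caller, and all names are again illustrative.

```python
import random

def random_walk(db, attrs, root_predicates=None):
    """One drill-down; returns (tuple, access probability p(t)) or (None, None)."""
    predicates = dict(root_predicates or {})   # a candidate itself can be the root
    walk_prob = 1.0
    result = db.query(predicates)
    while result.status == "overflow":
        remaining = [a for a in attrs if a not in predicates]
        if not remaining:
            break
        predicates[remaining[0]] = random.choice([0, 1])  # drill down one level
        walk_prob *= 0.5
        result = db.query(predicates)
    if result.status != "valid":
        return None, None                      # underflow: the caller restarts
    t = random.choice(result.tuples)           # pick one of m returned tuples
    return t, walk_prob / result.count         # p(t) = (1/2^j) * (1/m)
```

To de-bias, a caller would accept t with probability C/p(t) for a suitable scale factor C, restarting the walk otherwise.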

For each candidate in QP, we issue queries based on the approach above to get samples at each t_i. But instead of starting from SELECT * FROM D as the root, the root for each query candidate is the candidate itself; i.e., if the candidate is SELECT * FROM D WHERE a_1 = v_1 AND a_2 = v_2, the root of its query tree is a_1 = v_1 AND a_2 = v_2. Based on our definition in Section II-A, the candidate starts as an extremely broad (and thus overflowing) query, and the drill-down process narrows it down by adding randomly selected predicates until a valid query is reached. Once we have a valid query, we select our sample. [2] describes the random walk process in more detail.

When selecting a sample, we also compute the probability p of the sample being selected in a drill-down. The same sample obtained from different candidates can have different probabilities, depending on the number of drill-down levels performed before the sample is selected. To illustrate, consider two candidates q_1 and q_2 whose drill-downs reach the same sample t: their probabilities differ when, say, q_1 requires three drill-down levels while q_2 requires two. Once we select the sample and calculate its probability, we estimate the average price for the candidate and judge whether it is an exception by the definition in Section II-C. The algorithm is as follows:

Data: Workload (query-pool) QP
Result: A sample set and the estimations for all candidates
for j = 1 to n_c do
       take new samples for q_j;
       if q_j is an exception then
             return q_j as an exception;
       end if
end for
n_c is the total number of candidates in QP. q_j is a candidate. S_j is the set of samples corresponding to q_j.
Algorithm 1 Baseline

However, a sample that satisfies the selection conditions of several candidates in QP at the same time is used by this algorithm to estimate only one of those candidates. For example, the sample above can be used for both candidates q_1 and q_2, but the baseline algorithm uses it only once. When the intersection of two candidates is very big, this is obviously not a good idea. To estimate the aggregates of all candidates over time, the baseline algorithm also treats every candidate at t_1, t_2, …, separately, so the estimations are independent of each other. Although simple, this has two obvious disadvantages. First, it wastes numerous queries because no information is retained/reused from time to time. Second, when two candidates intersect, it is wasteful that the samples in the intersection serve only one estimation, as in this naive algorithm. In subsections III-B1 and III-B2, we introduce our STRATIFIED-DETECTOR and UDOMETER algorithms to address these two problems, building on state-of-the-art techniques.

III-B Our Algorithms

III-B1 Stratified-Detector

Given a query-pool QP, STRATIFIED-DETECTOR improves the selection of samples for this query-pool. The technique reduces query cost while minimizing estimation error for a given query-pool, following the approach proposed in [4]. The idea of stratified sampling is to generalize uniform sampling by partitioning the population into multiple strata and selecting samples uniformly from each stratum. The more “important” a stratum is, the more it contributes to the picked samples.

Stratified Sampling Example:
Let us consider a database D that has one attribute price and four tuples, such that t_1 = {100}, t_2 = {150}, t_3 = {200}, and t_4 = {250}. Consider the aggregate query Q = SELECT COUNT(*) FROM D WHERE price ≥ 200. Let P(Q) define the population of Q over D as a set of size |D| that contains, for each tuple, the value of the aggregated column selected by Q (1 for COUNT), or 0 otherwise. We have P(Q) = {0, 0, 1, 1}. Since P(Q) has a mix of 1’s and 0’s, it has nonzero variance, which makes uniform sampling a poor choice for this problem. To get zero variance, we had better partition D into two strata {t_1, t_2} and {t_3, t_4}, so that P(Q) splits into two strata {0, 0} and {1, 1}, both with zero variance. Nonetheless, if we also have Q' = SELECT COUNT(*) FROM D WHERE price ≥ 150, then P(Q') = {0, 1, 1, 1} (different from P(Q)), and each query has its own population over D. So the challenge is to adapt stratified sampling so that it works well for all queries in the query-pool.
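A quick numeric check of this example, under the assumption that the two predicates are price ≥ 200 for Q and price ≥ 150 for Q', which reproduce the stated populations:

```python
from statistics import pvariance

prices = [100, 150, 200, 250]

pop_q = [1 if p >= 200 else 0 for p in prices]     # P(Q)  = [0, 0, 1, 1]
print(pvariance(pop_q))                            # 0.25: uniform sampling is noisy
print(pvariance(pop_q[:2]), pvariance(pop_q[2:]))  # 0.0 0.0: both strata are flat

pop_q2 = [1 if p >= 150 else 0 for p in prices]    # P(Q') = [0, 1, 1, 1]
print(pvariance(pop_q2[:2]))                       # 0.25: same strata no longer help Q'
```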

In general, stratified sampling partitions D into r strata containing n_1, …, n_r tuples (where n_1 + … + n_r = n), with k_1, …, k_r tuples uniformly sampled from each stratum (where k_1 + … + k_r = k). The stratified sampling proposed in [4] discusses how to apply stratified sampling effectively in databases. It consists of three steps: (1) stratification, determining the number of strata r and which tuples of D belong to each stratum; (2) allocation, determining how to divide the k samples across the strata; (3) sampling, applying the random walk to sample k_j tuples from stratum j.

When applying the above steps to our problem, we divide the data into strata such that, for any stratum, each candidate in QP selects either all tuples in the stratum or none. The number of strata depends on both the candidates and the data (generally, it is upper-bounded by a quantity exponential in the number of candidates). After identifying the strata, the stratified sampling algorithm picks the samples. Note that the strata must be non-overlapping and mutually exclusive.

However, the above technique is not suitable when the size of the query-pool is big (i.e., the number of strata is large). Instead, when the space of one candidate q_2 is a subset of another candidate q_1, STRATIFIED-DETECTOR takes samples separately from q_2 and from the remainder q_1 \ q_2, then merges the two by stratified sampling to estimate q_1. This yields estimations for both q_1 and q_2. Note that, when applying this technique, we also need the selection probability of each sample to compute the estimate. Furthermore, if we have another candidate q_3 that is a subset of q_2, we can directly take a sample from q_3, and then obtain the estimation of q_2 from the sample of q_3 together with the stratified sample of q_2 obtained previously. Moreover, when a candidate contains more than one longer candidate, we use the longer candidate with the biggest subspace in terms of size (i.e., the longer candidate with the biggest COUNT). For example, if q_1 contains three longer candidates q_2, q_3, and q_4 with COUNTs of 1000, 2000, and 3000 respectively, we use q_4 to estimate q_1.

To illustrate this stratified sampling, let us reconsider q_1 and q_2, such that q_1 is a two-conjunction candidate and q_2 is a three-conjunction candidate whose space lies inside q_1. Thus, the average price of q_1 is

AVG(q_1) = [ COUNT(q_2) · AVG(q_2) + (COUNT(q_1) − COUNT(q_2)) · AVG(q_1 \ q_2) ] / COUNT(q_1),

where AVG(q_2) is the average price of q_2 and AVG(q_1 \ q_2) is the average price of the remaining subspace. If we select the samples for q_2 (at level 3) and for q_1 \ q_2 (at level 2) to calculate AVG(q_1), we plug in their sample-based estimates, weighted by the corresponding selection probabilities. Note that COUNT(q_1) and COUNT(q_2) are reported directly by the interface.
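A sketch of this count-weighted combination; the numbers in the usage line are made up for illustration.

```python
def combine_avg(count_q1, count_q2, avg_q2, avg_rest):
    """Combine two strata: q2's space lies inside q1.
    avg_rest: estimated average over the remaining subspace q1 \\ q2."""
    count_rest = count_q1 - count_q2
    return (count_q2 * avg_q2 + count_rest * avg_rest) / count_q1

# e.g., 1000 Motorola 4-inch phones averaging $258, plus 4000 other
# Motorola phones averaging $310:
print(combine_avg(5000, 1000, 258.0, 310.0))  # 299.6
```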

Similarly, we can calculate AVG(q_2), where the only difference is that the root of the drill-down becomes q_2 itself. Algorithm 2 depicts the pseudocode.

Data: Query-Pool QP
Result: A sample set and the estimations for all candidates
the sample set is initially empty;
for each conjunction length l (from longest to shortest) do
       for each candidate q_j in QP_l do
             if q_j does not include any longer candidate in QP then
                   take a new sample for q_j;
             else
                   stratified sampling based on the counts; take a new sample for the remaining subspace of q_j;
             end if
       end for
end for
QP_l is the set of candidates with length l in the workload. q_j is a candidate.
Algorithm 2 STRATIFIED-DETECTOR

The details of the stratified sampling are as follows. For the current candidate q_j, take the contained candidate with the most tuples as its subspace. Assuming we already have samples for that subspace, we take samples from the remaining subspace and then combine them with the samples of the contained candidate. We then have an estimation of the average of the target attribute, weighted by COUNT, where COUNT(q_j) is the number of tuples satisfying the selection condition of candidate q_j. After finishing the loop over QP, we have estimations for all the candidates in QP. Repeating this at every t_i, we have the estimations for all the candidates over time, so we can find out which ones are exceptions based on the definition.

However, when the variance of the samples is big, we cannot get a good result. The reason is that when we randomly pick samples for a candidate at each t_i, with a big variance we cannot distinguish whether a dramatic change is caused by the sampling or by the data itself. A straightforward solution is to take more than one sample for each candidate in each time interval in order to reduce the variance (in case the change is caused by sampling). The more samples we collect, the more accurate estimations we get. Of course, this solution is neither practical nor reasonable, since we have a limited number of queries to issue.

To overcome this problem, we introduce our UDOMETER algorithm discussed in the next subsection.

III-B2 Udometer

Even though most real-world databases are dynamic (updated at arbitrary times), many of them do not see frequent updates. And the fewer changes happen to the database, the fewer changes the sample will see. This means that the random walk technique discussed in Section III-A can waste a significant number of queries over time, since no information is retained/reused. Moreover, the saving in query cost can be directly translated into more accurate aggregate estimations. In particular, since updating a drill-down may consume fewer queries, the remaining query budget (after updating all previous drill-downs) can be used to initiate new drill-downs, increasing the number of drill-downs and thereby reducing the estimation error.

To understand how saving query cost reduces estimation error, consider the query tree example in Section III-A. Note that each drill-down corresponds to the unique sequence number of its leaf-level node, which uniquely identifies the drill-down. We call this a signature, such that at each time interval t_i we have a signature set in which each signature defines one performed drill-down. Now, collecting any sample at t_1 requires at least 4 queries, i.e., from the root to a leaf node (a tuple). If no change happens to the database at t_2, we can reach the same leaf node in only 1 query (by reissuing the valid query from t_1). This means we save 3 out of 4 queries. However, we still need to issue one more query on the leaf node’s parent to determine whether the reissued query is still the top non-overflowing query. This bounds the number of drill-downs that can be updated at each time interval to at most half of the query budget. Moreover, the estimations produced in different time intervals are no longer independent of each other, due to the reuse of the same signature set (of drill-downs).
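A sketch of re-validating one stored drill-down; the signature representation (the ordered predicate list of the previously valid query) is our own encoding of the idea above.

```python
def reissue(db, signature):
    """signature: ordered list of (attribute, value) pairs describing the
    previously valid query of a stored drill-down."""
    result = db.query(dict(signature))         # 1 query
    if result.status != "valid":
        return None                            # database changed: redo the walk
    parent = dict(signature[:-1])              # drop the last drill-down level
    if db.query(parent).status != "overflow":  # 1 more query, on the parent
        return None                            # no longer the top valid query
    return result.tuples                       # reused for ~2 queries, not 4+
```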

UDOMETER addresses the above problem by reissuing the same sample set from t_1 for all subsequent time intervals t_i (i > 1). UDOMETER does not reduce the query cost of generating the query-pool (Apriori still does that); yet it minimizes the total query cost by reducing the query cost of the random walk process (along with reducing the estimation error). To minimize the total query cost even further for a given query-pool of aggregate queries, we would also need to minimize the query cost of generating candidates.

When applying UDOMETER, we divide the problem into two parts based on the time interval: (1) day one (t_1), and (2) day two and beyond (t_i, i > 1). For day one, we apply STRATIFIED-DETECTOR as is (Algorithm 2). Once finished, we have a sample set for t_1. For day two and beyond, we reissue the same sample set from t_1. Note that by doing so, each candidate considered at t_i is a subspace of a candidate in the day-one query-pool. Although the number of candidates considered by UDOMETER may be less than the actual number of candidates for a specific day, the accuracy of our estimations should be higher, which leads to more accurate exceptions. Algorithm 3 provides the pseudocode for UDOMETER.

Data: Query-Pool QP
Result: A sample set and the estimations for all candidates
the sample set is initially empty;
for t_1 do
       apply STRATIFIED-DETECTOR (Algorithm 2); store the samples as the sample set of t_1;
end for
for t_2 to t_i do
       reissue the same sample set from t_1;
end for
Algorithm 3 UDOMETER

Reissuing the same set of samples decreases the variance of the samples and leads to more accurate estimation. UDOMETER is now able to detect the overall trend of the average price of most candidates (whether it is an increment or decrement). Yet detecting the overall trend is sometimes not enough to detect the exception itself. Two of the main reasons an exception occurs for a candidate are (1) a sudden drop/rise in price for a sizable number of tuples, and (2) insertion or deletion of tuples. When we use the subspace with the most tuples to estimate a candidate, such new or deleted tuples may not have a great enough impact on the estimate to surface an exception, because of the size ratio of the subspace to the candidate. Our PRIORITY-UDOMETER, which we introduce in the next subsection, can handle both events.

III-B3 Priority-Udometer

In the previous sections, we mentioned that when applying STRATIFIED-DETECTOR, we divide the target candidate into two subspaces, taking as the first subspace the contained candidate whose count is the biggest among the candidates that are subspaces of the target. This is a straightforward idea because we only need to take a new sample in the smaller remaining space. However, while this method of dividing the space is good for estimating aggregates like average and sum, it cannot achieve a good result for difference operators, like the change of an average. For example, suppose we have two candidates q_2 and q_3 inside q_1: q_2 is the bigger subspace but its average stays almost constant, while q_3 is slightly smaller than q_2 but its average changes dramatically, which makes q_1 an exception. In this situation, q_3 and q_1 \ q_3 is obviously a better partition. On the contrary, if q_3 is very small, then although its average changes dramatically, it has little impact on q_1, and it cannot be a good partition. Thus, we propose a score function to decide which partition is best.

The purpose of the score function is to compare the impact of the different candidates that are subspaces of the target q_1. The impact is driven by two factors: the size of the subspace and how much its aggregate changed. The score function we propose for a subspace candidate q_j is

score(q_j) = COUNT(q_j) · |ΔAVG(q_j)|,   (1)

and we compute the score of each candidate that is a subspace of the target q_1, then choose the one with the biggest score as the partition subspace.
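A sketch of the partition choice under this score; the product form follows our reading of the two factors above (subspace size times magnitude of the aggregate change), so treat the exact formula as an assumption.

```python
def score(count, avg_now, avg_prev):
    # subspace size, weighted by how much its average moved
    return count * abs(avg_now - avg_prev)

def pick_partition(subspaces):
    """subspaces: list of (name, count, avg_now, avg_prev) tuples;
    returns the name of the highest-scoring subspace."""
    return max(subspaces, key=lambda s: score(s[1], s[2], s[3]))[0]

# A big but flat subspace loses to a slightly smaller one that moved:
print(pick_partition([("q2", 2000, 300.0, 301.0),    # score  2000
                      ("q3", 1800, 250.0, 300.0)]))  # score 90000 -> "q3"
```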

IV Experiments

Figs. 1-6: Effect of … (threshold parameters) on recall and precision.
Fig. 7: Effect of query budget on recall.
Fig. 8: Effect of query budget on precision.
Figs. 9-12: Effect of … with the score function.
Fig. 13: Gray phones average price.

IV-A Experimental Setup

Dataset: We tested our algorithms on a real-world (categorical) web database, eBay.com, specifically the Cell Phones and Smartphones category, to which we have full offline access. The dataset, which we crawled, covers the period from Oct 21st, 2015 to Nov 20th, 2015 and contains 2,393,361 tuples and 8 attributes. All crawled tuples offer a “Fixed Price” option, and the attribute domain sizes range from 4 to 82.

Challenges: Recall from Section I that one of the main challenges of hidden web databases is the daily query budget. Even though we have complete access to the dataset, we still applied the same query budget that eBay.com enforces per IP address and per developer key. The default query budget per day is 5,000.

However, because of the query limitation, our algorithms first use the query budget available for one day as bootstrapping to generate the query-pool, before the algorithms start running the following day.

Query-Pool: Examining all candidates in a query-pool is very challenging because of the query budget limitation. A bigger query-pool means less query budget to spend per candidate. We tested our four algorithms under 5 different query-pool sizes (i.e., with support thresholds 1000, 2000, 3000, 4000, and 5000). Based on our dataset, this means that sometimes we can only issue two queries per candidate, while at other times we have the luxury of spending sixteen queries per candidate.

Relaxing the change threshold: When we tested the four algorithms, we relaxed the percentage threshold from 15% to 12%. This is because in some settings, many real exceptions in the dataset have exactly a 15% change, which makes them hard to detect without relaxing the threshold.

Algorithms Evaluated: We tested the four algorithms presented in this paper: BASELINE, STRATIFIED-DETECTOR, UDOMETER, and PRIORITY-UDOMETER. All four algorithms share the same database-controlled parameter of query budget per day and the exception-defining thresholds presented in subsection II-C.

Performance Measures: For estimation accuracy, we measure the percent error, i.e., |θ̃ − θ| / θ for an estimator θ̃ of an aggregate θ. For exception detection, we measure recall (the fraction of true exceptions that are reported) and precision (the fraction of reported exceptions that are true exceptions).
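For concreteness, the three measures as code, assuming the sets of reported and true exceptions are known (as they are in our offline setup):

```python
def percent_error(estimate, truth):
    return abs(estimate - truth) / abs(truth)

def recall(reported, true_exceptions):
    return len(reported & true_exceptions) / len(true_exceptions)

def precision(reported, true_exceptions):
    return len(reported & true_exceptions) / len(reported)
```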

IV-B Experimental Results

Recall and Precision: We start by comparing the performance of all four algorithms on recall under the default per-day query budget of 5000. Figure 4 shows the recall of the four algorithms with the probability threshold ranging from 0.01 to 0.99. As shown, both PRIORITY-UDOMETER and UDOMETER improve their recall as the threshold grows, and when it becomes very large both algorithms reach almost 100% recall, because both can then download almost all of the tuples corresponding to each candidate. On the other hand, STRATIFIED-DETECTOR and BASELINE do not show any improvement. In Figure 4, all algorithms achieve a higher recall with a bigger support threshold: when the support threshold is large, there are fewer candidates in the query-pool, and each candidate has a larger space and sample size. Figures 4 and 4 depict a comparison of the four algorithms in terms of precision. In Figure 4, the impact on both PRIORITY-UDOMETER and UDOMETER is small, and with a larger threshold the gap between these two algorithms shrinks. The precision of these two algorithms is not affected by the drill-downs, because when the support threshold is large, each candidate has a bigger sample size to spend. On the other hand, the gap between these two algorithms and the other two (STRATIFIED-DETECTOR and BASELINE) is very big.

Figure 13 shows that the change threshold does not have a big impact on recall. And with larger thresholds, there is barely a gap between PRIORITY-UDOMETER and UDOMETER, so the change threshold leads to almost no difference between these two algorithms. As for STRATIFIED-DETECTOR and BASELINE, their recall seems to improve when the change threshold is large. However, as shown in Figure 13, the precision of both algorithms seems to improve when it is smaller.

In contrast, Figures 13 and 13 reflect how the query limitation has a big impact on both recall and precision. The recall drops from 95% for a large query budget to close to 55% in the worst case for PRIORITY-UDOMETER (or 35% for UDOMETER). Moreover, the precision also drops from 95% to 40% for both algorithms. In the worst case, we can only spend two queries per candidate to estimate its average price and judge whether it is an exception or not; thus, we get low recall and precision. Overall, PRIORITY-UDOMETER has higher recall and precision than UDOMETER.

Score Function vs. Highest COUNT: Figures 13 and 13 depict the recall of the score function used in PRIORITY-UDOMETER compared with taking the highest COUNT in UDOMETER when choosing the subspace. These figures consider only the candidates that contain longer conjunctions (i.e., candidates with subspaces). No matter what the threshold or support threshold is, the score function always gives better predictions than simply taking the highest COUNT. Thus, PRIORITY-UDOMETER is more accurate for candidates with subspaces. Moreover, Figures 13 and 13 compare the precision under both settings. Similar to the recall, PRIORITY-UDOMETER in general achieves higher precision.

Finally, Figure 13 depicts how PRIORITY-UDOMETER detects the overall trend for candidates even when it incorrectly judges whether a candidate is an exception. The figure shows how the average price of gray phones drops sufficiently on Nov 20th to become an exception, which PRIORITY-UDOMETER detected correctly.

V Related Work

Crawling and Extraction for Hidden Databases: There has been a number of prior works on crawling and extracting hidden database content. Such crawling requires understanding query interfaces, which has been extensively studied (e.g., [5, 6]). Even though crawling the entire database (e.g., [7]) could find all exceptions, it is still an unreasonable solution for this particular problem, as we would need to crawl the database not just once but at every time interval (e.g., every day!). Moreover, because of the query limitations enforced on hidden databases, we cannot enumerate all selection conditions to check whether each of them is an exception.

Aggregate Estimations over Hidden Web Databases: Aggregate estimation over hidden web databases has been investigated for both static and dynamic databases. Both lines of work use efficient techniques to obtain random samples from hidden web databases, as described in [3, 8]. Unlike this paper, all prior work focuses on estimation with one selection condition at a time. In our problem, estimating each selection condition separately wastes a lot of queries, since a sample can serve one or more selection conditions.

Approximate Answers for Aggregate Queries: Decision support and data mining applications execute aggregate queries over very large databases to obtain important and useful information. Existing work (e.g., [4, 9]) introduced an efficient approach using stratified sampling to optimize scalability and resource usage when providing approximate answers for a given workload of queries. Unlike their workloads, where tuples have probabilities of occurrence given as input, our workload comes with no knowledge of the underlying distribution. Moreover, the workloads we consider need not be fixed or similar, and their size can be very large, which makes their optimal stratification technique unsuitable for this particular problem. Other work (e.g., [16]) focuses only on a specific type of query (i.e., join queries) for approximate query answering.

Data Discovery and Exploration: Detecting anomalies in very large data, such as data cubes and data warehouses, is very important for detecting problem areas or new opportunities. Most existing work (e.g., [12, 13, 14]) focuses on traditional multidimensional databases where full access to the database is a must. Other work (e.g., [15]) discusses detecting meaningful changes over hierarchically structured data, such as nested object data, while [17] presents a single OLAP operator to summarize reasons for drops or increases observed at an aggregated level. Detecting changes in large-scale data using sampling has also been discussed (e.g., [18]).

VI Conclusion

In this paper, we have developed a novel technique for tracking and discovering exceptions over dynamic hidden web databases, which change over time and are accessible only through restrictive web search interfaces. Our technique consists of two main phases: (1) generating a query-pool that contains all query candidates, and (2) finding all exceptions (in terms of sudden changes of aggregates) from this query-pool. We developed a stratified sampling technique, along with query reissuing, to guide the process of finding exceptions. We presented a comprehensive set of experiments that demonstrate the superiority of our approach over baseline solutions on real-world datasets.

References

  • [1] Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of VLDB ’94, pp. 478-499. Morgan Kaufmann.
  • [2] Dasgupta, A., Das, G. and Mannila, H. (2007). A random walk approach to sampling hidden databases. In Proceedings of SIGMOD 2007, pp. 629-640. ACM.
  • [3] Liu, W., Thirumuruganathan, S., Zhang, N. and Das, G. (2014). Aggregate estimation over dynamic hidden web databases. PVLDB, 7, 1107-1118.
  • [4] Chaudhuri, S., Das, G. and Narasayya, V. R. (2007). Optimized stratified sampling for approximate query processing. ACM Trans. Database Syst., 32, 9.
  • [5] Zhang, Z., He, B. and Chang, K. C.-C. (2004). Understanding web query interfaces: Best-effort parsing with hidden syntax. In Proceedings of SIGMOD 2004, pp. 107-118. ACM.
  • [6] Kabisch, T., Dragut, E. C., Yu, C. T. and Leser, U. (2009). A hierarchical approach to model web query interfaces for web source integration. PVLDB, 2, 325-336.
  • [7] Raghavan, S. and Garcia-Molina, H. (2001). Crawling the hidden web. In Proceedings of VLDB 2001, pp. 129-138.
  • [8] Dasgupta, A., Jin, X., Jewell, B., Zhang, N. and Das, G. (2010). Unbiased estimation of size and other aggregates over hidden web databases. In Proceedings of SIGMOD 2010, pp. 855-866. ACM.
  • [9] Chaudhuri, S., Das, G. and Narasayya, V. R. (2001). A robust, optimization-based approach for approximate answering of aggregate queries. In Proceedings of SIGMOD 2001, pp. 295-306. ACM.
  • [10] Hellerstein, J. M., Haas, P. J. and Wang, H. J. (1997). Online aggregation. In Proceedings of SIGMOD 1997, pp. 171-182. ACM Press.
  • [11] Vitter, J. S. and Wang, M. (1999). Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proceedings of SIGMOD 1999, pp. 193-204. ACM Press.
  • [12] Sarawagi, S., Agrawal, R. and Megiddo, N. (1998). Discovery-driven exploration of OLAP data cubes. In Proceedings of EDBT 1998, pp. 168-182. Springer.
  • [13] Agarwal, D., Barman, D., Gunopulos, D., Young, N. E., Korn, F. and Srivastava, D. (2007). Efficient and effective explanation of change in hierarchical summaries. In Proceedings of KDD 2007, pp. 6-15. ACM.
  • [14] Ramaswamy, S., Rastogi, R. and Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proceedings of SIGMOD 2000, pp. 427-438. ACM.
  • [15] Chawathe, S. S. and Garcia-Molina, H. (1997). Meaningful change detection in structured data. In Proceedings of SIGMOD 1997, pp. 26-37. ACM.
  • [16] Acharya, S., Gibbons, P. B., Poosala, V. and Ramaswamy, S. (1999). Join synopses for approximate query answering. In Proceedings of SIGMOD 1999, pp. 275-286. ACM Press.
  • [17] Sarawagi, S. (1999). Explaining differences in multidimensional aggregates. In Proceedings of VLDB 1999, pp. 42-53. Morgan Kaufmann.
  • [18] Cho, J. and Ntoulas, A. (2002). Effective change detection using sampling. In Proceedings of VLDB 2002, pp. 514-525.